Generating de novo reference genome assemblies for non-model organisms is a laborious task that often requires large amounts of data from several sequencing platforms as well as cytogenetic surveys. Using PacBio sequence data and new library creation techniques, we present a high-quality de novo reference assembly for the goat (Capra hircus) that demonstrates a primarily sequencing-based approach to efficiently creating new reference assemblies for eukaryotic species. This goat reference genome was assembled from 38 million PacBio P5-C3 reads, generated from a San Clemente goat, using the Celera Assembler PBcR pipeline with PacBio read self-correction. Corrected and filtered reads were pre-assembled into a consensus model using PBDAGCON and subsequently assembled using Celera Assembler version 8.2. This method produced 5,902 contigs with a contig N50 of 2.56 megabases. To generate chromosome-sized scaffolds, we used the LACHESIS scaffolding method to identify cis-chromosome Hi-C interactions and link contigs together. We then compared our new assembly to the existing goat reference assembly to identify large-scale discrepancies. This comparison identified 247 disagreements between the two assemblies, consisting of 123 inversions and 124 chromosome-contig relocations. The high quality of these data illustrates how this methodology can be used to efficiently generate new reference genome assemblies without the use of expensive fluorescent cytometry or large quantities of data from multiple sequencing platforms.
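The contig N50 reported above is a length-weighted median: the contig length at which the contigs of that length or longer together cover at least half of the total assembly. The following is a minimal sketch of that calculation, not the code used by the authors; the function name `n50` and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units), not real goat data.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

For the toy lengths the total is 32, and the two longest contigs (8 + 8 = 16) already reach half of it, so the N50 is 8. Applied to a real assembly, the input would be the list of contig lengths read from the FASTA file.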
The goat (Capra hircus) remains an important livestock species due to the species’ ability to forage and provide milk, meat and wool in arid environments. The current goat reference assembly and annotation borrows heavily from other loosely related livestock species, such as cattle, and may not reflect the unique structural and functional characteristics of the species. We present preliminary data from a new de novo reference assembly for goat that primarily utilizes 38 million PacBio P5-C3 reads generated from an inbred San Clemente goat. This assembly consists of only 5,902 contigs with a contig N50 size of 2.56 megabases which were grouped into scaffolds using cis-chromosome associations generated by the analysis of Hi-C sequence reads. To provide accurate functional genetic annotation, we utilized existing RNA-seq data and generated new data consisting of over 784 million reads from a combination of 27 different developmental timepoints/tissues. This dataset provides a tangible improvement over existing goat genomics resources by correcting over 247 misassemblies in the current goat reference genome and by annotating predicted gene models with actual expressed transcript data. Our goal is to provide a high quality resource to researchers to enable future genomic selection and functional prediction within the field of goat genomics.
Many applications of high throughput sequencing rely on the availability of an accurate reference genome. Errors in the reference genome assembly increase the number of false-positives in downstream analyses. Recently, we have shown that over 33% of the current pig reference genome, Sscrofa10.2, is either misassembled or otherwise unreliable for genomic analyses. Additionally, ~10% of the bases in the assembly are Ns in gaps of an arbitrary size. Thousands of highly fragmented contigs remain unplaced and many genes are known to be missing from the assembly. Here we present a new assembly of the pig genome, Sscrofa11, assembled using 65X PacBio sequencing from T.J. Tabasco, the same Duroc sow used in the assembly of Sscrofa10.2. The PacBio reads were assembled using the Falcon assembly pipeline resulting in 3,206 contigs with an initial contig N50 of 14.5Mb. We used Sscrofa10.2 as a template to scaffold the PacBio contigs, under the assumption that its gross structure is correct, and used PBJelly to fill gaps. Additional gaps were filled using large, sequenced BACs from the original assembly. Following gap filling, the assembly has substantially improved contiguity and contains more sequence than the Sscrofa10.2 assembly. Arrow and Pilon were used to polish the assembly. The contig N50 is now 58.5Mb with 103 gaps remaining. By comparing regions of the two assemblies we show that regions with structural abnormalities we identified in Sscrofa10.2 are resolved in the new PacBio assembly.
Determining compositions and functional capabilities of complex populations is often challenging, especially for sequencing technologies with short reads that do not uniquely identify organisms or genes. Long-read sequencing improves the resolution of these mixed communities, but adoption for this application has been limited due to concerns about throughput, cost and accuracy. The recently introduced PacBio Sequel System generates hundreds of thousands of long and highly accurate single-molecule reads per SMRT Cell. We investigated how the Sequel System might increase understanding of metagenomic communities. In the past, focus was largely on taxonomic classification with 16S rRNA sequencing. Recent expansion to WGS sequencing enables functional profiling as well, with the ultimate goal of complete genome assemblies. Here we compare the complex microbiomes in 5 cow rumen samples, for which Illumina WGS sequence data was also available. To maximize the PacBio single-molecule sequence accuracy, libraries of 2 to 3 kb were generated, allowing many polymerase passes per molecule. The resulting reads were filtered at predicted single-molecule accuracy levels up to 99.99%. Community compositions of the 5 samples were compared with Illumina WGS assemblies from the same set of samples, indicating rare organisms were often missed with Illumina. Assembly from PacBio CCS reads yielded a contig >100 kb in length with 6-fold coverage. Mapping of Illumina reads to the 101 kb contig verified the PacBio assembly and contig sequence. Scaffolding with reads from a PacBio unsheared library produced a complete genome of 2.4 Mb. These results illustrate ways in which long accurate reads benefit analysis of complex communities.
Richard Kuo from the Roslin Institute gave this PAG 2017 talk about using the PacBio Iso-Seq data to generate genome annotations that outperform current gold-standard annotations. Included: findings from a…
We present high quality, phased genome assemblies representative of taurine and indicine cattle, subspecies that differ markedly in productivity-related traits and environmental adaptation. We report a new haplotype-aware scaffolding and polishing pipeline using contigs generated by the trio binning method to produce haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle breeds. These assemblies were used to identify structural and copy number variants that differentiate the subspecies and we found variant detection was sensitive to the specific reference genome chosen. Six gene families with immune related functions are expanded in the indicine lineage. Assembly of the genomes of both subspecies from a single individual enabled transcripts to be phased to detect allele-specific expression, and to study genome-wide selective sweeps. An indicus-specific extra copy of fatty acid desaturase is under positive selection and may contribute to indicine adaptation to heat and drought.
Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods. Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs. Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates. Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
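The read-classification idea behind trio binning can be sketched in a few lines. This is a deliberately simplified illustration, not the published implementation: it ignores reverse complements, sequencing errors, k-mer count thresholds and normalization by k-mer set size, and all function names and toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def classify_read(read, mat_only, pat_only, k):
    """Assign an offspring read to the parent whose unique k-mers it shares most."""
    read_kmers = kmers(read, k)
    mat_hits = len(read_kmers & mat_only)
    pat_hits = len(read_kmers & pat_only)
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parental "short reads" differing at one base (hypothetical data).
K = 4
mat_only, pat_only = haplotype_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], K)

bins = {
    read: classify_read(read, mat_only, pat_only, K)
    for read in ["CGTACGT", "GTTCGTA", "ACGT"]
}
```

In the toy example the first read carries only maternal-specific k-mers, the second only paternal-specific ones, and the third contains no haplotype-specific k-mer and so stays unassigned. The more the parental genomes differ, the more reads carry informative k-mers, which is why high heterozygosity helps this approach.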
The domestic pig (Sus scrofa) is important both as a food source and as a biomedical model with high anatomical and immunological similarity to humans. The draft reference genome (Sscrofa10.2) represented a purebred female pig from a commercial pork production breed (Duroc), and was established using older clone-based sequencing methods. The Sscrofa10.2 assembly was incomplete, and unresolved redundancies, short-range order and orientation errors, and associated misassembled genes limited its utility. We present two highly contiguous chromosome-level genome assemblies created with more recent long-read technologies and a whole genome shotgun strategy, one for the same Duroc female (Sscrofa11.1) and one for an outbred, composite breed male animal commonly used for commercial pork production (USMARCv1.0). Both assemblies are of substantially higher (>90-fold) continuity and accuracy compared to the earlier reference, and the availability of two independent assemblies provided an opportunity to identify large-scale variants and to error-check the accuracy of representation of the genome. We propose that the improved Duroc breed assembly (Sscrofa11.1) become the reference genome for genomic research in pigs.
The pig is a well-studied model animal of biomedical and agricultural importance. Genes of this species, Sus scrofa, are known from experiments and predictions, and are collected in the NCBI Reference Sequence database. Gene reconstruction from RNA-seq evidence of transcribed genes can now accurately and completely reproduce the biological gene sets of animals and plants. Such a gene set for the pig is reported here, including human orthologs missing from current NCBI and Ensembl reference pig gene sets, additional alternate transcripts, and other improvements. The methodology used for accurate and complete gene set reconstruction from RNA is the automated SRA2Genes pipeline of the EvidentialGene project.
The antibody repertoire of Bos taurus is characterized by a subset of variable heavy (VH) chain regions with ultralong third complementarity determining regions (CDR3) which, compared to other species, can provide a potent response to challenging antigens like HIV env. These unusual CDR3 can range to over seventy highly diverse amino acids in length and form unique β-ribbon ‘stalk’ and disulfide bonded ‘knob’ structures, far from the typical antigen binding site. The genetic components and processes for forming these unusual cattle antibody VH CDR3 are not well understood. Here we analyze sequences of Bos taurus antibody VH domains and find that the subset with ultralong CDR3 exclusively uses a single variable gene, IGHV1-7 (VHBUL) rearranged to the longest diversity gene, IGHD8-2. An eight nucleotide duplication at the 3′ end of IGHV1-7 encodes a longer V-region producing an extended F β-strand that contributes to the stalk in a rearranged CDR3. A low amino acid variability was observed in CDR1 and CDR2, suggesting that antigen binding for this subset most likely only depends on the CDR3. Importantly, a novel, potentially AID mediated, deletional diversification mechanism of the B. taurus VH ultralong CDR3 knob was discovered, in which interior codons of the IGHD8-2 region are removed while maintaining integral structural components of the knob and descending strand of the stalk in place. These deletions serve to further diversify cysteine positions, and thus disulfide bonded loops. Hence, both germline and somatic genetic factors and processes appear to be involved in diversification of this structurally unusual cattle VH ultralong CDR3 repertoire.
Short communication: Identification of the pseudoautosomal region in the Hereford bovine reference genome assembly ARS-UCD1.2.
In cattle, the X chromosome accounts for approximately 3 and 6% of the genome in bulls and cows, respectively. In spite of the large size of this chromosome, very few studies report analysis of the X chromosome in genome-wide association studies and genomic selection. This lack of genetic interrogation is likely due to the complexities of undertaking these studies given the hemizygous state of some, but not all, of the X chromosome in males. The first step in facilitating analysis of this gene-rich chromosome is to accurately identify coordinates for the pseudoautosomal boundary (PAB) to split the chromosome into a region that may be treated as autosomal sequence (pseudoautosomal region) and a region that requires more complex statistical models. With the recent release of ARS-UCD1.2, a more complete and accurate assembly of the cattle genome than was previously available, it is timely to fine map the PAB for the first time. Here we report the use of SNP chip genotypes, short-read sequences, and long-read sequences to fine map the PAB (X chromosome:133,300,518) and simultaneously determine the neighboring regions of reduced homology and true pseudoautosomal region. These results greatly facilitate the inclusion of the X chromosome in genome-wide association studies, genomic selection, and other genetic analysis undertaken on this reference genome. The Authors. Published by FASS Inc. and Elsevier Inc. on behalf of the American Dairy Science Association®. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
The world demand for animal-based food products is anticipated to increase by 70% by 2050. Meeting this demand in a way that has a minimal impact on the environment will require the implementation of advanced technologies, and methods to improve the genetic quality of livestock are expected to play a large part. Over the past 10 years, genomic selection has been introduced in several major livestock species and has more than doubled genetic progress in some. However, additional improvements are required. Genomic information of increasing complexity (including genomic, epigenomic, transcriptomic and microbiome data), combined with technological advances for its cost-effective collection and use, will make a major contribution.
A Newly Isolated Bacillus subtilis Strain Named WS-1 Inhibited Diarrhea and Death Caused by Pathogenic Escherichia coli in Newborn Piglets.
Bacillus subtilis is recognized as a safe and reliable human and animal probiotic and is associated with bioactivities such as vitamin production and immune stimulation. Additionally, it has great potential to be used as an alternative to antimicrobial drugs, which is significant in the context of antibiotic abuse in food animal production. In this study, we isolated one strain of B. subtilis, named WS-1, from apparently healthy pigs growing with sick cohorts on one Escherichia coli endemic commercial pig farm in Guangdong, China. WS-1 can strongly inhibit the growth of pathogenic E. coli in vitro. The B. subtilis strain WS-1 showed typical Bacillus characteristics by endospore staining, biochemical test, enzyme activity analysis, and 16S rRNA sequence analysis. Genomic analysis showed that the B. subtilis strain WS-1 shares 100% genomic synteny with B. subtilis with a size of 4,088,167 bp. Importantly, inoculation of newborn piglets with 1.5 × 10^10 CFU of B. subtilis strain WS-1 by oral feeding was able to clearly inhibit diarrhea (p < 0.05) and death (p < 0.05) caused by pathogenic E. coli in piglets. Furthermore, histopathological results showed that the WS-1 strain could protect the small intestine from lesions caused by E. coli infection. Collectively, these findings suggest that the probiotic B. subtilis strain WS-1 acts as a potential biocontrol agent protecting pigs from pathogenic E. coli infection. Importance: In this work, one B. subtilis strain (WS-1) was successfully isolated from apparently healthy pigs growing with sick cohorts on one E. coli endemic commercial pig farm in Guangdong, China. The B. subtilis strain WS-1 was identified to inhibit the growth of pathogenic E. coli both in vitro and in vivo, indicating its potential application in protecting newborn piglets from diarrhea caused by E. coli infections.
The isolation and characterization will help better understand this bacterium, and the strain WS-1 can be further explored as an alternative to antimicrobial drugs to protect human and animal health.
Whole genome sequence and de novo assembly revealed genomic architecture of Indian Mithun (Bos frontalis).
Mithun (Bos frontalis), also called gayal, is an endangered bovine species of the tribe Bovini with a 2n = 58, XX chromosome complement, reared in the tropical rain forest regions of India, China, Myanmar, Bhutan and Bangladesh. However, the origin of this species is still disputed and information on its genomic architecture is scanty so far. We expect that the availability of its whole genome sequence data and assembly will help resolve this question and generate much additional information, including the phylogenetic status of mithun. Recently, the first genome assembly of gayal, mithun of Chinese origin, was published. However, an improved reference genome assembly would still help in understanding genetic variation in mithun populations reared in diverse geographical locations and in building a superior consensus assembly. We therefore performed deep sequencing of the genome of an adult female mithun from India, assembled and annotated its genome, and performed extensive bioinformatic analyses to produce a superior de novo genome assembly of mithun. We generated ~300 Gigabyte (Gb) of raw reads from whole-genome deep sequencing platforms and assembled the sequence data using a hybrid assembly strategy to create a high quality de novo assembly of mithun with 96% recovered as per BUSCO analysis. The final genome assembly has a total length of 3.0 Gb and contains 5,015 scaffolds with an N50 value of 1 Mb. Repeat sequences constitute around 43.66% of the assembly. The genomic alignments between mithun and cattle showed that their genomes, as expected, are highly conserved. Gene annotation identified 28,044 protein-coding genes present in the mithun genome.
The gene orthologous groups of mithun showed a high degree of similarity in comparison with other species, while fewer mithun specific coding sequences were found compared to those in cattle. Here we present the first de novo draft genome assembly of Indian mithun, which has better coverage, is less fragmented and better annotated, and constitutes a reasonably complete assembly compared to the previously published gayal genome. This comprehensive assembly reveals the genomic architecture of mithun to a great extent and will provide a reference genome assembly for the research community to elucidate the evolutionary history of mithun across its distinct geographical locations.
Our understanding of the pig transcriptome is limited. RNA transcript diversity among nine tissues was assessed using poly(A) selected single-molecule long-read isoform sequencing (Iso-seq) and Illumina RNA sequencing (RNA-seq) from a single White cross-bred pig. Across tissues, a total of 67,746 unique transcripts were observed, including 60.5% predicted protein-coding, 36.2% long non-coding RNA and 3.3% nonsense-mediated decay transcripts. On average, 90% of the splice junctions were supported by RNA-seq within tissue. A large proportion (80%) represented novel transcripts, mostly produced by known protein-coding genes (70%), while 17% corresponded to novel genes. On average, four transcripts per known gene (tpg) were identified; an increase over current EBI (1.9 tpg) and NCBI (2.9 tpg) annotations and closer to the number reported in human genome (4.2 tpg). Our new pig genome annotation extended more than 6000 known gene borders (5′ end extension, 3′ end extension, or both) compared to EBI or NCBI annotations. We validated a large proportion of these extensions by independent pig poly(A) selected 3′-RNA-seq data, or human FANTOM5 Cap Analysis of Gene Expression data. Further, we detected 10,465 novel genes (81% non-coding) not reported in current pig genome annotations. More than 80% of these novel genes had transcripts detected in > 1 tissue. In addition, more than 80% of novel intergenic genes with at least one transcript detected in liver tissue had H3K4me3 or H3K36me3 peaks mapping to their promoter and gene body, respectively, in independent liver chromatin immunoprecipitation data. These validated results show significant improvement over current pig genome annotations.