Generating de novo reference genome assemblies for non-model organisms is a laborious task that often requires a large amount of data from several sequencing platforms and cytogenetic surveys. By using PacBio sequence data and new library creation techniques, we present a de novo, high-quality reference assembly for the goat (Capra hircus) that demonstrates a primarily sequencing-based approach to efficiently create new reference assemblies for eukaryotic species. This goat reference genome was created from 38 million PacBio P5-C3 reads, generated from a San Clemente goat, using the Celera Assembler PBcR pipeline with PacBio read self-correction. To generate the assembly, corrected and filtered reads were pre-assembled into a consensus model using PBDAGCON and subsequently assembled using Celera Assembler version 8.2. This method produced 5,902 contigs with a contig N50 size of 2.56 megabases. To generate chromosome-sized scaffolds, we used the LACHESIS scaffolding method to identify cis-chromosome Hi-C interactions and link contigs together. We then compared our new assembly to the existing goat reference assembly to identify large-scale discrepancies. In our comparison, we identified 247 disagreements between the two assemblies, consisting of 123 inversions and 124 chromosome-contig relocations. The high quality of these data illustrates how this methodology can be used to efficiently generate new reference genome assemblies without the use of expensive fluorescent cytometry or large quantities of data from multiple sequencing platforms.
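The contig N50 reported in the abstract above is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation, assuming the contig lengths are already available as a list of integers (the function name and the toy lengths are hypothetical and not taken from the study):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy example (illustrative lengths only, not data from the goat assembly).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

In the toy example the total is 20 bases, and the two longest contigs (6 and 5) are the first to cover at least 10 of them, so the N50 is 5.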
Reference genome assemblies provide important context in genetics by standardizing the order of genes and providing a universal set of coordinates for individual nucleotides. Because of the high complexity of genic regions and the higher copy number of genes involved in immune function, immunity-related genes are often misassembled in current reference assemblies. This problem is particularly common in the reference genomes of non-model organisms, as they often do not receive the years of curation necessary to resolve annotation and assembly errors. In this study, we reassemble a reference genome of the goat (Capra hircus) using modern PacBio technology in tandem with BioNano Genomics Irys optical maps and Lachesis clustering in order to provide a high-quality reference assembly without the need for extensive filtering. Initial PacBio assemblies using P5C4 chemistry achieved contig N50s of 4 megabases and a BUSCO completion score of 84.0%, which is comparable to several finished model organism reference assemblies. We used BioNano Genomics’ Irys platform to generate 336 scaffolds from these data with a scaffold N50 of 24 megabases and total genome coverage of 98%. Lachesis interaction maps were used with a clustering algorithm to associate Irys scaffolds into the expected 30 chromosome physical maps. Comparisons of the initial hybrid scaffolds, generated from the long-read contigs and optical map information, to a previously generated RH map revealed that the entirety of the goat autosome 20 physical map was contained within one scaffold. Additionally, the BioNano scaffolding resolved several difficult regions containing genes related to innate immunity, which were problem regions in previous reference genome assemblies.
A high-quality genome assembly of SMRT Sequences reveals long-range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, chikungunya, and other diseases. The outbreak of Zika in the Americas, which can cause microcephaly in the fetus of infected women, adds urgency to the need for a high-quality reference genome in order to better understand the organism’s biology and its role in transmitting human disease. We describe the first diploid assembly of an insect genome, using SMRT sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
A high-quality genome assembly of SMRT sequences reveals long range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, and chikungunya. We describe the first diploid assembly of an insect genome, using SMRT Sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30 after polishing). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence with dramatic differences in coding sequence content, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
Animals in the phylum Hemichordata have provided key understanding of the origins and development of body patterning and nervous system organization. However, efforts to sequence and assemble the genomes of highly heterozygous non-model organisms have proven to be difficult with traditional short-read approaches. Long repetitive DNA structures, extensive structural variation between haplotypes in polyploid species, and large genome sizes are limiting factors in achieving highly contiguous genome assemblies. Here we present the highly contiguous de novo assembly and preliminary annotation of an indirect-developing hemichordate genome, Schizocardium californicum, using SMRT Sequencing long reads.
A high-quality reference genome is an essential resource for primary and applied research across the tree of life. Genome projects for small-bodied, non-model organisms such as insects face several unique challenges, including limited DNA input quantities, high heterozygosity, and difficulty of culturing or inbreeding in the lab. Recent progress in PacBio library preparation protocols, sequencing throughput, and read accuracy addresses these challenges. We present several case studies, including the Red Admiral (Vanessa atalanta), Monarch Butterfly (Danaus plexippus), and Anopheles malaria mosquitoes, that highlight the benefits of sequencing single individuals for de novo genome assembly projects and the ease with which these projects can be conducted by individual research labs. Sampled individuals may originate from lab colonies of interest to the research community or be sourced from the wild to better capture natural variation in a focal population. Where genomic DNA quantities are limited, the PacBio Low DNA Input Protocol requires ~100 ng of input DNA. Low DNA input samples with a genome size of 500 Mb or less can be multiplexed on a single SMRT Cell 8M on the Sequel II System. For samples with more abundant DNA, size-selected libraries may be constructed to maximize sequencing yield. Both low DNA input and size-selected libraries can be used to generate HiFi reads, whose quality is Q20 or above (1% error or less) and whose lengths range from 10–25 kb. With HiFi reads, de novo assembly computation is greatly simplified relative to other long-read methods because of smaller sequence file sizes and more rapid analysis, resulting in highly accurate, contiguous, complete, and haplotype-resolved assemblies.
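The equivalence between Q20 and a 1% error rate stated above follows from the standard Phred scale, Q = -10 * log10(p), where p is the error probability. A small sketch of the conversion in both directions (the function names are illustrative and do not belong to any PacBio tool):

```python
import math


def phred_to_error(q):
    """Convert a Phred quality value Q to an error probability p."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability p (0 < p <= 1) to a Phred quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Q20 corresponds to about 1 error in 100 bases, Q30 to about 1 in 1,000.
for q in (20, 30):
    print(f"Q{q}: error probability ~ {phred_to_error(q):.4f}")
```

On this scale each additional 10 quality units reduces the error probability tenfold.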
In this webinar, Emily Hatas of PacBio shares information about the applications and benefits of SMRT Sequencing in plant and animal biology, agriculture, and industrial research fields. This session contains…
RNA-Seq de novo assembly is an important method for generating transcriptomes of non-model organisms before any downstream analysis. Although many de novo assembly methods have been developed, there is not yet a consensus on how to evaluate them, so establishing a benchmark for assessing the quality of de novo assemblies is critical. Addressing this challenge will deepen our insight into the properties of different de novo assemblers and their evaluation methods, and will provide guidance on choosing the best assembly sets as transcriptomes of non-model organisms for further functional analysis. In this article, we generate a “real time” transcriptome using PacBio long reads as a benchmark for evaluating five de novo assemblers and two model-based de novo assembly evaluation methods. By comparing the de novo assemblies generated from RNA-Seq short reads with the “real time” transcriptome from the same biological sample, we find that Trinity is best at completeness, generating more assemblies than the alternative assemblers, but is less continuous and has more misassemblies; Oases is best at continuity and specificity, but is less complete; and the performance of SOAPdenovo-Trans, Trans-ABySS, and IDBA-Tran falls in between. For evaluation methods, DETONATE leverages multiple aspects of the assembly set and ranks the assembly set with an average performance as the best, while the contig score can serve as a good metric for selecting assemblies with high completeness, specificity, and continuity, but is not sensitive to misassemblies; the TransRate contig score is useful for removing misassemblies, yet the assemblies in the optimal set are often too few to be used as a transcriptome.
The emergence of third-generation sequencing (3GS; long reads) is bringing closer the goal of chromosome-size fragments in de novo genome assemblies. This allows the exploration of new and broader questions on genome evolution for a number of non-model organisms. However, long-read technologies have higher sequencing error rates and therefore impose an elevated cost to obtain the coverage needed for sufficiently high quality. In this context, hybrid assemblies, which combine short and long reads, provide an efficient and cost-effective alternative approach to generating de novo, chromosome-level genome assemblies. The array of available software programs for hybrid genome assembly, sequence correction and manipulation is constantly being expanded and improved. This makes it difficult for non-experts to find efficient, fast and tractable computational solutions for genome assembly, especially in the case of non-model organisms that lack a reference genome of their own or of a closely related species. In this study, we review and test the most recent pipelines for hybrid assemblies, comparing the model organism Drosophila melanogaster to a non-model cactophilic Drosophila, D. mojavensis. We show that it is possible to achieve excellent contiguity on this non-model organism using the DBG2OLC pipeline.
Suppressed recombination allows divergence between homologous sex chromosomes and the functionality of their genes. Here, we reveal patterns of the earliest stages of sex-chromosome evolution in the diploid dioecious herb Mercurialis annua on the basis of cytological analysis, de novo genome assembly and annotation, genetic mapping, exome resequencing of natural populations, and transcriptome analysis. The genome assembly contained 34,105 expressed genes, of which 10,076 were assigned to linkage groups. Genetic mapping and exome resequencing of individuals across the species range both identified the largest linkage group, LG1, as the sex chromosome. Although the sex chromosomes of M. annua are karyotypically homomorphic, we estimate that about a third of the Y chromosome has ceased recombining, containing 568 transcripts and spanning 22.3 cM in the corresponding female map. Nevertheless, we found limited evidence for Y-chromosome degeneration in terms of gene loss and pseudogenization, and most X- and Y-linked genes appear to have diverged in the period subsequent to speciation between M. annua and its sister species M. huetii, which shares the same sex-determining region. Taken together, our results suggest that the M. annua Y chromosome has at least two evolutionary strata: a small old stratum shared with M. huetii, and a more recent larger stratum that is probably unique to M. annua and that stopped recombining about one million years ago. Patterns of gene expression within the non-recombining region are consistent with the idea that sexually antagonistic selection may have played a role in favoring suppressed recombination. Copyright © 2019, Genetics.
Divergent evolution in the genomes of closely related lacertids, Lacerta viridis and L. bilineata, and implications for speciation.
Lacerta viridis and Lacerta bilineata are sister species of European green lizards (eastern and western clades, respectively) that, until recently, were grouped together as the L. viridis complex. Genetic incompatibilities were observed between lacertid populations through crossing experiments, which led to the delineation of two separate species within the L. viridis complex. The population history of these sister species and processes driving divergence are unknown. We constructed the first high-quality de novo genome assemblies for both L. viridis and L. bilineata through Illumina and PacBio sequencing, with annotation support provided from transcriptome sequencing of several tissues. To estimate gene flow between the two species and identify factors involved in reproductive isolation, we studied their evolutionary history, identified genomic rearrangements, and detected signatures of selection on non-coding RNA and on protein-coding genes. Here we show that gene flow was primarily unidirectional from L. bilineata to L. viridis after their split at least 1.15 million years ago. We detected positive selection of the non-coding repertoire; mutations in transcription factors; accumulation of divergence through inversions; selection on genes involved in neural development, reproduction, and behavior, as well as in ultraviolet-response, possibly driven by sexual selection, whose contribution to reproductive isolation between these lacertid species needs to be further evaluated. The combination of short and long sequence reads resulted in one of the most complete lizard genome assemblies. The characterization of a diverse array of genomic features provided valuable insights into the demographic history of divergence among European green lizards, as well as key species differences, some of which are candidates that could have played a role in speciation. In addition, our study generated valuable genomic resources that can be used to address conservation-related issues in lacertids. © The Author(s) 2018. Published by Oxford University Press.
High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution.
Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate. In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed Escherichia coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains. There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data.
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms. © The Author 2017. Published by Oxford University Press.
Analysis of differential gene expression and alternative splicing is significantly influenced by choice of reference genome.
RNA-seq analysis has enabled the evaluation of transcriptional changes in many species including nonmodel organisms. However, in most species only a single reference genome is available and RNA-seq reads from highly divergent varieties are typically aligned to this reference. Here, we quantify the impacts of the choice of mapping genome in rice where three high-quality reference genomes are available. We aligned RNA-seq data from a popular productive rice variety to three different reference genomes and found that the identification of differentially expressed genes differed depending on which reference genome was used for mapping. Furthermore, the ability to detect differentially used transcript isoforms was profoundly affected by the choice of reference genome: Only 30% of the differentially used splicing features were detected when reads were mapped to the more commonly used, but more distantly related reference genome. This demonstrated that gene expression and splicing analysis varies considerably depending on the mapping reference genome, and that analysis of individuals that are distantly related to an available reference genome may be improved by acquisition of new genomic reference material. We observed that these differences in transcriptome analysis are, in part, due to the presence of single nucleotide polymorphisms between the sequenced individual and each respective reference genome, as well as annotation differences between the reference genomes that exist even between syntenic orthologs. We conclude that even between two closely related genomes of similar quality, using the reference genome that is most closely related to the species being sampled significantly improves transcriptome analysis. © 2019 Slabaugh et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Microsatellite marker set for genetic diversity assessment of primitive Chitala chitala (Hamilton, 1822) derived through SMRT sequencing technology.
In the present study, single-molecule real-time sequencing technology was used to obtain a validated set of microsatellite markers for application in population genetics of the primitive fish Chitala chitala. Assembly of circular consensus sequencing reads resulted in 1164 sequences, which contained 2005 repetitive motifs. A total of 100 sequences were used for primer design, and amplification yielded a set of 28 validated polymorphic markers. These loci were used to genotype n = 72 samples from three distant riverine populations of India, namely Son, Satluj and Brahmaputra, to determine intraspecific genetic variation. The microsatellite loci exhibited a high level of polymorphism, with PIC values ranging from 0.281 to 0.901. The genetic parameters revealed that mean heterozygosity ranged from 0.6802 to 0.6826 and that the populations were genetically diverse (Fst 0.03-0.06). This indicates the potential application of this microsatellite marker set for stock characterization of C. chitala in the wild. These newly developed loci were assayed for cross transferability in another notopterid fish, Notopterus notopterus.
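The polymorphism statistics reported above can be illustrated with a short sketch. It assumes the commonly used definitions, which the abstract does not spell out: expected heterozygosity He = 1 - sum(p_i^2), and polymorphic information content in the form of Botstein et al. (1980), PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2, where p_i are the allele frequencies at a locus. The function names and example frequencies are hypothetical and are not data from the study.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphic information content (Botstein et al. 1980 form, assumed here)."""
    homozygosity = sum(p * p for p in freqs)
    # Cross term over all unordered pairs of distinct alleles.
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross


# Hypothetical locus with two equally frequent alleles.
example_freqs = [0.5, 0.5]
example_he = expected_heterozygosity(example_freqs)   # 0.5
example_pic = pic(example_freqs)                      # 0.375
```

Under these definitions PIC is never larger than the expected heterozygosity, and both increase as a locus carries more alleles at more even frequencies.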