June 1, 2021  |  

Resources for advanced bioinformaticians working in plant and animal genomes with SMRT Sequencing.

Significant advances in bioinformatics tool development have made it possible to generate high-quality genome assemblies from PacBio long-read data more efficiently. Current data throughput of SMRT Sequencing delivers average read lengths of 10-15 kb, with the longest reads exceeding 40 kb. This has consistently yielded at least a 10-fold improvement in genome assemblies, with contig N50 in the megabase range, compared to assemblies generated using only short-read technologies. This poster highlights recent advances and resources available for advanced bioinformaticians and developers interested in the current state-of-the-art large genome solutions available as open-source code from PacBio and third parties, including HGAP, MHAP, and ECTools. Resources and tools available on GitHub are reviewed, as well as datasets representing major model research organisms made publicly available for evaluation by the community and interested developers.


June 1, 2021  |  

Impact of DNA quality on PacBio RS II read lengths.

Maximizing the read length of next generation sequencing (NGS) facilitates de novo genome assembly. Currently, the PacBio RS II system leads the industry with respect to maximum possible NGS read lengths. Amplicon Express specializes in preparation of high molecular weight, NGS-grade genomic DNA for a variety of applications, including next generation sequencing. This study was performed to evaluate the effects of gDNA quality on PacBio RS II read length.


June 1, 2021  |  

Single Molecule, Real-Time sequencing of full-length cDNA transcripts uncovers novel alternatively spliced isoforms.

In higher eukaryotic organisms, the majority of multi-exon genes are alternatively spliced. Different mRNA isoforms from the same gene can produce proteins that have distinct properties such as structure, function, or subcellular localization. Thus, the importance of understanding the full complement of transcript isoforms with potential phenotypic impact cannot be overstated. While microarrays and other NGS-based methods have become useful for studying transcriptomes, these technologies yield short, fragmented transcripts that remain a challenge for accurate, complete reconstruction of splice variants. The Iso-Seq protocol developed at PacBio offers the only solution for direct sequencing of full-length, single-molecule cDNA sequences to survey transcriptome isoform diversity useful for gene discovery and annotation. Knowledge of the complete isoform repertoire is also key for accurate quantification of isoform abundance. As most transcripts range from 1 – 10 kb, fully intact RNA molecules can be sequenced using SMRT Sequencing (avg. read length: 10-15 kb) without requiring fragmentation or post-sequencing assembly. Our open-source computational pipeline delivers high-quality, non-redundant sequences for unambiguous identification of alternative splicing events, alternative transcriptional start sites, polyA tails, and gene fusion events. The standard Iso-Seq protocol workflow available for all researchers is presented using a deep dataset of full-length cDNA sequences from the MCF-7 cancer cell line and multiple tissues (brain, heart, and liver). Detected novel transcripts approaching 10 kb and alternative splicing events are highlighted. Even in extensively profiled samples, the method uncovered large numbers of novel alternatively spliced isoforms and previously unannotated genes.


June 1, 2021  |  

Targeted SMRT Sequencing and phasing using Roche NimbleGen’s SeqCap EZ enrichment

As a cost-effective alternative to whole genome human sequencing, targeted sequencing of specific regions, such as exomes or panels of relevant genes, has become increasingly common. These methods typically include direct PCR amplification of the genomic DNA of interest, or the capture of these targets via probe-based hybridization. Commonly, these approaches are designed to amplify or capture exonic regions and thereby result in amplicons or fragments that are a few hundred base pairs in length, a length that is well-addressed with short-read sequencing technologies. These approaches typically provide very good coverage and can identify SNPs in the targeted region, but are unable to haplotype these variants. Here we describe a targeted sequencing workflow that combines Roche NimbleGen’s SeqCap EZ enrichment technology with Pacific Biosciences’ SMRT Sequencing to provide a more comprehensive view of variants and haplotype information over multi-kilobase regions. While the SeqCap EZ technology is typically used to capture 200 bp fragments, we demonstrate that 6 kb fragments can also be utilized to enrich for long fragments that extend beyond the targeted capture site and well into (and often across) the flanking intronic regions. When combined with the long reads of SMRT Sequencing, multi-kilobase regions of the human genome can be phased and variants detected in exons, introns and intergenic regions.


June 1, 2021  |  

The resurgence of reference quality genome sequence.

Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly has dropped precipitously, which has spurred interest in genome sequencing overall. Unfortunately, both the contiguity and the accuracy of NGS assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove difficult to annotate properly. In some cases the contiguous portions are smaller than the average gene size, making the sequence far less useful to biologists than earlier reference quality genomes, including those of human, mouse, C. elegans, and Drosophila. Recently, new third-generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life, from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis of genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.


June 1, 2021  |  

Profiling metagenomic communities using circular consensus and Single Molecule, Real-Time Sequencing

There are many sequencing-based approaches to understanding complex metagenomic communities, spanning targeted amplification to whole-sample shotgun sequencing. While targeted approaches provide valuable data at low sequencing depth, they are limited by primer design and PCR amplification. Whole-sample shotgun experiments require a high depth of coverage. As such, rare community members may not be represented in the resulting assembly. Circular-consensus, Single Molecule, Real-Time (SMRT) Sequencing reads in the 1-2 kb range, with >99% consensus accuracy, can be efficiently generated from low amounts of input DNA; for example, as little as 10 ng of input DNA sequenced in 4 SMRT Cells can generate >100,000 such reads. While throughput is low compared to second-generation sequencing, the reads are a true random sampling of the underlying community. Long read lengths mean that a high number of reads harbor full genes or even full operons for downstream analysis. Here we present the results of circular-consensus sequencing on a mock metagenomic community with an abundance range of multiple orders of magnitude, and compare the results with both 16S and shotgun assembly methods. We show that even with relatively low sequencing depth, long-read, assembly-free random sampling makes it possible to extract meaningful information from the very low-abundance community members. For example, given the above low-input sequencing approach, a community member at 1/1,000 relative abundance would generate 100 1-2 kb sequence fragments having 99% consensus accuracy, with a high probability of containing a gene fragment useful for taxonomic classification or functional insight.
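The closing estimate in this abstract follows from simple proportional sampling, and a short sketch can make the arithmetic explicit. The function names are hypothetical, and the model assumes that reads are drawn uniformly at random in proportion to relative abundance, ignoring genome size differences and library bias.

```python
def expected_reads(total_reads: int, relative_abundance: float) -> float:
    """Expected read count for one community member under proportional sampling."""
    return total_reads * relative_abundance


def prob_at_least_one_read(total_reads: int, relative_abundance: float) -> float:
    """Probability that a member contributes at least one read.

    Assumes each read is an independent draw (an idealization).
    """
    return 1.0 - (1.0 - relative_abundance) ** total_reads


if __name__ == "__main__":
    # Figures quoted in the abstract: >100,000 reads, member at 1/1,000 abundance.
    print(expected_reads(100_000, 1 / 1_000))
    print(prob_at_least_one_read(100_000, 1 / 1_000))
```

Under the same assumptions, a member ten times rarer (1/10,000) would still be expected to contribute about 10 reads.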


June 1, 2021  |  

SMRT Sequencing of the ‘alala genome

Single Molecule, Real-Time (SMRT) Sequencing was used to generate long reads for whole genome shotgun sequencing of the genome of the ‘alala (Hawaiian crow). The ‘alala is endemic to Hawaii, and the only surviving lineage of the crow family, Corvidae, in the Hawaiian Islands. The population declined to fewer than 20 individuals in the 1990s, and today this charismatic species is extinct in the wild. The ‘alala currently exists in only two captive breeding facilities, and reintroduction is scheduled to begin in the fall of 2016. Reintroduction efforts will be assisted by information from the ‘alala genome generated and assembled using SMRT technology, which will allow detailed analysis of genes associated with immunity, behavior, and learning. Using SMRT Sequencing, we present here best practices for achieving long reads for whole genome shotgun sequencing of complex plant and animal genomes such as the ‘alala genome. With recent advances in SMRTbell library preparation, P6-C4 chemistry and 6-hour movies, the number of useable bases now exceeds 1 Gb per SMRT Cell. Read lengths averaging 10 – 15 kb can be routinely achieved, with the longest reads approaching 70 kb. Furthermore, > 25% of useable bases are in reads greater than 30 kb, which is advantageous for generating contiguous draft assemblies with contig N50 up to 5 Mb. De novo assemblies of large genomes are now more tractable using SMRT Sequencing as a standalone technology. We also present guidelines for planning projects for the de novo assembly of large genomes.
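Two of the summary statistics quoted in this abstract, contig N50 and the share of bases in reads longer than 30 kb, are simple to compute from a list of lengths. The following is a minimal sketch using the common definition of N50 (the length at which the longest sequences first cover at least half of all bases); the function names and toy lengths are illustrative and not taken from the poster.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fraction_of_bases_above(lengths, threshold):
    """Fraction of all bases that sit in sequences strictly longer than threshold."""
    total = sum(lengths)
    return sum(length for length in lengths if length > threshold) / total


if __name__ == "__main__":
    toy_reads = [10_000, 20_000, 40_000, 50_000]  # made-up read lengths in bp
    print(n50(toy_reads))
    print(fraction_of_bases_above(toy_reads, 30_000))
```

The same `n50` function applies to contig lengths from an assembly as well as to read lengths.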


June 1, 2021  |  

Cogent: Reconstructing the coding genome from full-length transcriptome sequences

For highly complex and large genomes, producing a well-annotated genome may be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually relies on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding the computational challenges that short-read transcript assembly brings. The use of Single Molecule, Real-Time sequencing from Pacific Biosciences to sequence transcriptomes (the Iso-Seq method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to a cuttlefish Iso-Seq dataset, for which there is a highly fragmented, Illumina-based draft genome assembly and little annotation. We show that Cogent successfully discovers gene families and can reconstruct the coding region of gene loci. The reconstructed contigs can then be used to visualize alternative splicing events, identify minor variants, and even improve genome assemblies.
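The first step described above, partitioning transcripts into gene families by k-mer similarity, can be illustrated with a toy sketch. This is not Cogent's implementation: the Jaccard similarity measure, the value of k, the threshold, and the single-linkage grouping via union-find are all assumptions chosen for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def partition_by_kmer_similarity(seqs, k=5, min_similarity=0.3):
    """Group sequences whose k-mer sets are similar (single linkage).

    seqs maps a transcript name to its sequence; k and min_similarity
    are illustrative defaults, not values taken from Cogent.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    for a, b in combinations(seqs, 2):
        if jaccard(profiles[a], profiles[b]) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


if __name__ == "__main__":
    toy = {
        "t1": "ACGTACGTGGCCTTAA",
        "t2": "ACGTACGTGGCCTTAAGG",  # t1 with two extra bases
        "t3": "TTTTTTTTTTTTTTTT",    # unrelated sequence
    }
    print(partition_by_kmer_similarity(toy))
```

The all-pairs comparison is quadratic in the number of transcripts, which is fine for a toy example but would need indexing or sketching for a real transcriptome.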


June 1, 2021  |  

Full-length cDNA sequencing for genome annotation and analysis of alternative splicing

In higher eukaryotic organisms, the majority of multi-exon genes are alternatively spliced. Different mRNA isoforms from the same gene can produce proteins that have distinct properties and functions. Thus, the importance of understanding the full complement of transcript isoforms with potential phenotypic impact cannot be overstated. While microarrays and other NGS-based methods have become useful for studying transcriptomes, these technologies yield short, fragmented transcripts that remain a challenge for accurate, complete reconstruction of splice variants. The Iso-Seq protocol developed at PacBio offers the only solution for direct sequencing of full-length, single-molecule cDNA sequences to survey transcriptome isoform diversity useful for gene discovery and annotation. Knowledge of the complete isoform repertoire is also key for accurate quantification of isoform abundance. As most transcripts range from 1 – 10 kb, fully intact RNA molecules can be sequenced using SMRT Sequencing without requiring fragmentation or post-sequencing assembly. Our open-source computational pipeline delivers high-quality, non-redundant sequences for unambiguous identification of alternative splicing events, alternative transcriptional start sites, polyA tails, and gene fusion events. We applied the Iso-Seq method to the maize (Zea mays) inbred line B73. Full-length cDNAs from six diverse tissues were barcoded and sequenced across multiple size-fractionated SMRTbell libraries. A total of 111,151 unique transcripts were identified. More than half of these transcripts (57%) represented novel, sometimes tissue-specific, isoforms of known genes. In addition to the 2,250 novel coding genes and 860 lncRNAs discovered, the Iso-Seq dataset corrected errors in existing gene models, highlighting the value of full-length transcripts for whole gene annotations.


June 1, 2021  |  

Targeted sequencing of genes from soybean using NimbleGen SeqCap EZ and PacBio SMRT Sequencing

Full-length gene capture solutions offer opportunities to screen and characterize structural variations and genetic diversity to understand key traits in plants and animals. Through a combined Roche NimbleGen probe capture and SMRT Sequencing strategy, we demonstrate the capability to resolve complex gene structures often observed in plant defense and developmental genes spanning multiple kilobases. The custom panel includes members of the WRKY plant-defense-signaling family, members of the NB-LRR disease-resistance family, and developmental genes important for flowering. The presence of repetitive structures and low-complexity regions makes short-read sequencing of these genes difficult, yet this approach allows researchers to obtain complete sequences for unambiguous resolution of gene models. This strategy has been applied to genomic DNA samples from soybean coupled with barcoding for multiplexing.


June 1, 2021  |  

Long-read assembly of the Aedes aegypti Aag2 cell line genome resolves ancient endogenous viral elements

Transmission of arboviruses such as dengue virus by Aedes aegypti causes debilitating disease across the globe. Disease in humans can include severe acute symptoms such as hemorrhagic fever and organ failure, but mosquitoes tolerate high titers of virus in a persistent infection. The mechanisms responsible for this viral tolerance are unclear. Recent publications highlighted the integration of genetic material from non-retroviral RNA viruses into the genome of the host during infection, a process that relies upon endogenous reverse transcriptase activity from transposons. These endogenous viral elements (EVEs) found in the genome are predicted to be ancient, and at least some EVEs are under purifying selection, suggesting they are beneficial to the host. To characterize EVE biogenesis in a tractable system, we sequenced the Ae. aegypti cell line, Aag2, to 58-fold coverage and present a de novo assembly of the genome. The assembly contains 1.7 Gb of genomic and 255 Mb of alternative haplotype-specific sequence, consisting of contigs with an N50 of 1.4 Mb, a value 1 to 3 orders of magnitude greater than that of other assemblies from the Aedes genus. The Aag2 genome is highly repetitive (70%), and most of the repeats are classified as transposable elements (60%). We identify EVEs in the genome homologous to a range of extant viruses, many of which cluster in these regions of repetitive DNA. The contiguous assembly allows for more comprehensive identification of the transposable elements and EVEs that are most likely to be lost in assemblies lacking the read length of SMRT Sequencing.


June 1, 2021  |  

An update on goat genomics

Goats are specialized in dairy, meat and fiber production, being adapted to a wide range of environmental conditions and having a large economic impact in developing countries. In recent years, there have been dramatic advances in the knowledge of the structure and diversity of the goat genome/transcriptome and in the development of genomic tools, rapidly narrowing the gap between goat and related species such as cattle and sheep. Major advances are: 1) publication of a de novo goat genome reference sequence; 2) development of whole genome high density RH maps; and 3) design of a commercial 50K SNP array. Moreover, there are currently several projects aiming at improving current genomic tools and resources. An improved assembly of the goat genome using PacBio reads is being produced, and the design of new SNP arrays is being studied to accommodate the specific needs of this species in the context of very large scale genotyping projects (i.e. breed characterization at an international scale and genomic selection) and parentage analysis. As in other species, the focus has now turned to the identification of causative mutations underlying the phenotypic variation of traits. In addition, since 2014, the ADAPTmap project (www.goatadaptmap.org) has gathered data to explore the diversity of caprine populations at a worldwide scale by using a wide variety of approaches and data.


June 1, 2021  |  

Improving the goat long-read assembly with optical mapping and Hi-C scaffolding

Reference genome assemblies provide important context in genetics by standardizing the order of genes and providing a universal set of coordinates for individual nucleotides. Because of the high complexity of genic regions and the higher copy number of genes involved in immune function, immunity-related genes are often misassembled in current reference assemblies. This problem is particularly common in the reference genomes of non-model organisms, as they often do not receive the years of curation necessary to resolve annotation and assembly errors. In this study, we reassemble a reference genome of the goat (Capra hircus) using modern PacBio technology in tandem with BioNano Genomics Irys optical maps and Lachesis clustering in order to provide a high quality reference assembly without the need for extensive filtering. Initial PacBio assemblies using P5C4 chemistry achieved contig N50s of 4 megabases and a BUSCO completion score of 84.0%, which is comparable to several finished model organism reference assemblies. We used BioNano Genomics’ Irys platform to generate 336 scaffolds from these data with a scaffold N50 of 24 megabases and total genome coverage of 98%. Lachesis interaction maps were used with a clustering algorithm to associate Irys scaffolds into the expected 30 chromosome physical maps. Comparisons of the initial hybrid scaffolds generated from the long read contigs and optical map information to a previously generated RH map revealed that the entirety of the goat autosome 20 physical map was contained within one scaffold. Additionally, the BioNano scaffolding resolved several difficult regions containing genes related to innate immunity that were problem regions in previous reference genome assemblies.


June 1, 2021  |  

A comprehensive study of the sugar pine (Pinus lambertiana) transcriptome implemented through diverse next-generation sequencing approaches

The assembly, annotation, and characterization of the sugar pine (Pinus lambertiana Dougl.) transcriptome represents an opportunity to study the genetic mechanisms underlying resistance to the invasive white pine blister rust (Cronartium ribicola) as well as responses to abiotic stresses. The assembled transcripts also provide a resource to improve the genome assembly. We selected a diverse set of tissues, allowing the first comprehensive evaluation of the sugar pine gene space. We have combined short read sequencing technologies (Illumina MiSeq and HiSeq) with the relatively new Pacific Biosciences Iso-Seq approach. From 2.5 billion Illumina reads and 1.6 million PacBio reads (46 SMRT Cells), 33,720 unigenes were de novo assembled. Comparison of sequencing technologies revealed improved coverage with Illumina HiSeq reads and better splice variant detection with PacBio Iso-Seq reads. The number of genes identified as unique to each library ranges from 199 transcripts (basket seedling) to 3,482 transcripts (female cones). In total, 10,026 transcripts were shared by all libraries. Genes differentially expressed in response to these stresses provided insight into abiotic and biotic stress responses. To analyze orthologous sequences, we compared the translated sequences against 19 plant species, identifying 7,229 transcripts that clustered uniquely among the conifers. We have generated here a high quality transcriptome from one WPBR-susceptible and one WPBR-resistant sugar pine individual. Through the comprehensive tissue sampling and the depth of the sequencing achieved, detailed information on disease resistance can be further examined.


June 1, 2021  |  

From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome

Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the de novo assembly surpasses reference assemblies with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and with minimal manual effort.
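The idea behind MinHash-based overlap detection can be shown with a small, generic sketch: two reads that share many k-mers will agree on many of their per-hash minimum values, so the fraction of matching minima estimates the Jaccard similarity of their k-mer sets. This is an illustration of the general MinHash technique only, not MHAP's code; the k-mer size, number of hash functions, and use of salted BLAKE2b are assumptions.

```python
import hashlib


def kmer_set(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Salting BLAKE2b with the seed gives a family of independent-looking hashes.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=64):
    """One minimum hash value per hash function (illustrative defaults)."""
    ks = kmer_set(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where two equal-length sketches agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTTGCAAGGCTTAACCGGTTACGATCGATCGGA"
    read_b = "TTGCAAGGCTTAACCGGTTACGATCGATCGGATTT"  # overlaps most of read_a
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Comparing fixed-size sketches rather than full k-mer sets is what makes this kind of all-against-all read comparison affordable at large genome scale.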

