Single-Molecule Real-Time (SMRT) DNA sequencing is unique in that nucleotide incorporation events are monitored in real time, yielding a wealth of kinetic information in addition to the primary DNA sequence. The observed dynamics of the DNA polymerase add a further dimension of sequence-dependent information that can be used to learn more about the molecule under study. First, the primary sequence itself can be determined more accurately. The kinetic data can be used to corroborate or overturn consensus calls and even enable calling bases in problematic sequence contexts. Second, using the kinetic information, we can detect and discriminate numerous chemical base modifications as a by-product of ordinary sequencing. Examples of applying these capabilities include (i) the characterization of the epigenome of microorganisms by directly sequencing the three common prokaryotic epigenetic base modifications of 4-methylcytosine, 5-methylcytosine and 6-methyladenine; (ii) the characterization of known and novel methyltransferase activities; (iii) the direct sequencing and differentiation of the four eukaryotic epigenetic forms of cytosine (5-methyl, 5-hydroxymethyl, 5-formyl, and 5-carboxylcytosine) with first applications to map them with single base-pair and DNA strand resolution across mammalian genomes; (iv) the direct sequencing and identification of numerous modified DNA bases arising from DNA damage; and (v) an exploration of the mitochondrial genome for known and novel base modifications. We will show our progress towards a generic, open-source algorithm for exploiting kinetic information for any of these purposes.
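One way to picture how kinetic information could flag a modified base is to compare polymerase timing at each position between a native sample and an unmodified control. The sketch below is only an illustration under stated assumptions, not the algorithm the abstract refers to: the inter-pulse duration (IPD) values, the function names `ipd_ratios` and `flag_candidates`, and the threshold of 2.0 are all hypothetical.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Return {position: mean native IPD / mean control IPD}.

    Both arguments map a reference position to a list of observed
    inter-pulse durations (hypothetical input format).
    """
    ratios = {}
    for pos, native in native_ipds.items():
        control = control_ipds.get(pos)
        if not native or not control:
            continue  # skip positions lacking coverage in either sample
        control_mean = mean(control)
        if control_mean <= 0:
            continue  # avoid dividing by a non-positive control mean
        ratios[pos] = mean(native) / control_mean
    return ratios


def flag_candidates(ratios, threshold=2.0):
    """Positions whose IPD ratio meets an (assumed) threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


# Toy data: position 11 shows a slowed polymerase in the native sample.
native = {10: [0.9, 1.1, 1.0], 11: [3.8, 4.2, 4.0], 12: [1.0, 1.2]}
control = {10: [1.0, 1.0, 1.0], 11: [1.0, 1.0, 1.0], 12: [1.1, 1.1]}

ratios = ipd_ratios(native, control)
candidates = flag_candidates(ratios)
```

A real analysis would use a statistical test per position and sequence-context models rather than a fixed cutoff; the fixed threshold here only keeps the example short.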
The assembly of metagenomes is dramatically improved by the long read lengths of SMRT Sequencing. This is demonstrated in an experimental design to sequence a mock community from the Human Microbiome Project and assemble the data using the hierarchical genome assembly process (HGAP) at Pacific Biosciences. Results of this analysis are promising and show much improved contiguity in the assembly of the mock community compared to publicly available short-read data sets and assemblies. Additionally, base modification information can be used to make further associations between contigs, providing additional data to improve assemblies and to distinguish between members of a microbial community. The epigenetic approach is a novel validation method unique to SMRT Sequencing. In addition to whole-genome shotgun sequencing, SMRT Sequencing also offers improved classification resolution and reliability for metagenomic and microbiome samples through full-length sequencing of 16S rRNA (~1500 bases long). Microbial communities can be detected at the species level in some cases, rather than being limited to genus-level taxonomic classification as with short-read technologies. SMRT Sequencing of these metagenomic samples achieved >99% predicted concordance to reference sequences in cecum, soil, water, and mock control investigations for bacterial 16S. Community samples are estimated to contain from 2.3 to 15 times as many species with abundance levels as low as 0.05% compared to the identification of phyla groups.
Capturing the chicken transcriptome with PacBio long read RNA-seq data OR “Chicken in awesome sauce: a recipe for new transcript identification.”
PacBio 2014 User Group Meeting Presentation Slides: Alisha Holloway of the Gladstone Institutes presented on the use of isoform sequencing (Iso-Seq) to improve the annotation of the chicken genome as a model reference for cardiovascular research.
Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly has dropped precipitously, which has spurred interest in genome sequencing overall. Unfortunately, both the contiguity and the accuracy of NGS assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference-quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove difficult to annotate properly. In some cases the contiguous portions are smaller than the average gene size, making the sequence far less useful for biologists than the earlier reference-quality genomes, including those of human, mouse, C. elegans, and Drosophila. Recently, new third-generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high-quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life, from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis of genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.
2015 SMRT Informatics Developers Conference Presentation Slides: Shinichi Morishita of the University of Tokyo presented on how his team has been using SMRT Sequencing to better understand methylomes, metagenomes and structural variation of various eukaryotic genomes.
Characterizing haplotype diversity at the immunoglobulin heavy chain locus across human populations using novel long-read sequencing and assembly approaches
The human immunoglobulin heavy chain locus (IGH) remains among the most understudied regions of the human genome. Recent efforts have shown that haplotype diversity within IGH is elevated and exhibits population-specific patterns; for example, our re-sequencing of the locus from only a single chromosome uncovered >100 Kb of novel sequence, including descriptions of six novel alleles and four previously unmapped genes. Historically, this complex locus architecture has hindered the characterization of IGH germline single nucleotide, copy number, and structural variants (SNVs; CNVs; SVs), and as a result, little is known about the role of IGH polymorphisms in inter-individual antibody repertoire variability and disease. To remedy this, we are taking a multi-faceted approach to improving existing genomic resources in the human IGH region. First, from whole-genome and fosmid-based datasets, we are building the largest and most ethnically diverse set of IGH reference assemblies to date by employing PacBio long-read sequencing combined with novel algorithms for phased haplotype assembly. In total, our effort will result in the characterization of >15 phased haplotypes from individuals of Asian, African, and European descent, to be used as a representative reference set by the genomics and immunogenetics community. Second, we are utilizing this more comprehensive sequence catalogue to inform the design and analysis of novel targeted IGH genotyping assays. Standard targeted DNA enrichment methods (e.g., exome capture) are currently optimized for the capture of only very short (hundreds of bp) DNA segments. Our platform uses a modified bench protocol to pair existing capture-array technologies with the enrichment of longer fragments of DNA, enabling PacBio sequencing of DNA segments up to 7 Kb.
This substantial increase in contiguity disambiguates many of the complex repeated structures inherent to the locus, while yielding the base pair fidelity required to call SNVs. Together these resources will establish a stronger framework for further characterizing IGH genetic diversity and facilitate IGH genomic profiling in the clinical and research settings, which will be key to fully understanding the role of IGH germline variation in antibody repertoire development and disease.
Over 40% of male and ~16% of female carriers of an FMR1 premutation allele (55-200 CGG repeats) are at risk of developing Fragile X-associated Tremor/Ataxia Syndrome (FXTAS), an adult-onset neurodegenerative disorder, while about 20% of female carriers will develop Fragile X-associated Primary Ovarian Insufficiency (FXPOI), in addition to a number of adult-onset clinical problems (FMR1-associated disorders). Marked elevation in FMR1 mRNA levels has been observed with premutation alleles, and the resulting RNA toxicity is believed to be the leading molecular mechanism underlying these disorders. The FMR1 gene, like many housekeeping genes, undergoes alternative splicing. Using long-read isoform sequencing (SMRT) and qRT-PCR, we have recently reported that, although the relative abundance of all FMR1 mRNA isoforms is significantly increased in the premutation group compared to controls, there is a disproportionate increase, relative to the overall increase in mRNA, in the abundance of isoforms spliced at both exons 12 and 14. In total, we confirmed the existence of 16 out of 24 predicted isoforms in our samples. However, it is unknown which isoforms, when overexpressed, may contribute to the premutation pathology. To address this question, we have further defined the transcriptional distribution pattern of FMR1 isoforms in different tissues, including heart, muscle, brain and testis, derived from FXTAS premutation carriers and age-matched controls. Preliminary data indicate the presence of a transcriptional signature of the FMR1 gene, which clusters more by individual than by tissue type. We identified isoforms beyond the 16 reported in our previous study, including a group with particular splice patterns that were observed only in premutation carriers and not in controls.
Our findings suggest that the characterization of expression levels of the different FMR1 isoforms is fundamental for understanding the regulation of the FMR1 gene as well as for elucidating the mechanism(s) by which “toxic gain of function” of the FMR1 mRNA may play a role in FXTAS and/or in the other FMR1-associated conditions. In addition to the elevated levels of FMR1 isoforms, the altered abundance/ratio of the corresponding FMRP isomers may affect the overall function of FMRP in premutations.
Library prep and bioinformatics improvements for full-length transcript sequencing on the PacBio Sequel System
The PacBio Iso-Seq method produces high-quality, full-length transcripts of up to and beyond 10 kb and has been used to annotate many important plant and animal genomes. Here we describe an improved, simplified library workflow and analysis pipeline that reduces library preparation time, RNA input, and cost. The Iso-Seq V2 Express workflow is a one-day protocol that requires only ~300 ng of total RNA input while also reducing the number of reverse transcription and amplification steps to single reactions. Compared with the previous workflow, the Iso-Seq V2 Express workflow increases the percentage of full-length (FL) reads while achieving a higher average transcript length. At the same time, the Iso-Seq 3 analysis recently released in the SMRT Link 6.0 software is a major improvement over previous versions. Iso-Seq 3 is highly accurate at detecting and removing library artifacts (TSO and RT artifacts) as well as differentiating barcodes on multiplexed samples. Iso-Seq 3 matches previous versions in the output of high-quality transcript sequences while dramatically reducing runtime and memory usage.
Single cell isoform sequencing (scIso-Seq) identifies novel full-length mRNAs and cell type-specific expression
Single cell RNA-seq (scRNA-seq) is an emerging field for characterizing cell heterogeneity in complex tissues. However, most scRNA-seq methodologies are limited to gene count information due to short read lengths. Here, we combine the microfluidics scRNA-seq technique, Drop-Seq, with PacBio Single Molecule, Real-Time (SMRT) Sequencing to generate full-length transcript isoforms that can be confidently assigned to individual cells. We generated single cell Iso-Seq (scIso-Seq) libraries for chimp and human cerebral organoid samples on the Dolomite Nadia platform and sequenced each library with two SMRT Cells 8M on the PacBio Sequel II System. We developed a bioinformatics pipeline to identify, classify, and filter full-length isoforms at the single-cell level. We show that scIso-Seq reveals full-length isoform information not accessible using short reads that can reveal differences between cell types and amongst different species.
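As a rough illustration of what assigning full-length isoforms to individual cells can look like, the sketch below tallies isoforms per cell barcode and keeps only those supported by a minimum number of reads. It is not the pipeline described in the abstract: the input format, the function name `isoforms_per_cell`, the isoform labels, and the `min_reads` cutoff are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def isoforms_per_cell(reads, min_reads=2):
    """Count isoforms per cell barcode and drop weakly supported ones.

    `reads` is an iterable of (cell_barcode, isoform_id) pairs
    (hypothetical format). An isoform is kept for a cell only if at
    least `min_reads` reads support it in that cell.
    """
    counts = defaultdict(Counter)
    for barcode, isoform in reads:
        counts[barcode][isoform] += 1
    return {
        barcode: {iso: n for iso, n in per_cell.items() if n >= min_reads}
        for barcode, per_cell in counts.items()
    }


# Toy data: two cells, two isoforms of a hypothetical gene.
reads = [
    ("AAAC", "geneA.1"),
    ("AAAC", "geneA.1"),
    ("AAAC", "geneA.2"),  # single read, filtered out at min_reads=2
    ("GGTT", "geneA.2"),
    ("GGTT", "geneA.2"),
]

per_cell = isoforms_per_cell(reads)
```

With these toy reads, each cell retains a different isoform, which is the kind of cell-level difference that gene-count data alone would not show.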
In this ASHG 2016 virtual poster, Flora Tassone from UC Davis describes her study of the molecular mechanisms linked to fragile X syndrome and associated disorders, such as FXTAS. She…
ASHG PacBio Workshop: Identification and characterization of informative genetic structural variants for neurodegenerative diseases
Michael Lutz, from the Duke University Medical Center, discussed a recently published software tool that can now be used in a pipeline with SMRT Sequencing data to find structural variant…
At AGBT 2017, Margaret Roy from Calico Life Sciences discussed a de novo genome sequencing effort for the naked mole rat. This animal has a remarkably long life span and…
In this webinar, Emily Hatas of PacBio shares information about the applications and benefits of SMRT Sequencing in plant and animal biology, agriculture, and industrial research fields. This session contains…
Long-read sequencing technologies like Iso-Seq analysis present researchers with a powerful tool for probing the transcriptomes of many species. The ability to sequence transcripts from end-to-end has revealed transcription complexity…
ASHG PacBio Workshop: Characterization of a large, human-specific tandem repeat array associated with bipolar disorder and schizophrenia
In this ASHG workshop presentation, Janet Song of Stanford School of Medicine shared research on resolving a tandem repeat array implicated in bipolar disorder and schizophrenia. These psychiatric diseases share…