An important need in analyzing complex genomes is the ability to separate and phase haplotypes. While whole genome assembly can deliver this information, it cannot reveal whether there is allele-specific gene or isoform expression. The PacBio Iso-Seq method, which can produce high-quality transcript sequences of 10 kb and longer, has been used to annotate many important plant and animal genomes. We present an algorithm called IsoPhase that post-processes Iso-Seq data for transcript-based haplotyping. We applied IsoPhase to a maize Iso-Seq dataset consisting of two homozygous parents and two F1 cross hybrids. We validated the majority of the SNPs called with IsoPhase against matching short read data and identified cases of allele-specific, gene-level and isoform-level expression.
Library prep and bioinformatics improvements for full-length transcript sequencing on the PacBio Sequel System
The PacBio Iso-Seq method produces high-quality, full-length transcripts of 10 kb and longer and has been used to annotate many important plant and animal genomes. Here we describe an improved, simplified library workflow and analysis pipeline that reduces library preparation time, RNA input, and cost. The Iso-Seq V2 Express workflow is a one-day protocol that requires only ~300 ng of total RNA input while also reducing the reverse transcription and amplification steps to single reactions. Compared with the previous workflow, the Iso-Seq V2 Express workflow increases the percentage of full-length (FL) reads while achieving a higher average transcript length. At the same time, the Iso-Seq 3 analysis recently released in the SMRT Link 6.0 software is a major improvement over previous versions. Iso-Seq 3 is highly accurate at detecting and removing library artifacts (TSO and RT artifacts) as well as differentiating barcodes on multiplexed samples. Iso-Seq 3 matches previous versions in its output of high-quality transcript sequences while dramatically reducing runtime and memory usage.
Streamlined SMRTbell library generation using an addition-only, single-tube strategy for all library types reduces time to results
We have streamlined the SMRTbell library generation protocols with improved workflows to deliver seamless end-to-end solutions from sample to analysis. A key improvement is the development of a single-tube reaction strategy that shortens the hands-on time needed to generate each SMRTbell library, reduces time-consuming AMPure purification steps, and minimizes sample-handling-induced gDNA damage to improve the integrity of long-insert SMRTbell templates for sequencing. The improved protocols support all large-insert genomic libraries, multiplexed microbial genomes, and amplicon sequencing. These advances enable completion of library preparation in less than a day (approximately 4 hours) and open opportunities for automated library preparation for large-scale projects. Here we share data summarizing performance of the new SMRTbell Express Template Kit 2.0, representing our solutions for 10 kb and >50 kb large-insert genomic libraries, complete microbial genome assemblies, and high-throughput amplicon sequencing. The improved throughput of the Sequel System, with read lengths up to 30 kb and high consensus accuracy (>99.999%), makes sequencing with high-quality results increasingly accessible to the community.
Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements
Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes, including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering >30 Gb per SMRT Cell 1M. By integrating long- and short-read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.
Sequencing the previously unsequenceable using amplification-free targeted enrichment powered by CRISPR/Cas9
Genomic regions with extreme base composition bias and repetitive sequences have long proven challenging for targeted enrichment methods, as they rely upon some form of amplification. Similarly, most DNA sequencing technologies struggle to faithfully sequence regions of low complexity. This has especially been true for repeat expansion disorders such as Fragile X syndrome, Huntington’s disease and various ataxias, where the repetitive elements range from several hundreds of bases to tens of kilobases. We have developed a robust, amplification-free targeted enrichment technique, called No-Amp Targeted Sequencing, that employs the CRISPR/Cas9 system. In conjunction with Single Molecule, Real-Time (SMRT) Sequencing, which delivers long reads spanning the entire repeat expansion, high consensus accuracy, and uniform coverage, these previously inaccessible regions are now accessible. This method is completely amplification-free, therefore removing any PCR errors and biases from the experiment. Furthermore, this technique also preserves native DNA molecules, allowing for direct detection and characterization of epigenetic signatures. The No-Amp method is a two-day protocol, compatible with multiplexing of multiple targets and samples in a single reaction, using as little as 1 µg of genomic DNA input per sample. We have successfully targeted a number of repeat expansion disorder loci (HTT, FMR1, ATXN10, C9orf72) with alleles as long as >2700 repeat units (>13 kb). Using the No-Amp method we have isolated hundreds of individual on-target molecules, allowing for reliable repeat size estimation, mosaicism detection and identification of interruption sequences, all of which are aspects of repeat expansion disorders that are important for better understanding the underlying disease mechanisms.
Background: The sequencing and haplotype phasing of entire gene sequences improves the understanding of the genetic basis of disease and drug response. One example is cystic fibrosis (CF). Cystic fibrosis transmembrane conductance regulator (CFTR) modulator therapies have revolutionized CF treatment, but only in a minority of CF subjects. Observed heterogeneity in CFTR modulator efficacy is related to the range of CFTR mutations; revertant mutations can modify the response to CFTR modulators, and other intronic variations in the ~200 kb CFTR gene have been linked to disease severity. Heterogeneity in the CFTR gene may also be linked to differential responses to CFTR modulators. The Targeted Locus Amplification (TLA) technology from Cergentis can be used to selectively amplify, sequence and phase the entire CFTR gene. With PacBio long-read SMRT Sequencing, TLA amplicons are sequenced intact and long-range phasing information of all fragments in entire amplicons is retrieved. Experimental Design and Methods: The TLA process produces amplicons consisting of 5-10 proximity ligated DNA fragments. TLA was performed on cell line and genomic DNA from Coriell GM12878, which has few heterozygous SNVs in CFTR, and the IB3 cell line, with known haplotypes but heterozygous for the delta508 mutation. All sample types were prepared with high and low density TLA primer sets, targeting coverage of >100 kb of the CFTR gene. Conclusion: We have demonstrated the power and utility of TLA with long-read SMRT Sequencing as a valuable research tool in sequencing and phasing across very long regions of the human genome. This process can be done in an efficient manner, multiplexing multiple genes and samples per SMRT Cell in a process amenable to high-throughput sequencing.
The PacBio Iso-Seq method produces high-quality, full-length transcripts of 10 kb and longer and has been used to annotate many important plant and animal genomes. We describe here the full Iso-Seq ecosystem that enables researchers to achieve high-quality genome annotations. The Iso-Seq Express workflow is a 1-day protocol that requires only 60-300 ng of total RNA and supports multiplexing of different tissues. Sequencing on a single SMRT Cell 8M on the Sequel II System produces up to 4 million full-length reads, sufficient to exhaustively characterize a whole transcriptome on the order of 15,000-17,000 genes with 100,000 or more transcripts. Most importantly, the method is supported by a maturing suite of official and community-developed tools. The SMRT Link Iso-Seq application outputs high-quality (>99% accurate), full-length transcript sequences that can optionally be mapped to a reference genome, processing a single SMRT Cell worth of data in 6-9 hours. Among the community-developed tools, SQANTI2 classifies Iso-Seq transcripts against a reference annotation, filters potential library artifacts, and processes information from both long read-only and short read-based quantification. IsoPhase is a tool for identifying allele-specific isoform expression. Cogent has been used to process Iso-Seq transcripts in a genome-independent manner to assess genome assemblies. Finally, IsoAnnot is an up-and-coming tool for identifying differential isoform expression across different samples. We describe how these tools complement each other and provide guidelines for making the best use of Iso-Seq data for understanding transcriptomes.
Amplification-free protocol for targeted enrichment of repeat expansion genomic regions and SMRT Sequencing
Many genetic disorders are associated with repeat sequence expansions. Obtaining accurate DNA sequence information from these regions will help researchers further establish the relationship between these genetic disorders and underlying disease mechanisms. Moreover, repeat interruptions have also been shown to act as phenotypic modifiers in some disorders. Targeted sequencing is an economical way to obtain sequence information from one or more defined regions in a genome. However, most targeted enrichment and sequencing methods require some form of DNA amplification. Amplifying large regions with extreme GC content, as seen in repeat expansion disorders, is challenging and prone to introducing sequence artifacts. DNA amplification also removes any epigenetic signatures present in native DNA. This technique also preserves native DNA molecules for the possibility of direct characterization of epigenetic signatures.
The latest advancements in Sequel II SMRT Sequencing have increased average read lengths by up to 50% compared to Sequel II chemistry 1.0. In a single SMRT Cell 8M, this allows multiplexing of 2-3 small organisms (<500 Mb) such as insects and worms for producing reference-quality assemblies, calling structural variants for up to 2 samples with ~3 Gb genomes, analysis of 48 microbial genomes, and metagenomic profiling of up to 8 communities. With the improved processivity of the new Sequel II sequencing polymerase, more SMRTbell molecules reach rolling circle mode, resulting in longer overall read lengths and thus allowing efficient detection of barcodes (up to 80%) in the SMRTbell templates. Multiplexing of genomes larger than microbial organisms is now achievable. In collaboration with the Wellcome Sanger Institute, we have developed a workflow for multiplexing two Anopheles coluzzii individuals using as little as 150 ng of genomic DNA per individual. The resulting assemblies had high contiguity (contig N50s over 3 Mb) and completeness (>98% of conserved genes) for both individuals. For microbial multiplexing, we multiplexed 48 microbes with varying complexities and sizes ranging from 1.6 to 8.0 Mb in a single SMRT Cell 8M. Using a new end-to-end analysis (Microbial Assembly Analysis, SMRT Link 8.0), assemblies resulted in complete circularized genomes (>200-fold coverage) and efficient detection of >3-200 kb plasmids. Finally, the long read lengths (>90 kb) allow detection of barcodes in large-insert SMRTbell templates (>15 kb), thus facilitating multiplexing of two human samples in 1 SMRT Cell 8M for detecting SVs, indels and CNVs. Here, we present results and describe workflows for multiplexing samples for specific applications for SMRT Sequencing.
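For readers unfamiliar with the contiguity metric quoted above, the following is a minimal sketch of how a contig N50 is conventionally computed: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions and are not taken from the workflow described in the abstract.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Hypothetical helper for illustration only: contigs are visited from
    longest to shortest, and the first contig whose cumulative length
    reaches at least half of the total assembly size defines the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy contig lengths in Mb (made-up values, not real assembly data).
toy_contigs_mb = [6, 5, 4, 3, 2]
toy_n50 = contig_n50(toy_contigs_mb)
print(toy_n50)
```

An N50 over 3 Mb therefore means that at least half of the assembled bases sit in contigs of 3 Mb or longer.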
Complete, high-quality microbial genomes are very valuable across a broad array of fields, from environmental studies, to human microbiome health, food pathogen surveillance, etc. Long-read sequencing enables accurate resolution of complex microbial genomes and is becoming the new standard. Here we report our novel Microbial Assembly pipeline to facilitate rapid, large-scale analysis of microbial genomes. We sequenced a 48-plex library with one SMRT Cell 8M on the Sequel II System, demultiplexed, then analyzed the data with Microbial Assembly.
Amplification-free targeted enrichment powered by CRISPR-Cas9 and long-read Single Molecule Real-Time (SMRT) Sequencing can efficiently and accurately sequence challenging repeat expansion disorders
Genomic regions with extreme base composition bias and repetitive sequences have long proven challenging for targeted enrichment methods, as they rely upon some form of amplification. Similarly, most DNA sequencing technologies struggle to faithfully sequence regions of low complexity. This has been especially trying for repeat expansion disorders such as Fragile X syndrome, Huntington disease and various ataxias, where the repetitive elements range from several hundreds of bases to tens of kilobases. We have developed a robust, amplification-free targeted enrichment technique, called No-Amp Targeted Sequencing, that employs the CRISPR-Cas9 system. In conjunction with SMRT Sequencing, which delivers long reads spanning the entire repeat expansion, high consensus accuracy, and uniform coverage, these previously inaccessible regions are now accessible. This method is completely amplification-free, therefore removing any PCR errors and biases from the experiment. Furthermore, this technique also preserves native DNA molecules, allowing for direct detection and characterization of epigenetic signatures. The No-Amp method is a two-day protocol that is compatible with multiplexing of multiple targets and multiple samples in a single reaction, using as little as 1 µg of genomic DNA input per sample. We have successfully targeted a number of repeat expansion disorder loci including HTT, FMR1, and C9orf72, and have built an ataxia panel that consists of 15 different disease-causing repeat expansion regions. Using the No-Amp method we have isolated hundreds of individual on-target molecules, allowing for reliable repeat size estimation, mosaicism detection and identification of interruption sequences with alleles as long as >2700 repeat units (>13 kb). In addition to multiplexing several targets, we have also multiplexed at least 20 samples in one experiment, making the No-Amp Targeted Sequencing method a cost-effective option.
Combining the CRISPR-Cas9 enrichment method with Single Molecule, Real-Time Sequencing provided us with base-level resolution of previously inaccessible regions of the genome, like disease-causing repeat expansions. No-Amp Targeted Sequencing captures, in one experiment, many aspects of repeat expansion disorders which are important for better understanding the underlying disease mechanisms.
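To make the idea of per-molecule repeat size estimation concrete, here is a minimal sketch that counts the longest uninterrupted run of a repeat unit in each read and tallies the counts across reads; a spread of counts within one sample is the kind of signal that mosaicism detection looks for. This is an illustration under stated assumptions, not the No-Amp analysis pipeline: the function names and toy reads are hypothetical, and exact string matching ignores sequencing errors and repeat interruptions that a real analysis would have to handle.

```python
import re
from collections import Counter


def longest_repeat_run(read, unit):
    """Largest number of back-to-back exact copies of `unit` found in `read`.

    Hypothetical helper: uses exact matching only, so an interruption or a
    sequencing error splits a run into shorter pieces.
    """
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    best = 0
    for match in pattern.finditer(read):
        best = max(best, len(match.group(0)) // len(unit))
    return best


def repeat_size_histogram(reads, unit):
    """Map each observed repeat count to the number of reads showing it."""
    return Counter(longest_repeat_run(read, unit) for read in reads)


# Toy reads (made-up flanking sequence) carrying CAG repeats, the unit
# expanded in HTT; two reads have 20 copies and one has 45.
toy_reads = [
    "GGC" + "CAG" * 20 + "CCG",
    "GGC" + "CAG" * 20 + "CCG",
    "GGC" + "CAG" * 45 + "CCG",
]
histogram = repeat_size_histogram(toy_reads, "CAG")
print(sorted(histogram.items()))
```

Because every on-target molecule is counted separately, the histogram keeps rare allele sizes visible instead of averaging them away.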
A brief animated introduction to Pacific Biosciences’ Single Molecule, Real-Time (SMRT) Sequencing, including the SMRT Cell and ZMW (zero mode waveguide).
Tutorial: Base modification detection, base modification and motif analysis application [SMRT Link v5.0.0]
This tutorial provides an overview of the Base Modification and Motif analysis application for identifying common bacterial epigenetic modifications and analyzing methyltransferase recognition motifs. SMRT Analysis software supports epigenetic research…
This tutorial provides an overview of the Circular Consensus Sequence (CCS) analysis application. The CCS algorithm is used in applications that require distinguishing closely related DNA molecules in the same…
This tutorial provides an overview of the Hierarchical Genome Assembly Process (HGAP4) de novo assembly analysis application. HGAP4 generates accurate de novo assemblies using only PacBio data. HGAP4 is suitable…