June 1, 2021  |  

Detection and phasing of small variants in Genome in a Bottle samples with highly accurate long reads

Introduction: Long-read PacBio SMRT Sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw read error rates of 10-15%, it has remained difficult to call small variants from long reads. Recent improvements in library preparation, sequencing chemistry, and instrument yield have increased length, accuracy, and throughput of PacBio Circular Consensus (CCS) reads, resulting in 10-20 kb “HiFi” reads with mean read quality above 99%. Materials and Methods: We sequenced 11 kb size-selected libraries from the Genome in a Bottle (GIAB) human reference samples HG001, HG002, and HG005 to approximately 30-fold coverage on the Sequel II System with six SMRT Cells 8M each. The CCS algorithm was used to generate highly accurate (average 99.8%) reads of mean length 10-11 kb, which were then mapped to the hs37d5 reference with pbmm2. We detected small variants using Google DeepVariant and compared these variant calls to GIAB benchmarks. Small variants were then phased with WhatsHap. Results: With these long, highly accurate CCS reads, DeepVariant achieves high SNP and Indel accuracy against the GIAB benchmark truth set for all three reference samples. Using WhatsHap, small variants were phased into haplotype blocks with N50 from 82 to 146 kb. The improved mappability of long reads allows detection of variants in many medically relevant genes such as CYP2D6 and PMS2 that have proven ‘difficult-to-map’ with short reads. We show that small variant precision and recall remain high down to 15-fold coverage. Conclusions: These highly accurate long reads combine the mappability of noisy long reads with the accuracy and small variant detection utility of short reads, which will allow the detection and phasing of variants in regions that have proven recalcitrant to short read sequencing and variant detection.
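
The haplotype block N50 quoted above is a length-weighted median: the block length at which blocks of that size or larger contain at least half of all phased bases. The following is a minimal sketch of that calculation, assuming block lengths are already available as a plain list of integers. The function name and the example lengths are hypothetical and are not taken from WhatsHap or from the study data.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths.

    The N50 is the largest length L such that blocks of length >= L
    together cover at least half of the summed length of all blocks.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest block down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical block lengths in kb, for illustration only.
example_blocks_kb = [10, 20, 30, 40]
print(n50(example_blocks_kb))
```

The same function applies unchanged to the contig N50 values reported in the assembly abstracts in this listing.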


June 1, 2021  |  

Every species can be a model: Reference-quality PacBio genomes from single insects

A high-quality reference genome is an essential resource for primary and applied research across the tree of life. Genome projects for small-bodied, non-model organisms such as insects face several unique challenges including limited DNA input quantities, high heterozygosity, and difficulty of culturing or inbreeding in the lab. Recent progress in PacBio library preparation protocols, sequencing throughput, and read accuracy addresses these challenges. We present several case studies including the Red Admiral (Vanessa atalanta), Monarch Butterfly (Danaus plexippus), and Anopheles malaria mosquitoes that highlight the benefits of sequencing single individuals for de novo genome assembly projects, and the ease with which these projects can be conducted by individual research labs. Sampled individuals may originate from lab colonies of interest to the research community or be sourced from the wild to better capture natural variation in a focal population. Where genomic DNA quantities are limited, the PacBio Low DNA Input Protocol requires ~100 ng of input DNA. Low DNA input samples with 500 Mb genome size or less can be multiplexed on a single SMRT Cell 8M on the Sequel II System. For samples with more abundant DNA quantity, size-selected libraries may be constructed to maximize sequencing yield. Both low DNA input and size-selected libraries can be used to generate HiFi reads, whose quality is Q20 or above (1% error or less) and whose lengths range from 10 to 25 kb. With HiFi reads, de novo assembly computation is greatly simplified relative to long read methods due to smaller sequence file sizes and more rapid analysis, resulting in highly accurate, contiguous, complete, and haplotype-resolved assemblies.
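
The equivalence between Q20 and a 1% error rate follows from the standard Phred scale, Q = -10 log10(p), where p is the error probability. A minimal sketch of the conversion in both directions is shown below; the function names are illustrative and are not part of any PacBio tool.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Q20 corresponds to an error probability of 0.01, i.e. 99% accuracy.
print(phred_to_error(20))
# An accuracy of 99.8% (error 0.002) corresponds to roughly Q27.
print(round(error_to_phred(0.002), 1))
```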


June 1, 2021  |  

A complete solution for high-quality genome annotation using the PacBio Iso-Seq method

The PacBio Iso-Seq method produces high-quality, full-length transcripts of 10 kb and longer and has been used to annotate many important plant and animal genomes. We describe here the full Iso-Seq ecosystem that enables researchers to achieve high-quality genome annotations. The Iso-Seq Express workflow is a 1-day protocol that requires only 60-300 ng of total RNA and supports multiplexing of different tissues. Sequencing on a single SMRT Cell 8M on the Sequel II System produces up to 4 million full-length reads, sufficient to exhaustively characterize a whole transcriptome on the order of 15,000-17,000 genes with 100,000 or more transcripts. Most importantly, the method is supported by a maturing suite of official and community-developed tools. The SMRT Link Iso-Seq application outputs high-quality (>99% accurate), full-length transcript sequences that can optionally be mapped to a reference genome for a single SMRT Cell worth of data in 6-9 hours. For example, the SQANTI2 tool classifies Iso-Seq transcripts against a reference annotation, filters potential library artifacts, and processes information from both long read-only and short read-based quantification. IsoPhase is a tool for identifying allele-specific isoform expression. Cogent has been used to process Iso-Seq transcripts in a genome-independent manner to assess genome assemblies. Finally, IsoAnnot is an up-and-coming tool for identifying differential isoform expression across different samples. We describe how these tools complement each other and provide guidelines for making the best use of Iso-Seq data for understanding transcriptomes.


June 1, 2021  |  

A high-quality PacBio insect genome from 5 ng of input DNA

High-quality insect genomes are essential resources to understand insect biology and to combat insects as disease vectors and agricultural pests. It is desirable to sequence a single individual for a reference genome to avoid complications from multiple alleles during de novo assembly. However, the small body size of many insects poses a challenge for the use of long-read sequencing technologies, which often have high DNA-input requirements. The previously described PacBio Low DNA Input Protocol starts with ~100 ng of DNA, allows for high-quality assemblies of single mosquitoes among others, and represents a significant step in reducing such requirements. Here, we describe a new library protocol with a further 20-fold reduction in the DNA input quantity. Starting with just 5 ng of high molecular weight DNA, we describe the successful sequencing and de novo genome assembly of a single male sandfly (Phlebotomus papatasi, the main vector of Old World cutaneous leishmaniasis), using HiFi data generated on the PacBio Sequel II System and assembled with FALCON. The assembly shows a high degree of completeness (>97% of BUSCO genes are complete), contiguity (contig N50 of 1 Mb), and sequence accuracy (>98% of BUSCO genes without frameshift errors). This workflow has general utility for small-bodied insects and other plant and animal species, both for focused research studies and in conjunction with large-scale genome projects.


June 1, 2021  |  

Amplification-free protocol for targeted enrichment of repeat expansion genomic regions and SMRT Sequencing

Many genetic disorders are associated with repeat sequence expansions. Obtaining accurate DNA sequence information from these regions will help researchers further establish the relationship between these genetic disorders and underlying disease mechanisms. Moreover, repeat interruptions have also been shown to act as phenotypic modifiers in some disorders. Targeted sequencing is an economical way to obtain sequence information from one or more defined regions in a genome. However, most targeted enrichment and sequencing methods require some form of DNA amplification. Amplifying large regions with extreme GC content, as seen in repeat expansion disorders, is challenging and prone to introducing sequence artifacts. DNA amplification also removes any epigenetic signatures present in native DNA. An amplification-free protocol, by contrast, preserves native DNA molecules, leaving open the possibility of direct characterization of epigenetic signatures.


June 1, 2021  |  

A complete solution for full-length transcript sequencing using the PacBio Sequel II System

Long read mRNA sequencing methods such as PacBio’s Iso-Seq method offer high-throughput transcriptome profiling in prokaryotic and eukaryotic cells. By avoiding the transcript assembly problem and instead sequencing full-length cDNA, Iso-Seq has emerged as the most reliable technology for annotating isoforms and, in turn, improving proteome predictions in a wide variety of organisms. Improvements in library preparation, sequencing throughput, and bioinformatics have enabled the Iso-Seq method to be a complete solution for transcript characterization. The Iso-Seq Express kit is a one-day library prep requiring 60-300 ng of total RNA. The PacBio Sequel II System produces 4-5 million full-length reads, sufficient to profile a whole human transcriptome. Finally, the SQANTI2 software is a powerful tool for categorizing the complex isoforms against reference annotations, while also incorporating orthogonal information such as CAGE peak data, public RNA-seq junction data, and ORF predictions.


June 1, 2021  |  

New advances in SMRT Sequencing facilitate multiplexing for de novo and structural variant studies

The latest advancements in Sequel II SMRT Sequencing have increased average read lengths by up to 50% compared to Sequel II chemistry 1.0. In a single SMRT Cell 8M, this allows multiplexing of 2-3 small organisms (<500 Mb) such as insects and worms for producing reference-quality assemblies, calling structural variants for up to 2 samples with ~3 Gb genomes, analysis of 48 microbial genomes, and metagenomic profiling of up to 8 communities. With the improved processivity of the new Sequel II sequencing polymerase, more SMRTbell molecules reach rolling circle mode, resulting in longer overall read lengths and thus allowing efficient detection of barcodes (up to 80%) in the SMRTbell templates. Multiplexing of genomes larger than microbial organisms is now achievable. In collaboration with the Wellcome Sanger Institute, we have developed a workflow for multiplexing two individual Anopheles coluzzii using as little as 150 ng of genomic DNA per individual. The resulting assemblies had high contiguity (contig N50s over 3 Mb) and completeness (>98% of conserved genes) for both individuals. For microbial multiplexing, we multiplexed 48 microbes with varying complexities and sizes ranging from 1.6 to 8.0 Mb in a single SMRT Cell 8M. Using a new end-to-end analysis (Microbial Assembly Analysis, SMRT Link 8.0), assemblies resulted in complete circularized genomes (>200-fold coverage) and efficient detection of >3-200 kb plasmids. Finally, the long read lengths (>90 kb) allow detection of barcodes in large-insert SMRTbell templates (>15 kb), thus facilitating multiplexing of two human samples in 1 SMRT Cell 8M for detecting SVs, indels and CNVs. Here, we present results and describe workflows for multiplexing samples for specific applications for SMRT Sequencing.


June 1, 2021  |  

Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System

Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with one sequencing technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and lack of GC bias in SMRT Sequencing has enabled better gene finding. However, since long-read assembly typically requires high coverage for error correction, these benefits have in the past been lost for low-abundance species. The introduction of the Sequel II System has enabled a new, higher throughput, assembly-optional data type that addresses these challenges: HiFi reads. HiFi reads combine QV20 accuracy with long read lengths, eliminating the need for assembly for most metagenome applications, including gene discovery and metabolic pathway reconstruction. In fact, the read lengths and accuracy of HiFi data match or outperform the quality metrics of most metagenome assemblies, enabling cost-effective recovery of intact genes and operons while omitting the resource-intensive and data-inefficient assembly step. Here we present the application of HiFi sequencing to both mock and human fecal samples using full-length 16S and shotgun methods. This proof-of-concept work demonstrates the unique strengths of the HiFi method. First, the high correspondence between the expected community composition, 16S and shotgun profiling data reflects low context bias. In addition, every HiFi read yields ~5-8 predicted genes, without assembly, using standard tools. If assembly is desired, excellent results can be achieved with Canu and contig binning tools. In summary, HiFi sequencing is a new, cost-effective option for high-resolution functional profiling of metagenomes which complements existing short read workflows.


June 1, 2021  |  

Copy-number variant detection with PacBio long reads

Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a healthy human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies. To comprehensively detect large variants in human genomes, we extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Together with insertion, deletion, inversion, duplication, and translocation calling from spanning reads, this allows pbsv to comprehensively detect large variants from a single data type.
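
To illustrate what a read-depth signature looks like, the following toy sketch flags genomic windows whose depth departs from the sample-wide median. It is not the pbsv algorithm, which the abstract does not describe in detail; the function name, the thresholds (1.5x for a gain, 0.5x for a loss), and the example depths are all assumptions chosen for illustration.

```python
from statistics import median


def classify_windows(depths, gain=1.5, loss=0.5):
    """Label each window as 'gain', 'loss', or 'neutral' by depth ratio.

    depths: per-window mean read depth (hypothetical input).
    gain, loss: assumed ratio thresholds relative to the median depth.
    """
    baseline = median(depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    calls = []
    for depth in depths:
        ratio = depth / baseline
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Hypothetical per-window depths around a 30-fold baseline.
example_depths = [30, 31, 29, 60, 62, 30, 14, 30]
print(classify_windows(example_depths))
```

A production caller would also need to normalize for GC content and mappability and to merge adjacent windows into events; the read-clipping signal mentioned in the abstract is a separate line of evidence that this sketch does not model.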


June 1, 2021  |  

Low-input single molecule HiFi sequencing for metagenomic samples

HiFi sequencing on the PacBio Sequel II System enables complete microbial community profiling of complex metagenomic samples using whole genome shotgun sequences. With HiFi sequencing, highly accurate long reads overcome the challenges posed by the presence of intergenic and extragenic repeat elements in microbial genomes, thus greatly improving phylogenetic profiling and sequence assembly. Recent improvements in library construction protocols enable HiFi sequencing starting from as little as 5 ng of input DNA. Here, we demonstrate comparative analyses of a control sample of known composition and a human fecal sample from varying amounts of input genomic DNA (1 µg, 200 ng, 5 ng), and present the corresponding library preparation workflows for standard, low input, and Ultra-Low methods. We demonstrate that the metagenome assembly, taxonomic assignment, and gene finding analyses are comparable across all methods for both samples, providing access to HiFi sequencing even for DNA-limited sample types.


June 1, 2021  |  

Metagenomic analysis of type II diabetes gut microbiota using PacBio HiFi reads reveals taxonomic and functional differences

In the past decade, the human microbiome has been increasingly shown to play a major role in health. For example, imbalances in gut microbiota appear to be associated with Type II diabetes mellitus (T2DM) and cardiovascular disease. Coronary artery disease (CAD) is a major determinant of the long-term prognosis among T2DM patients, with a 2- to 4-fold increased mortality risk when present. However, the exact microbial strains or functions implicated in disease need further investigation. From a large study with 523 participants (185 healthy controls, 186 T2DM patients without CAD, and 106 T2DM patients with CAD), 3 samples from each patient group were selected for long read sequencing. Each sample was prepared and sequenced on one Sequel II System SMRT Cell, to assess whether long accurate PacBio HiFi reads could yield additional insights to those made using short reads. Each of the 9 samples was subject to metagenomic assembly and binning, taxonomic classification and functional profiling. Results from metagenomic assembly and binning show that it is possible to generate a significant number of complete MAGs (Metagenome Assembled Genomes) from each sample, with over half of the high-quality MAGs being represented by a single circular contig. We show that differences found in taxonomic and functional profiles of healthy versus diabetic patients in the small 9-sample study align with the results of the larger study, as well as with results reported in literature. For example, the abundances of beneficial short-chain fatty acid (SCFA) producers such as Phascolarctobacterium faecium and Faecalibacterium prausnitzii were decreased in T2DM gut microbiota in both studies, while the abundances of quinol and quinone biosynthesis pathways were increased as compared to healthy controls. In conclusion, metagenomic analysis of long accurate HiFi reads revealed important taxonomic and functional differences in T2DM versus healthy gut microbiota.
Furthermore, metagenome assembly of long HiFi reads led to the recovery of many complete MAGs and a significant number of complete circular bacterial chromosome sequences.


June 1, 2021  |  

Comprehensive variant detection in a human genome with highly accurate long reads

Introduction: Long-read sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw-read error rates (10-15%), it has remained difficult to call small variants from long reads. Recent improvements in library preparation and sequencing chemistry have increased length, accuracy, and throughput of PacBio circular consensus sequencing (CCS) reads, resulting in 15-20kb reads with average read quality above 99%. Materials and Methods: We sequenced a library from human reference sample HG002 to 18-fold coverage on the PacBio Sequel II with two SMRT Cells 8M. The CCS algorithm was used to generate highly accurate (average 99.9%) 12.9kb reads, which were mapped to the hg19 reference with pbmm2. We detected small variants using Google DeepVariant with a model trained for CCS and phased the variants using WhatsHap. Structural variants were detected with pbsv. Variant calls were evaluated against Genome in a Bottle (GIAB) benchmarks. Results: With these reads, DeepVariant achieves SNP and Indel F1 scores of 99.70% and 96.59% against the GIAB truth set, and pbsv achieves 97.72% recall on structural variants longer than 50bp. Using WhatsHap, small variants were phased into haplotype blocks with 145kb N50. The improved mappability of long reads allows us to align to and detect variants in medically relevant genes such as CYP2D6 and PMS2 that have proven “difficult-to-map” with short reads. Conclusions: These highly accurate long reads combine the mappability and ability to detect structural variants of long reads with the accuracy and ability to detect small variants of short reads.
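
The F1 scores reported above are the harmonic mean of precision and recall, computed from counts of true-positive, false-positive, and false-negative variant calls against the benchmark. A minimal sketch of that calculation follows; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from benchmark comparison counts."""
    if true_pos <= 0:
        raise ValueError("at least one true positive is required")
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, a high F1 requires both precision and recall to be high.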


June 1, 2021  |  

A workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads

PacBio HiFi reads (minimum 99% accuracy, 15-25 kb read length) have emerged as a powerful data type for comprehensive variant detection in human genomes. The HiFi read length extends confident mapping and variant calling to repetitive regions of the genome that are not accessible with short reads. Read length also improves detection of structural variants (SVs), with recall exceeding that of short reads by over 30%. High read quality allows for accurate single nucleotide variant and small indel detection, with precision and recall matching that of short reads. While many tools have been developed to take advantage of these qualities of HiFi reads, there is no end-to-end workflow for the filtering and prioritization of variants uniquely detected with long reads for rare and undiagnosed disease research. We have developed a flexible, modular workflow and web portal for variant analysis from HiFi reads and applied it to a set of rare disease cases unsolved by short-read whole genome sequencing. We expect that broad application of long-read variant detection workflows will solve many more rare disease cases. We have made these tools available at https://github.com/williamrowell/pbRUGD-workflow, and we hope they serve as a starting point for developing a robust analysis framework for long read variant detection for rare diseases.


June 1, 2021  |  

Amplification-free targeted enrichment powered by CRISPR-Cas9 and long-read Single Molecule Real-Time (SMRT) Sequencing can efficiently and accurately sequence challenging repeat expansion disorders

Genomic regions with extreme base composition bias and repetitive sequences have long proven challenging for targeted enrichment methods, as they rely upon some form of amplification. Similarly, most DNA sequencing technologies struggle to faithfully sequence regions of low complexity. This has been especially trying for repeat expansion disorders such as Fragile-X disease, Huntington disease and various Ataxias, where the repetitive elements range from several hundreds of bases to tens of kilobases. We have developed a robust, amplification-free targeted enrichment technique, called No-Amp Targeted Sequencing, that employs the CRISPR-Cas9 system. In conjunction with SMRT Sequencing, which delivers long reads spanning the entire repeat expansion, high consensus accuracy, and uniform coverage, these previously inaccessible regions are now accessible. This method is completely amplification-free, therefore removing any PCR errors and biases from the experiment. Furthermore, this technique also preserves native DNA molecules, allowing for direct detection and characterization of epigenetic signatures. The No-Amp method is a two-day protocol that is compatible with multiplexing of multiple targets and multiple samples in a single reaction, using as little as 1 µg of genomic DNA input per sample. We have successfully targeted a number of repeat expansion disorder loci including HTT, FMR1, and C9orf72, and have built an Ataxia panel which consists of 15 different disease-causing repeat expansion regions. Using the No-Amp method we have isolated hundreds of individual on-target molecules, allowing for reliable repeat size estimation, mosaicism detection and identification of interruption sequences with alleles longer than 2700 repeat units (>13 kb). In addition to multiplexing several targets, we have also multiplexed at least 20 samples in one experiment, making the No-Amp Targeted Sequencing method a cost-effective option.
Combining the CRISPR-Cas9 enrichment method with Single Molecule, Real-Time Sequencing provided us with base-level resolution of previously inaccessible regions of the genome, like disease-causing repeat expansions. No-Amp Targeted Sequencing captures, in one experiment, many aspects of repeat expansion disorders which are important for better understanding the underlying disease mechanisms.

