Background: The sequencing and haplotype phasing of entire gene sequences improves the understanding of the genetic basis of disease and drug response. One example is cystic fibrosis (CF). Cystic fibrosis transmembrane conductance regulator (CFTR) modulator therapies have revolutionized CF treatment, but only in a minority of CF subjects. Observed heterogeneity in CFTR modulator efficacy is related to the range of CFTR mutations; revertant mutations can modify the response to CFTR modulators, and other intronic variations in the ~200 kb CFTR gene have been linked to disease severity. Heterogeneity in the CFTR gene may also be linked to differential responses to…
NGS is commonly used for amplicon sequencing in clinical applications to study genetic disorders and detect disease-causing mutations. This approach can be plagued by a limited ability to phase sequence variants, and interpretation of the sequence data is difficult when pseudogenes are present. Highly accurate long-read amplicon sequencing can provide very accurate, efficient, high-throughput (through multiplexing) sequences from single molecules, with read lengths largely limited by PCR. Data is easy to interpret; phased variants and breakpoints are present within high-fidelity individual reads. Here we show SMRT Sequencing of the PMS2 and OPN1 (MW and LW) genes using the Sequel System.…
The PacBio Iso-Seq method produces high-quality, full-length transcripts and can characterize a whole transcriptome with a single SMRT Cell 8M. We sequenced an Alzheimer's disease whole-brain sample on a single SMRT Cell 8M on the Sequel II System. Using the Iso-Seq bioinformatics pipeline followed by SQANTI2 analysis, we detected 162,290 transcripts for 17,670 genes up to 14 kb in length. More than 60% of the transcripts are novel isoforms, the vast majority of which have supporting CAGE peak data and polyadenylation signals, demonstrating the utility of long-read sequencing for human disease research.
De novo assemblies of human genomes from accurate (85-90%), continuous long reads (CLR) now approach the human reference genome in contiguity, but the assembly base pair accuracy is typically below QV40 (99.99%), an order-of-magnitude lower than the standard for finished references. The base pair errors complicate downstream interpretation, particularly false positive indels that lead to false gene loss through frameshifts. PacBio HiFi sequence data, which are both long (>10 kb) and very accurate (>99.9%) at the individual sequence read level, enable a new paradigm in human genome assembly. Haploid human assemblies using HiFi data achieve similar contiguity to those using…
To comprehensively detect large variants in human genomes, we have extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a typical human genome. Most of these variants are too large to detect with short reads and too small for array…
Introduction: Long-read PacBio SMRT Sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw read error rates of 10-15%, it has remained difficult to call small variants from long reads. Recent improvements in library preparation, sequencing chemistry, and instrument yield have increased the length, accuracy, and throughput of PacBio Circular Consensus Sequencing (CCS) reads, resulting in 10-20 kb “HiFi” reads with mean read quality above 99%. Materials and Methods: We sequenced 11 kb size-selected libraries from the Genome in a Bottle (GIAB) human reference samples HG001, HG002, and HG005 to approximately 30-fold coverage on the…
Bipolar disorder (BD) is a phenotypically and genetically complex neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin, and adoption studies supporting the involvement of a genetic predisposition, with estimated heritability up to ~80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BD susceptibility; however, no disease genes have yet been identified. Here, we have fully characterized a ~12 Mb significantly linked (LOD score = 3.54) genomic region on chromosome Xq24-q27 in an extended family from…
HiFi reads (>99% accurate, 15-20 kb) from the PacBio Sequel II System consistently provide complete and contiguous genome assemblies. In addition to completeness and contiguity, accuracy is of critical importance, as assembly errors complicate downstream analysis, particularly by disrupting gene frames. Metrics used to assess assembly accuracy include: 1) in-frame gene count, 2) k-mer consistency, and 3) concordance to a benchmark, where discordances are interpreted as assembly errors. Genome in a Bottle (GIAB) provides a benchmark for the human genome with estimated accuracy of 99.9999% (Q60). Concordance for human HiFi assemblies exceeds Q50, which provides excellent genomes for downstream analysis,…
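The quality values quoted in these abstracts (QV40 = 99.99%, Q50, Q60 = 99.9999%) follow the standard Phred scale, Q = -10 log10(error rate). The minimal sketch below shows the conversion in both directions; the function names are hypothetical helpers chosen for illustration and are not part of any PacBio or GIAB tool.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy (0..1)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (0..1, strictly below 1) to a Phred quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors_per_mb(q: float) -> float:
    """Expected number of erroneous bases per megabase at quality q."""
    return 10.0 ** (-q / 10.0) * 1_000_000


if __name__ == "__main__":
    # Quality levels mentioned in the abstracts above.
    for q in (40, 50, 60):
        print(f"Q{q}: accuracy={phred_to_accuracy(q):.6f}, "
              f"errors/Mb={expected_errors_per_mb(q):.1f}")
```

On this scale each step of 10 is a tenfold change in error rate, so Q40 corresponds to roughly one error per 10 kb and Q60 to roughly one error per Mb.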
A high-quality reference genome is an essential resource for primary and applied research across the tree of life. Genome projects for small-bodied, non-model organisms such as insects face several unique challenges, including limited DNA input quantities, high heterozygosity, and difficulty of culturing or inbreeding in the lab. Recent progress in PacBio library preparation protocols, sequencing throughput, and read accuracy addresses these challenges. We present several case studies including the Red Admiral (Vanessa atalanta), Monarch Butterfly (Danaus plexippus), and Anopheles malaria mosquitoes that highlight the benefits of sequencing single individuals for de novo genome assembly projects, and the ease with which…
High-quality insect genomes are essential resources for understanding insect biology and for combating insects as disease vectors and agricultural pests. It is desirable to sequence a single individual for a reference genome to avoid complications from multiple alleles during de novo assembly. However, the small body size of many insects poses a challenge for the use of long-read sequencing technologies, which often have high DNA-input requirements. The previously described PacBio Low DNA Input Protocol starts with ~100 ng of DNA, allows high-quality assemblies of single mosquitoes, among other organisms, and represents a significant step in reducing such requirements. Here,…
Long-read mRNA sequencing methods such as PacBio’s Iso-Seq method offer high-throughput transcriptome profiling in prokaryotic and eukaryotic cells. By avoiding the transcript assembly problem and instead sequencing full-length cDNA, Iso-Seq has emerged as the most reliable technology for annotating isoforms and, in turn, improving proteome predictions in a wide variety of organisms. Improvements in library preparation, sequencing throughput, and bioinformatics have enabled the Iso-Seq method to be a complete solution for transcript characterization. The Iso-Seq Express kit is a one-day library prep requiring 60-300 ng of total RNA. The PacBio Sequel II System produces 4-5 million full-length reads, sufficient to…
The PacBio Iso-Seq method produces high-quality, full-length transcripts of 10 kb and longer and has been used to annotate many important plant and animal genomes. We describe here the full Iso-Seq ecosystem that enables researchers to achieve high-quality genome annotations. The Iso-Seq Express workflow is a 1-day protocol that requires only 60-300 ng of total RNA and supports multiplexing of different tissues. Sequencing on a single SMRT Cell 8M on the Sequel II System produces up to 4 million full-length reads, sufficient to exhaustively characterize a whole transcriptome on the order of 15,000-17,000 genes with 100,000 or more…
Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with one sequencing technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and the lack of GC bias in SMRT Sequencing have enabled better gene finding. However, since long-read assembly typically requires high coverage for error correction, these benefits have in the past been lost for low-abundance species. The introduction of the Sequel II System has enabled a new, higher-throughput, assembly-optional data type that…
Many genetic disorders are associated with repeat sequence expansions. Obtaining accurate DNA sequence information from these regions will help researchers further establish the relationship between these genetic disorders and underlying disease mechanisms. Moreover, repeat interruptions have also been shown to act as phenotypic modifiers in some disorders. Targeted sequencing is an economical way to obtain sequence information from one or more defined regions in a genome. However, most targeted enrichment and sequencing methods require some form of DNA amplification. Amplifying large regions with extreme GC content, as seen in repeat expansion disorders, is challenging and prone to introducing sequence…
The latest advancements in Sequel II SMRT Sequencing have increased average read lengths by up to 50% compared to Sequel II chemistry 1.0, which allows multiplexing of 2-3 small organisms (98% of conserved genes) for both individuals. For microbial multiplexing, we multiplexed 48 microbes with varying complexities and sizes ranging from 1.6 to 8.0 Mb in a single SMRT Cell 8M. Using a new end-to-end analysis (Microbial Assembly Analysis, SMRT Link 8.0), assemblies resulted in complete circularized genomes (>200-fold coverage) and efficient detection of >3-200 kb plasmids. Finally, the long read lengths (>90 kb) allow detection of barcodes in large insert SMRTbell templates (>15 kb)…