With Single Molecule, Real-Time (SMRT) Sequencing and the Sequel System, you can easily and cost effectively generate highly accurate long reads (HiFi reads, >99% single-molecule accuracy) from genes or regions of interest ranging in size from several hundred base pairs to 20 kb. Target all types of variation across relevant genomic regions, including low complexity regions like repeat expansions, promoters, and flanking regions of transposable elements.
To comprehensively detect large variants in human genomes, we have extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a typical human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH and short reads. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies.
Introduction: Long-read sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw-read error rates (10-15%), it has remained difficult to call small variants from long reads. Recent improvements in library preparation and sequencing chemistry have increased length, accuracy, and throughput of PacBio circular consensus sequencing (CCS) reads, resulting in 15-20kb reads with average read quality above 99%. Materials and Methods: We sequenced a library from human reference sample HG002 to 18-fold coverage on the PacBio Sequel II with two SMRT Cells 8M. The CCS algorithm was used to generate highly accurate (average 99.9%) 12.9kb reads, which were mapped to the hg19 reference with pbmm2. We detected small variants using Google DeepVariant with a model trained for CCS and phased the variants using WhatsHap. Structural variants were detected with pbsv. Variant calls were evaluated against Genome in a Bottle (GIAB) benchmarks. Results: With these reads, DeepVariant achieves SNP and Indel F1 scores of 99.70% and 96.59% against the GIAB truth set, and pbsv achieves 97.72% recall on structural variants longer than 50bp. Using WhatsHap, small variants were phased into haplotype blocks with 145kb N50. The improved mappability of long reads allows us to align to and detect variants in medically relevant genes such as CYP2D6 and PMS2 that have proven “difficult-to-map” with short reads. Conclusions: These highly accurate long reads combine the mappability and ability to detect structural variants of long reads with the accuracy and ability to detect small variants of short reads.
A workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
PacBio HiFi reads (minimum 99% accuracy, 15-25 kb read length) have emerged as a powerful data type for comprehensive variant detection in human genomes. The HiFi read length extends confident mapping and variant calling to repetitive regions of the genome that are not accessible with short reads. Read length also improves detection of structural variants (SVs), with recall exceeding that of short reads by over 30%. High read quality allows for accurate single nucleotide variant and small indel detection, with precision and recall matching that of short reads. While many tools have been developed to take advantage of these qualities of HiFi reads, there is no end-to-end workflow for the filtering and prioritization of variants uniquely detected with long reads for rare and undiagnosed disease research. We have developed a flexible, modular workflow and web portal for variant analysis from HiFi reads and applied it to a set of rare disease cases unsolved by short-read whole genome sequencing. We expect that broad application of long-read variant detection workflows will solve many more rare disease cases. We have made these tools available at https://github.com/williamrowell/pbRUGD-workflow, and we hope they serve a starting point for developing a robust analysis framework for long read variant detection for rare diseases.
Deciphering the genetic basis of human disease requires a comprehensive knowledge of genetic variants irrespective of their class or frequency. Although an impressive number of human genetic variants have been catalogued, a large fraction of the genetic difference that distinguishes two human genomes is still not understood at the base-pair level. This is because the emphasis has been on single-nucleotide variation as opposed to less tractable and more complex genetic variants, including indels and structural variants. The latter, we propose, will have a large impact on human phenotypes but require a more systematic assessment of genomes at deeper coverage and alternate sequencing and mapping technologies. Copyright © 2016 by the Genetics Society of America.
Whole-genome sequencing identifies emergence of a quinolone resistance mutation in a case of Stenotrophomonas maltophilia bacteremia.
Whole-genome sequences for Stenotrophomonas maltophilia serial isolates from a bacteremic patient before and after development of levofloxacin resistance were assembled de novo and differed by one single-nucleotide variant in smeT, a repressor for multidrug efflux operon smeDEF. Along with sequenced isolates from five contemporaneous cases, they displayed considerable diversity compared against all published complete genomes. Whole-genome sequencing and complete assembly can conclusively identify resistance mechanisms emerging in S. maltophilia strains during clinical therapy. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
Until recently, most population-scale genome sequencing studies have focused on identifying single nucleotide variants (SNVs) to explore genetic differences between individuals. Like so many SNV-based genome-wide association studies, however, these efforts have had difficulty identifying causative genetic mechanisms underlying most complex functions. More and more, the genomics community has realised that structural variation is likely responsible for many of the traits and phenotypes that scientists have not been able to attribute to SNVs. This class of variants, defined as genetic differences of 50 bp or larger, accounts for most of the DNA sequence differences between any two people. Structural variants (SVs) are also already known to cause many common and rare diseases including ALS, schizophrenia, leukemia, Carney complex, and Huntington’s disease. Despite the importance of SVs, these larger variants have been understudied and underreported compared to their single-nucleotide counterparts. One reason is that they remain difficult to detect. Their length often means they cannot be fully spanned using short sequencing reads. They also often occur in highly repetitive or GC-rich regions of the genome, making them challenging targets. As such, this class of human genetic variation has remained vastly under-explored in global populations and is now ripe for discovery.
Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants
As a result of a high rate of mutations and recombination events, an RNA-virus exists as a heterogeneous “swarm” of mutant variants. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, high error rate limits the ability to reconstruct heterogeneous viral population composed of rare, related mutant variants. In this paper, we present 2SNV, a method able to tolerate the high error-rate of the single-molecule protocol and reconstruct mutant variants. 2SNV uses linkage between single nucleotide variations to efficiently distinguish them from read errors. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. Our method is able to accurately reconstruct clone with frequency of 0.2 % and distinguish clones that differed in only two nucleotides distantly located on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction. The open source implementation of 2SNV is freely available for download at http://?alan.?cs.?gsu.?edu/?NGS/???q=?content/?2snv.