Structural variation accounts for much of the variation among human genomes. Structural variants of all types are known to cause Mendelian disease and contribute to complex disease. Learn how long-read sequencing is enabling detection of the full spectrum of structural variants to advance the study of human disease, evolution and genetic diversity.
Explore how long-read sequencing helps solve rare and Mendelian diseases.
Despite amazing progress over the past quarter century in the technology to detect genetic variants, intermediate-sized structural variants (50 bp to 50 kb) have remained difficult to identify. Such variants are too small to detect with array comparative genomic hybridization, but too large to reliably discover with short-read DNA sequencing. Recent de novo assemblies of human genomes have demonstrated the power of PacBio Single Molecule, Real-Time (SMRT) Sequencing to fill this technology gap and sensitively identify structural variants in the human genome. While de novo assembly is the ideal method to identify variants in a genome, it requires high depth of coverage. A structural variant discovery approach that utilizes lower coverage would facilitate evaluation of large patient and population cohorts. Here we introduce such an approach and apply it to 10-fold coverage of several human genomes generated on the PacBio Sequel System. To identify structural variants in low-fold coverage whole genome sequencing data, we apply a reference-based, re-sequencing workflow. First, reads are mapped to the human reference genome with a local aligner. The local alignments often end at structural variant loci. To connect co-linear local alignments across structural variants, we apply a novel algorithm that merges alignments into “chains” and refines the alignment edges. Then, the chained alignments are scanned for windows with an excess of insertions or deletions to identify candidate structural variant loci. Finally, the read support at each putative variant locus is evaluated to produce a variant call. Single nucleotide information is incorporated to phase and evaluate the zygosity of each structural variant. In 10-fold coverage human genome sequence, we identify the vast majority of the structural variants found by de novo assembly, thus demonstrating the power of low-fold coverage SMRT Sequencing to affordably and effectively detect structural variants.
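The window-scanning step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `candidate_windows`, the fixed 100 bp window, the 50 bp threshold, and the `(position, length)` event format are all assumptions chosen for illustration, and the example events are made up.

```python
from collections import defaultdict


def candidate_windows(indels, window=100, min_indel_bases=50):
    """Flag fixed-size reference windows with an excess of indel bases.

    indels: iterable of (ref_pos, length) events taken from one chained
    alignment; negative lengths denote deletions, positive insertions.
    Returns the sorted start coordinates of windows whose summed absolute
    indel length reaches min_indel_bases (a hypothetical threshold).
    """
    totals = defaultdict(int)
    for pos, length in indels:
        window_start = (pos // window) * window
        totals[window_start] += abs(length)
    return sorted(start for start, total in totals.items()
                  if total >= min_indel_bases)


# Made-up events: small alignment noise plus two larger indel signals.
events = [(120, 3), (150, 60), (410, 2), (980, -45), (990, -10)]
hits = candidate_windows(events)
print(hits)
```

Each flagged window would then be passed to the read-support evaluation that the abstract describes as the final step.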
Though a role for structural variants in human disease has long been recognized, it has remained difficult to identify intermediate-sized variants (50 bp to 5 kb), which are too small to detect with array comparative genomic hybridization, but too large to reliably discover with short-read DNA sequencing. Recent studies have demonstrated that PacBio Single Molecule, Real-Time (SMRT) sequencing fills this technology gap. SMRT sequencing detects tens of thousands of structural variants in the human genome, with approximately five times the sensitivity of short-read DNA sequencing.
Most of the base pairs that differ between two human genomes are in intermediate-sized structural variants (50 bp to 5 kb), which are too small to detect with array comparative genomic hybridization or optical mapping but too large to reliably discover with short-read DNA sequencing. Long-read sequencing with PacBio Single Molecule, Real-Time (SMRT) Sequencing platforms fills this technology gap. PacBio SMRT Sequencing detects tens of thousands of structural variants in a human genome with approximately five times the sensitivity of short-read DNA sequencing. Effective application of PacBio SMRT Sequencing to detect structural variants requires quality bioinformatics tools that account for the characteristics of PacBio reads. To provide such a solution, we developed pbsv, a structural variant caller for PacBio reads that works as a chain of simple stages: 1) map reads to the reference genome, 2) identify reads with signatures of structural variation, 3) cluster nearby reads with similar signatures, 4) summarize each cluster into a consensus variant, and 5) filter for variants with sufficient read support. To evaluate the baseline performance of pbsv, we generated high coverage of a diploid human genome on the PacBio Sequel System, established a target set of structural variants, and then titrated to lower coverage levels. The false discovery rate for pbsv is low at all coverage levels. Sensitivity is high even at modest coverage: above 85% at 10-fold coverage and above 95% at 20-fold coverage. To assess the potential for PacBio SMRT Sequencing to identify pathogenic variants, we evaluated an individual with clinical symptoms suggestive of Carney complex for whom short-read whole genome sequencing was uninformative. The individual was sequenced to 9-fold coverage on the PacBio Sequel System, and structural variants were called with pbsv. 
Filtering for rare, genic structural variants left six candidates, including a heterozygous 2,184 bp deletion that removes the first coding exon of PRKAR1A. Null mutations in PRKAR1A cause autosomal dominant Carney complex, type 1. The variant was determined to be de novo, and it was classified as likely pathogenic based on ACMG standards and guidelines for variant interpretation. These case studies demonstrate the ability of pbsv to detect structural variants in low-coverage PacBio SMRT Sequencing and suggest the importance of considering structural variants in any study of human genetic variation.
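Stages 3 to 5 of the pipeline outlined in this abstract (cluster, summarize, filter) can be sketched as follows. This is an illustrative simplification and not pbsv's implementation: the function `call_variants`, the 100 bp clustering gap, the use of medians as the consensus, and the minimum support of two reads are all assumptions, and the signatures are invented.

```python
from statistics import median


def call_variants(signatures, max_gap=100, min_support=2):
    """Cluster per-read SV signatures, summarize each cluster, filter by support.

    signatures: list of (sv_type, position, length), one per supporting read.
    Returns a list of (sv_type, consensus_position, consensus_length, support).
    """
    # Stage 3: group signatures of the same type that lie close together.
    clusters, current = [], []
    for sig in sorted(signatures):
        if current and (sig[0] != current[-1][0]
                        or sig[1] - current[-1][1] > max_gap):
            clusters.append(current)
            current = []
        current.append(sig)
    if current:
        clusters.append(current)

    calls = []
    for cluster in clusters:
        # Stage 4: summarize the cluster (median stands in for a consensus).
        position = median(s[1] for s in cluster)
        length = median(s[2] for s in cluster)
        # Stage 5: keep only variants with enough supporting reads.
        if len(cluster) >= min_support:
            calls.append((cluster[0][0], position, length, len(cluster)))
    return calls


# Invented signatures: three reads agree on one deletion; two are singletons.
reads = [("DEL", 1000, 305), ("DEL", 1010, 300), ("DEL", 995, 310),
         ("INS", 5000, 120), ("DEL", 9000, 75)]
calls = call_variants(reads)
print(calls)
```

A production caller would build the consensus from the read sequences themselves; the median is used here only to keep the sketch short.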
Introduction: Long-read sequencing has revealed more than 20,000 structural variants spanning over 12 Mb in a healthy human genome. Short-read sequencing fails to detect most structural variants but has remained the more effective approach for small variants, due to 10-15% error rates in long reads, and copy-number variants (CNVs), due to lack of effective long-read variant callers. The development of PacBio highly accurate long reads (HiFi reads) with read lengths of 10-25 kb and quality >99% presents the opportunity to capture all classes of variation with one approach. Methods: We sequence the Genome in a Bottle benchmark sample HG002 and an individual with a presumed Mendelian disease with HiFi reads. We call SNVs and indels with DeepVariant and extend the structural variant caller pbsv to call CNVs using read depth and clipping signatures. Results: For 18-fold coverage with 13 kb HiFi reads, variant calling in HG002 achieves an F1 score of 99.7% for SNVs, 96.6% for indels, and 96.4% for structural variants. Additionally, we detect more than 300 CNVs spanning around 10 Mb. For the Mendelian disease case, HiFi reads reveal thousands of variants that were overlooked by short-read sequencing, including a candidate causative structural variant. Conclusions: These results illustrate the ability of HiFi reads to comprehensively detect variants, including those associated with human disease.
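The F1 score reported above is the harmonic mean of precision and recall against a benchmark call set. A minimal sketch of that arithmetic follows; the function name `f1_score` and the counts are hypothetical and are not the HG002 results.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for a variant call set."""
    if true_pos == 0:
        return 0.0  # no correct calls: define F1 as zero to avoid dividing by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 calls match the benchmark, 10 are spurious, 10 are missed.
score = f1_score(true_pos=90, false_pos=10, false_neg=10)
print(round(score, 3))
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two rates, so a caller cannot score well by trading away either precision or recall.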
In this podcast, Gibbs shares his perspective on the complementary roles genomics and genetics play in driving our understanding of human biology. Richard says that the Human Genome Project was…
Jay Shendure, a professor in the Department of Genome Sciences at the University of Washington School of Medicine, explores the role of exome sequencing in clinical genomics. In this podcast…
In this presentation, Naomichi Matsumoto from Yokohama City University speaks about the use of SMRT Sequencing to solve Mendelian diseases, including the story of how his lab discovered a 12.4…
Dr. Wenger gives attendees an update on PacBio’s long-read sequencing and variant detection capabilities on the Sequel II System and shares recommendations on how to design your own study using…
During the past decade, the search for pathogenic mutations in rare human genetic diseases has involved huge efforts to sequence coding regions, or the entire genome, using massively parallel short-read sequencers. However, the approximate current diagnostic rate is <50% using these approaches, and there remain many rare genetic diseases with unknown cause. There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome that are difficult to sequence using conventional technologies (e.g., tandem-repeat expansion or complex chromosomal structural aberrations). Despite the drawbacks of high cost and a shortage of standard analytical methods, several studies have analyzed pathogenic changes in the genome using long-read sequencers. The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases. Such approaches may also be applied to molecular diagnosis and therapeutic strategies for patients with genetic diseases in the future.
CRISPR/Cas9-mediated scanning for regulatory elements required for HPRT1 expression via thousands of large, programmed genomic deletions.
The extent to which non-coding mutations contribute to Mendelian disease is a major unknown in human genetics. Relatedly, the vast majority of candidate regulatory elements have yet to be functionally validated. Here, we describe a CRISPR-based system that uses pairs of guide RNAs (gRNAs) to program thousands of kilobase-scale deletions that deeply scan across a targeted region in a tiling fashion (“ScanDel”). We applied ScanDel to HPRT1, the housekeeping gene underlying Lesch-Nyhan syndrome, an X-linked recessive disorder. Altogether, we programmed 4,342 overlapping 1 and 2 kb deletions that tiled 206 kb centered on HPRT1 (including 87 kb upstream and 79 kb downstream) with median 27-fold redundancy per base. We functionally assayed programmed deletions in parallel by selecting for loss of HPRT function with 6-thioguanine. As expected, sequencing gRNA pairs before and after selection confirmed that all HPRT1 exons are needed. However, HPRT1 function was robust to deletion of any intergenic or deeply intronic non-coding region, indicating that proximal regulatory sequences are sufficient for HPRT1 expression. Although our screen did identify the disruption of exon-proximal non-coding sequences (e.g., the promoter) as functionally consequential, long-read sequencing revealed that this signal was driven by rare, imprecise deletions that extended into exons. Our results suggest that no singular distal regulatory element is required for HPRT1 expression and that distal mutations are unlikely to contribute substantially to Lesch-Nyhan syndrome burden. Further application of ScanDel could shed light on the role of regulatory mutations in disease at other loci while also facilitating a deeper understanding of endogenous gene regulation. Copyright © 2017 American Society of Human Genetics. All rights reserved.
Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and also structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping Variable Number Tandem Repeats (VNTRs), composed of inexact tandem duplications of short (6-100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets. © 2018 Bakhtiari et al.; Published by Cold Spring Harbor Laboratory Press.
Purpose: Current clinical genomics assays primarily utilize short-read sequencing (SRS), but SRS has limited ability to evaluate repetitive regions and structural variants. Long-read sequencing (LRS) has complementary strengths, and we aimed to determine whether LRS could offer a means to identify overlooked genetic variation in patients undiagnosed by SRS. Methods: We performed low-coverage genome LRS to identify structural variants in a patient who presented with multiple neoplasia and cardiac myxomata, in whom the results of targeted clinical testing and genome SRS were negative. Results: This LRS approach yielded 6,971 deletions and 6,821 insertions >50 bp. Filtering for variants that are absent in an unrelated control and overlap a disease gene coding exon identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. RNA sequencing demonstrated decreased PRKAR1A expression. The deletion was classified as pathogenic based on guidelines for interpretation of sequence variants. Conclusion: This first successful application of genome LRS to identify a pathogenic variant in a patient suggests that LRS has significant potential for the identification of disease-causing structural variation. Larger studies will ultimately be required to evaluate the potential clinical utility of LRS.
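The filtering step in the Results (keep variants that are absent from an unrelated control and that overlap a disease gene coding exon) can be sketched as below. This is a simplified illustration, not the study's pipeline: the names `overlaps` and `filter_candidates`, the tuple layouts, the exact-match comparison against the control, and all coordinates and gene labels are assumptions. Real workflows typically match structural variants approximately, allowing for breakpoint differences.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True when two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a


def filter_candidates(variants, control_variants, coding_exons):
    """Keep variants absent from the control that overlap a coding exon.

    variants, control_variants: lists of (chrom, start, end, sv_type).
    coding_exons: list of (chrom, start, end, gene).
    Returns a list of (variant, gene) pairs.
    """
    control = set(control_variants)  # exact matching is a simplifying assumption
    kept = []
    for variant in variants:
        if variant in control:
            continue
        chrom, start, end, _sv_type = variant
        for exon_chrom, exon_start, exon_end, gene in coding_exons:
            if chrom == exon_chrom and overlaps(start, end, exon_start, exon_end):
                kept.append((variant, gene))
                break
    return kept


# Invented coordinates and placeholder gene names for illustration only.
patient = [("chr1", 1000, 3000, "DEL"),
           ("chr1", 50000, 50200, "DEL"),
           ("chr2", 700, 900, "INS")]
control = [("chr2", 700, 900, "INS")]
exons = [("chr1", 2500, 2700, "GENE_A"), ("chr2", 650, 800, "GENE_B")]
candidates = filter_candidates(patient, control, exons)
print(candidates)
```

In this toy input, only the first deletion survives: the second overlaps no exon, and the insertion is also present in the control.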
There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery – including DNA-sequencing technologies and analysis algorithms – need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized that will allow its targeting with much greater therapeutic precision.