AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technolgy.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed by the coverage biases associated with short- read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obsolete the need for assembly. With direct sequencing of DNA in real-time, this has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
Single Molecule, Real-Time (SMRT) Sequencing provides efficient, streamlined solutions to address new frontiers in plant genomes and transcriptomes. Inherent challenges presented by highly repetitive, low-complexity regions and duplication events are directly addressed with multi- kilobase read lengths exceeding 8.5 kb on average, with many exceeding 20 kb. Differentiating between transcript isoforms that are difficult to resolve with short-read technologies is also now possible. We present solutions available for both reference genome and transcriptome research that best leverage long reads in several plant projects including algae, Arabidopsis, rice, and spinach using only the PacBio platform. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. We will share highlights from our genome projects using the latest P5- C3 chemistry to generate high-quality reference genomes with the highest contiguity, contig N50 exceeding 1 Mb, and average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq protocol will be presented for full transcriptome characterization and targeted surveys of genes with complex structures. PacBio provides the most comprehensive assembly with annotation when combining offerings for both genome and transcriptome research efforts. For more focused investigation, PacBio also offers researchers opportunities to easily investigate and survey genes with complex structures.
To comprehensively detect large variants in human genomes, we have extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a typical human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH and short reads. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies.
Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a healthy human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies. To comprehensively detect large variants in human genomes, we extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Together with insertion, deletion, inversion, duplication, and translocation calling from spanning reads, this allows pbsv to comprehensively detect large variants from a single data type.
Introduction: Long-read sequencing has revealed more than 20,000 structural variants spanning over 12 Mb in a healthy human genome. Short-read sequencing fails to detect most structural variants but has remained the more effective approach for small variants, due to 10-15% error rates in long reads, and copy-number variants (CNVs), due to lack of effective long-read variant callers. The development of PacBio highly accurate long reads (HiFi reads) with read lengths of 10-25 kb and quality >99% presents the opportunity to capture all classes of variation with one approach.Methods: We sequence the Genome in a Bottle benchmark sample HG002 and an individual with a presumed Mendelian disease with HiFi reads. We call SNVs and indels with DeepVariant and extend the structural variant caller pbsv to call CNVs using read depth and clipping signatures. Results: For 18-fold coverage with 13 kb HiFi reads, variant calling in HG002 achieves an F1 score of 99.7% for SNVs, 96.6% for indels, and 96.4% for structural variants. Additionally, we detect more than 300 CNVs spanning around 10 Mb. For the Mendelian disease case, HiFi reads reveal thousands of variants that were overlooked by short-read sequencing, including a candidate causative structural variant. Conclusions: These results illustrate the ability of HiFi reads to comprehensively detect variants, including those associated with human disease.
ASHG PacBio Workshop: Characterization of a large, human-specific tandem repeat array associated with bipolar disorder and schizophrenia
In this ASHG workshop presentation, Janet Song of Stanford School of Medicine shared research on resolving a tandem repeat array implicated in bipolar disorder and schizophrenia. These psychiatric diseases share…
In this ASHG workshop presentation, Stuart Scott of the Icahn School of Medicine at Mount Sinai, presented on using the PacBio system for amplicon sequencing in pharmacogenomics and clinical genomics…
Studying microbial genomics and infectious disease? Learn how the PacBio Sequel II System can help advance your research, with first-hand perspectives from scientists who are investigating SARS-CoV-2 and COVID-19. In…
Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes
As they migrated out of Africa and into Europe and Asia, anatomically modern humans interbred with archaic hominins, such as Neanderthals and Denisovans. The result of this genetic introgression on the recipient populations has been of considerable interest, especially in cases of selection for specific archaic genetic variants. Hsieh et al. characterized adaptive structural variants and copy number variants that are likely targets of positive selection in Melanesians. Focusing on population-specific regions of the genome that carry duplicated genes and show an excess of amino acid replacements provides evidence for one of the mechanisms by which genetic novelty can arise and result in differentiation between human genomes.Science, this issue p. eaax2083INTRODUCTIONCharacterizing genetic variants underlying local adaptations in human populations is one of the central goals of evolutionary research. Most studies have focused on adaptive single-nucleotide variants that either arose as new beneficial mutations or were introduced after interbreeding with our now-extinct relatives, including Neanderthals and Denisovans. The adaptive role of copy number variants (CNVs), another well-known form of genomic variation generated through deletions or duplications that affect more base pairs in the genome, is less well understood, despite evidence that such mutations are subject to stronger selective pressures.RATIONALEThis study focuses on the discovery of introgressed and adaptive CNVs that have become enriched in specific human populations. We combine whole-genome CNV calling and population genetic inference methods to discover CNVs and then assess signals of selection after controlling for demographic history. We examine 266 publicly available modern human genomes from the Simons Genome Diversity Project and genomes of three ancient homininstextemdasha Denisovan, a Neanderthal from the Altai Mountains in Siberia, and a Neanderthal from Croatia. We apply long-read sequencing methods to sequence-resolve complex CNVs of interest specifically in the Melanesianstextemdashan Oceanian population distributed from Papua New Guinea to as far east as the islands of Fiji and known to harbor some of the greatest amounts of Neanderthal and Denisovan ancestry.RESULTSConsistent with the hypothesis of archaic introgression outside Africa, we find a significant excess of CNV sharing between modern non-African populations and archaic hominins (P = 0.039). Among Melanesians, we observe an enrichment of CNVs with potential signals of positive selection (n = 37 CNVs), of which 19 CNVs likely introgressed from archaic hominins. We show that Melanesian-stratified CNVs are significantly associated with signals of positive selection (P = 0.0323). Many map near or within genes associated with metabolism (e.g., ACOT1 and ACOT2), development and cell cycle or signaling (e.g., TNFRSF10D and CDK11A and CDK11B), or immune response (e.g., IFNLR1). We characterize two of the largest and most complex CNVs on chromosomes 16p11.2 and 8p21.3 that introgressed from Denisovans and Neanderthals, respectively, and are absent from most other human populations. At chromosome 16p11.2, we sequence-resolve a large duplication of >383 thousand base pairs (kbp) that originated from Denisovans and introgressed into the ancestral Melanesian population 60,000 to 170,000 years ago. This large duplication occurs at high frequency (>79%) in diverse Melanesian groups, shows signatures of positive selection, and maps adjacent to Homo sapienstextendashspecific duplications that predispose to rearrangements associated with autism. On chromosome 8p21.3, we identify a Melanesian haplotype that carries two CNVs, a ~6-kbp deletion, and a ~38-kbp duplication, with a Neanderthal origin and that introgressed into non-Africans 40,000 to 120,000 years ago. This CNV haplotype occurs at high frequency (44%) and shows signals consistent with a partial selective sweep in Melanesians. Using long-read sequencing genomic and transcriptomic data, we reconstruct the structure and complex evolutionary history for these two CNVs and discover previously undescribed duplicated genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that show an excess of amino acid replacements consistent with the action of positive selection.CONCLUSIONOur results suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation that is absent from current reference genomes.Large adaptive-introgressed CNVs at chromosomes 8p21.3 and 16p11.2 in Melanesians.The magnifying glasses highlight structural differences between the archaic (top) and reference (bottom) genomes. Neanderthal (red) and Denisovan (blue) haplotypes encompassing large CNVs occur at high frequencies in Melanesians (44 and 79%, respectively) but are absent (black) in all non-Melanesians. These CNVs create positively selected genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that are absent from the reference genome.Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA ) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. Copyright © 2018 Elsevier Inc. All rights reserved.
Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions, likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve calling accuracy of more simple variants.We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods to our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, using a 30× coverage TARDIS achieved 96% sensitivity with only 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate of our approach for discovery of tandem, direct and inverted interspersed segmental duplications prediction on CHM1 (<5% for the top 50 predictions).TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/.Supplementary data are available at Bioinformatics online. © The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: firstname.lastname@example.org.
Many of our major crop species are polyploids, containing more than one genome or set of chromosomes. Polyploid crops present unique challenges, including difficulties in genome assembly, in discriminating between multiple gene and sequence copies, and in genetic mapping, hindering use of genomic data for genetics and breeding. Polyploid genomes may also be more prone to containing structural variation, such as loss of gene copies or sequences (presence–absence variation) and the presence of genes or sequences in multiple copies (copy-number variation). Although the two main types of genomic structural variation commonly identified are presence–absence variation and copy-number variation, we propose that homeologous exchanges constitute a third major form of genomic structural variation in polyploids. Homeologous exchanges involve the replacement of one genomic segment by a similar copy from another genome or ancestrally duplicated region, and are known to be extremely common in polyploids. Detecting all kinds of genomic structural variation is challenging, but recent advances such as optical mapping and long-read sequencing offer potential strategies to help identify structural variants even in complex polyploid genomes. All three major types of genomic structural variation (presence–absence, copy-number, and homeologous exchange) are now known to influence phenotypes in crop plants, with examples of flowering time, frost tolerance, and adaptive and agronomic traits. In this review, we summarize the challenges of genome analysis in polyploid crops, describe the various types of genomic structural variation and the genomics technologies and data that can be used to detect them, and collate information produced to date related to the impact of genomic structural variation on crop phenotypes. We highlight the importance of genomic structural variation for the future genetic improvement of polyploid crops.
Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease.
Current diagnostic testing for genetic disorders involves serial use of specialized assays spanning multiple technologies. In principle, genome sequencing (GS) can detect all genomic pathogenic variant types on a single platform. Here we evaluate copy-number variant (CNV) calling as part of a clinically accredited GS test.We performed analytical validation of CNV calling on 17 reference samples, compared the sensitivity of GS-based variants with those from a clinical microarray, and set a bound on precision using orthogonal technologies. We developed a protocol for family-based analysis of GS-based CNV calls, and deployed this across a clinical cohort of 79 rare and undiagnosed cases.We found that CNV calls from GS are at least as sensitive as those from microarrays, while only creating a modest increase in the number of variants interpreted (~10 CNVs per case). We identified clinically significant CNVs in 15% of the first 79 cases analyzed, all of which were confirmed by an orthogonal approach. The pipeline also enabled discovery of a uniparental disomy (UPD) and a 50% mosaic trisomy 14. Directed analysis of select CNVs enabled breakpoint level resolution of genomic rearrangements and phasing of de novo CNVs.Robust identification of CNVs by GS is possible within a clinical testing environment.