2015 SMRT Informatics Developers Conference Presentation Slides: Ali Bashir of Mount Sinai School of Medicine discussed methods for characterizing structural variation in human genomes across a variety of coverage levels.
Making the most of long reads: towards efficient assemblers for reference quality, de novo reconstructions
2015 SMRT Informatics Developers Conference Presentation Slides: Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute delivered the keynote presentation. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.
The Genome in a Bottle Consortium is developing the reference materials, reference methods , and reference data n
ASHG 2015 GRC workshop talk by Tina Graves-Lindsay
Purpose: Clinical laboratories, research laboratories and technology developers all need DNA samples with reliably known genotypes in order to help validate and improve their methods. The Genome in a Bottle Consortium (genomeinabottle.org) has been developing Reference Materials with high-accuracy whole genome sequences to support these efforts.Methodology: Our pilot reference material is based on Coriell sample NA12878 and was released in May 2015 as NIST RM 8398 (tinyurl.com/giabpilot). To minimize bias and improve accuracy, 11 whole-genome and 3 exome data sets produced using 5 different technologies were integrated using a systematic arbitration method . The Genome in a Bottle Analysis Group is adapting these methods and developing new methods to characterize 2 families, one Asian and one Ashkenazi Jewish from the Personal Genome Project, which are consented for public release of sequencing and phenotype data. We have generated a larger and even more diverse data set on these samples, including high-depth Illumina paired-end and mate-pair, Complete Genomics, and Ion Torrent short-read data, as well as Moleculo, 10X, Oxford Nanopore, PacBio, and BioNano Genomics long-read data. We are analyzing these data to provide an accurate assessment of not just small variants but also large structural variants (SVs) in both “easy” regions of the genome and in some “hard” repetitive regions. We have also made all of the input data sources publicly available for download, analysis, and publication.Results: Our arbitration method produced a reference data set of 2,787,291 single nucleotide variants (SNVs), 365,135 indels, 2744 SVs, and 2.2 billion homozygous reference calls for our pilot genome. We found that our call set is highly sensitive and specific in comparison to independent reference data sets. We have also generated preliminary assemblies and structural variant calls for the next 2 trios from long read data and are currently integrating and validating these.Discussion: We combined the strengths of each of our input datasets to develop a comprehensive and accurate benchmark call set. In the short time it has been available, over 20 published or submitted papers have used our data. Many challenges exist in comparing to our benchmark calls, and thus we have worked with the Global Alliance for Genomics and Health to develop standardized methods, performance metrics, and software to assist in its use. Zook et al, Nat Biotech. 2014.
Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
The human reference sequence has provided a foundation for studies of genome structure, human variation, evolutionary biology, and disease. At the time the reference was originally completed there were some loci recalcitrant to closure; however, the degree to which structural variation and diversity affected our ability to produce a representative genome sequence at these loci was still unknown. Many of these regions in the genome are associated with large, repetitive sequences and exhibit complex allelic diversity such producing a single, haploid representation is not possible. To overcome this challenge, we have sequenced DNA from two hydatidiform moles (CHM1 and CHM13), which are essentially haploid. CHM13 was sequenced with the latest PacBio technology (P6-C5) to 52X genome coverage and assembled using Daligner and Falcon v0.2 (GCA_000983455.1, CHM13_1.1). Compared to the first mole (CHM1) PacBio assembly (GCA_001007805.1, 54X) contig N50 of 4.5Mb, the contig N50 of CHM13_1.1 is almost 13Mb, and there is a 13-fold reduction in the number of contigs. This demonstrates the improved contiguity of sequence generated with the new chemistry. We annotated 50,188 RefSeq transcripts of which only 0.63% were split transcripts, and the repetitive and segmental duplication content was within the expected range. These data all indicate an extremely high quality assembly. Additionally, we sequenced CHM13 DNA using Illumina SBS technology to 60X coverage, aligned these reads to the GRCh37, GRCh38, and CHM13_1.1 assemblies and performed variant calling using the SpeedSeq pipeline. The number of single nucleotide variants (SNV) and indels was comparable between GRCh37 and GRCh38. Regions that showed increased SNV density in GRCh38 compared to GRCh37 could be attributed to the addition of centromeric alpha satellite sequence to the reference assembly. Alternatively, regions of decreased SNV density in GRCh38 were concentrated in regions that were improved from BAC based sequencing of CHM1 such as 1p12 and 1q21 containing the SRGAP2 gene family. The alignment of PacBio reads to GRCh37 and GRCh38 assemblies allowed us to resolve complex loci such as the MHC region where the best alignment was to the DBB (A2-B57-DR7) haplotype. Finally, we will discuss how combining the two high quality mole assemblies can be used for benchmarking and novel bioinformatics tool development.
Reference genome assemblies provide important context in genetics by standardizing the order of genes and providing a universal set of coordinates for individual nucleotides. Often due to the high complexity of genic regions and higher copy number of genes involved in immune function, immunity-related genes are often misassembled in current reference assemblies. This problem is particularly ubiquitous in the reference genomes of non-model organisms as they often do not receive the years of curation necessary to resolve annotation and assembly errors. In this study, we reassemble a reference genome of the goat (Capra hircus) using modern PacBio technology in tandem with BioNano Genomics Irys optical maps and Lachesis clustering in order to provide a high quality reference assembly without the need for extensive filtering. Initial PacBio assemblies using P5C4 chemistry achieved contig N50’s of 4 Megabases and a BUSCO completion score of 84.0%, which is comparable to several finished model organism reference assemblies. We used BioNano Genomics’ Irys platform to generate 336 scaffolds from this data with a scaffold N50 of 24 megabases and total genome coverage of 98%. Lachesis interaction maps were used with a clustering algorithm to associate Irys scaffolds into the expected 30 chromosome physical maps. Comparisons of the initial hybrid scaffolds generated from the long read contigs and optical map information to a previously generated RH map revealed that the entirety of the Goat autosome 20 physical map was contained within one scaffold. Additionally, the BioNano scaffolding resolved several difficult regions that contained genes related to innate immunity which were problem regions in previous reference genome assemblies.
From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome
Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates and assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the assembly surpasses reference assemblies, de novo, with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly suggesting better quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and minimal manual effort.
Reference quality de novo genome assemblies were once solely the domain of large, well-funded genome projects. While next-generation short read technology removed some of the cost barriers, accurate chromosome-scale assembly remains a real challenge. Here we present efforts to de novo assemble the goat (Capra hircus) genome. Through the combination of single-molecule technologies from Pacific Biosciences (sequencing) and BioNano Genomics (optical mapping) coupled with high-throughput chromosome conformation capture sequencing (Hi-C), an inbred San Clemente goat genome has been sequenced and assembled to a high degree of completeness at a relatively modest cost. Starting with 38 million PacBio reads, we integrated the MinHash Alignment Process (MHAP) with the Celera Assembler (CA) to produce an assembly composed of 3110 contigs with a contig N50 size of 4.7 Mb. This assembly was scaffolded with BioNano genome maps derived from a single IrysChip into 333 scaffolds with an N50 of 23.1 Mb including the complete scaffolding of chromosome 20. Finally, cis-chromosome associations were determined by Hi-C, yielding complete reconstruction of all autosomes into single scaffolds with a final N50 of 91.7 Mb. We hope to demonstrate that our methods are not only cost effective, but improve our ability to annotate challenging genomic regions such as highly repetitive immune gene clusters.
Structural variant calling combining Illumina and low-coverage Pacbio Detection of large genomic variation (structural variants) has proven challenging using short-read methods. Long-read approaches which can span these large events have promise to dramatically expand the ability to accurately call structural variants. Although sequencing with Pacific Biosciences (Pacbio) long-read technology has become increasingly high throughput, generating high coverage with the technology can still be limiting and investigators often would like to know what pacbio coverages are adequate to call structural variants. Here, we present a method to identify a substantially higher fraction of structural variants in the human genome using low-coverage pacbio data by multiple strategies for ensembling data types and algorithms. Algorithmically, we combine three structural variant callers: PBHoney by Adam English, Sniffles by Fritz Sedlazeck, and Parliament by Adam English (which we have modified to improve for speed). Parliament itself uses a combination of Pacbio and Illumina data with a number of short-read callers (Breakdancer, Pindel, Crest, CNVnator, Delly, and Lumpy). We show that the outputs of these three programs are largely complementary to each other, with each able to uniquely access different sets of structural variants at different coverages. Combining them together can more than double the recall of true structural variants from a truth set relative to sequencing with Illumina alone, with substantial improvements even at low pacbio coverages (3x – 7x). This allows us to present for the first time cost-benefit tradeoffs to investigators about how much pacbio sequencing will yield what improvements in SV-calling. This work also builds upon the foundational work of Genome in a Bottle led by Justin Zook in establishing a truth set for structural variants in the Ashkenazim-Jewish trio data recently released. This work demonstrates the power of this benchmark set – one of the first of its kind for structural variation data – to help understand and refine the accuracies of calling structural variants with a number of approaches.
Most of the base pairs that differ between two human genomes are in intermediate-sized structural variants (50 bp to 5 kb), which are too small to detect with array comparative genomic hybridization or optical mapping but too large to reliably discover with short-read DNA sequencing. Long-read sequencing with PacBio Single Molecule, Real-Time (SMRT) Sequencing platforms fills this technology gap. PacBio SMRT Sequencing detects tens of thousands of structural variants in a human genome with approximately five times the sensitivity of short-read DNA sequencing. Effective application of PacBio SMRT Sequencing to detect structural variants requires quality bioinformatics tools that account for the characteristics of PacBio reads. To provide such a solution, we developed pbsv, a structural variant caller for PacBio reads that works as a chain of simple stages: 1) map reads to the reference genome, 2) identify reads with signatures of structural variation, 3) cluster nearby reads with similar signatures, 4) summarize each cluster into a consensus variant, and 5) filter for variants with sufficient read support. To evaluate the baseline performance of pbsv, we generated high coverage of a diploid human genome on the PacBio Sequel System, established a target set of structural variants, and then titrated to lower coverage levels. The false discovery rate for pbsv is low at all coverage levels. Sensitivity is high even at modest coverage: above 85% at 10-fold coverage and above 95% at 20-fold coverage. To assess the potential for PacBio SMRT Sequencing to identify pathogenic variants, we evaluated an individual with clinical symptoms suggestive of Carney complex for whom short-read whole genome sequencing was uninformative. The individual was sequenced to 9-fold coverage on the PacBio Sequel System, and structural variants were called with pbsv. Filtering for rare, genic structural variants left six candidates, including a heterozygous 2,184 bp deletion that removes the first coding exon of PRKAR1A. Null mutations in PRKAR1Acause autosomal dominant Carney complex, type 1. The variant was determined to be de novo, and it was classified as likely pathogenic based on ACMG standards and guidelines for variant interpretation. These case studies demonstrate the ability of pbsv to detect structural variants in low-coverage PacBio SMRT Sequencing and suggest the importance of considering structural variants in any study of human genetic variation.
High-quality de novo genome assembly and intra-individual mitochondrial instability in the critically endangered kakapo
The kakapo (Strigops habroptila) is a large, flightless parrot endemic to New Zealand. It is highly endangered with only ~150 individuals remaining, and intensive conservation efforts are underway to save this iconic species from extinction. These include genetic studies to understand critical genes relevant to fertility, adaptation and disease resistance, and genetic diversity across the remaining population for future breeding program decisions. To aid with these efforts, we have generated a high-quality de novo genome assembly using PacBio long-read sequencing. Using the new diploid-aware FALCON-Unzip assembler, the resulting genome of 1.06 Gb has a contig N50 of 5.6 Mb (largest contig 29.3 Mb), >350-times more contiguous compared to a recent short-read assembly of a closely related parrot (kea) species. We highlight the benefits of the higher contiguity and greater completeness of the kakapo genome assembly through examples of fully resolved genes important in wildlife conservation (contrasted with fragmented and incomplete gene resolution in short-read assemblies), in some cases even providing sequence for regions orthologous to gaps of missing sequence in the chicken reference genome. We also highlight the complete resolution of the kakapo mitochondrial genome, fully containing the mitochondrial control region which is missing from the previous dedicated kakapomitochondrial genome NCBI entry. For this region, we observed a marked heterogeneity in the number of tandem repeats in different mtDNAmolecules from a single bird tissue, highlighting the enhanced molecular resolution uniquely afforded by long-read, single-molecule PacBio sequencing.
Structural variants (SVs) – genomic differences =50 base pairs – are few by count compared to single nucleotide variants (SNVs) and indels but include most of the base pairs that differ between two humans.
Dan Geraghty explains that while there have been decades’ worth of studies associating the genetics of the major histocompatibility complex (MHC), and the highly polymorphic HLA class 1 and 2…
Listen to Mendelspod’s podcast with PacBio CEO, Mike Hunkapiller, as he recounts his memories from the first human draft genome project 15 years ago.