ASHG 2015 GRC workshop talk by Tina Graves-Lindsay
Most of the base pairs that differ between two human genomes are in intermediate-sized structural variants (50 bp to 5 kb), which are too small to detect with array comparative genomic hybridization or optical mapping but too large to reliably discover with short-read DNA sequencing. Long-read sequencing with PacBio Single Molecule, Real-Time (SMRT) Sequencing platforms fills this technology gap. PacBio SMRT Sequencing detects tens of thousands of structural variants in a human genome with approximately five times the sensitivity of short-read DNA sequencing. Effective application of PacBio SMRT Sequencing to detect structural variants requires quality bioinformatics tools that account for the characteristics of PacBio reads. To provide such a solution, we developed pbsv, a structural variant caller for PacBio reads that works as a chain of simple stages: 1) map reads to the reference genome, 2) identify reads with signatures of structural variation, 3) cluster nearby reads with similar signatures, 4) summarize each cluster into a consensus variant, and 5) filter for variants with sufficient read support. To evaluate the baseline performance of pbsv, we generated high coverage of a diploid human genome on the PacBio Sequel System, established a target set of structural variants, and then titrated to lower coverage levels. The false discovery rate for pbsv is low at all coverage levels. Sensitivity is high even at modest coverage: above 85% at 10-fold coverage and above 95% at 20-fold coverage. To assess the potential for PacBio SMRT Sequencing to identify pathogenic variants, we evaluated an individual with clinical symptoms suggestive of Carney complex for whom short-read whole genome sequencing was uninformative. The individual was sequenced to 9-fold coverage on the PacBio Sequel System, and structural variants were called with pbsv. 
Filtering for rare, genic structural variants left six candidates, including a heterozygous 2,184 bp deletion that removes the first coding exon of PRKAR1A. Null mutations in PRKAR1A cause autosomal dominant Carney complex, type 1. The variant was determined to be de novo, and it was classified as likely pathogenic based on ACMG standards and guidelines for variant interpretation. These case studies demonstrate the ability of pbsv to detect structural variants in low-coverage PacBio SMRT Sequencing and suggest the importance of considering structural variants in any study of human genetic variation.
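Stages 2 through 5 of the chain described above (signatures, clustering, consensus, support filter) can be illustrated with a minimal Python sketch. This is not the pbsv implementation: the Signature record, the greedy clustering rule, the median consensus, and the thresholds max_distance, max_length_ratio, and min_support are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Signature:
    """One read-level hint of a structural variant (hypothetical record)."""
    chrom: str
    pos: int
    svtype: str  # e.g. "DEL" or "INS"
    length: int
    read_id: str


def cluster_signatures(signatures, max_distance=200, max_length_ratio=1.5):
    """Greedily group nearby signatures of the same type and similar length."""
    clusters = []
    ordered = sorted(signatures, key=lambda s: (s.chrom, s.svtype, s.pos))
    for sig in ordered:
        if clusters:
            prev = clusters[-1][-1]
            longer = max(sig.length, prev.length)
            shorter = max(1, min(sig.length, prev.length))
            if (sig.chrom == prev.chrom
                    and sig.svtype == prev.svtype
                    and sig.pos - prev.pos <= max_distance
                    and longer / shorter <= max_length_ratio):
                clusters[-1].append(sig)
                continue
        clusters.append([sig])
    return clusters


def call_variants(signatures, min_support=2):
    """Summarize each cluster as a consensus call and filter by read support."""
    calls = []
    for cluster in cluster_signatures(signatures):
        support = len({s.read_id for s in cluster})  # distinct supporting reads
        if support < min_support:
            continue
        calls.append({
            "chrom": cluster[0].chrom,
            "pos": int(median(s.pos for s in cluster)),
            "svtype": cluster[0].svtype,
            "length": int(median(s.length for s in cluster)),
            "support": support,
        })
    return calls
```

A real caller would build the consensus from the read sequences themselves rather than from median coordinates; the sketch only shows how per-read evidence is pooled before a support threshold is applied.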
Structural variants (SVs) – genomic differences ≥50 base pairs – are few by count compared to single nucleotide variants (SNVs) and indels but include most of the base pairs that differ between two humans.
Fast and effective variant calling algorithms have been crucial to the successful application of DNA sequencing in human genetics. In particular, joint calling – in which reads from multiple individuals are pooled to increase power for shared variants – is an important tool for population surveys of variation. Joint calling was applied by the 1000 Genomes Project to identify variants across many individuals each sequenced to low coverage (about 5-fold). This approach successfully found common small variants, but broadly missed structural variants and large indels for which short-read sequencing has limited sensitivity. To support use of large variants in rare disease and common trait association studies, it is necessary to perform population-scale surveys with a technology effective at detecting indels and structural variants, such as PacBio SMRT Sequencing. For these studies, it is important to have a joint calling workflow that works with PacBio reads. We have developed pbsv, an indel and structural variant caller for PacBio reads, that provides a two-step joint calling workflow similar to that used to build the ExAC database. The first stage, discovery, is performed separately for each sample and consolidates whole genome alignments into a sparse representation of potentially variant loci. The second stage, calling, is performed on all samples together and considers only the signatures identified in the discovery stage. We applied the pbsv joint calling workflow to PacBio reads from twenty human genomes, with coverage ranging from 5-fold to 80-fold per sample for a total of 460-fold. The analysis required only 102 CPU hours, and identified over 800,000 indels and structural variants, including hundreds of inversions and translocations, many times more than discovered with short-read sequencing. The workflow is scalable to thousands of samples. 
The ongoing application of this workflow to thousands of samples will provide insight into the evolution and functional importance of large variants in human evolution and disease.
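The two-stage structure described above (per-sample discovery into a sparse representation, then calling across all samples together) can be sketched in Python as follows. This is an illustrative toy, not the pbsv workflow: the function names discover and joint_call, the fixed-width binning of positions, and the min_total_support threshold are assumptions.

```python
from collections import defaultdict


def discover(sample_signatures, bin_size=100):
    """Discovery stage, run per sample.

    Collapses (chrom, pos, svtype) signatures into a sparse table of
    read counts keyed by (chrom, position bin, svtype).
    """
    sparse = defaultdict(int)
    for chrom, pos, svtype in sample_signatures:
        sparse[(chrom, pos // bin_size, svtype)] += 1
    return dict(sparse)


def joint_call(discovered, min_total_support=3):
    """Calling stage, run once over all samples.

    `discovered` maps sample name -> output of discover(). A locus is
    called when support pooled across samples reaches the threshold;
    the per-sample counts are reported for each called locus.
    """
    pooled = defaultdict(int)
    for counts in discovered.values():
        for locus, n in counts.items():
            pooled[locus] += n

    calls = {}
    for locus, total in pooled.items():
        if total >= min_total_support:
            calls[locus] = {
                sample: counts.get(locus, 0)
                for sample, counts in discovered.items()
            }
    return calls
```

The point of the design is that the calling stage reads only the small per-sample tables, never the whole genome alignments, and that a variant with too little support in any single low-coverage sample can still be called once evidence is pooled.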
FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)
Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. Studies of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach (FALCON-Unzip) does not require parental information and separates PacBio reads during genome assembly using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual, we scaffolded the FALCON-Phase contigs and re-phased them, creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
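The trio-based read partitioning described above, in which the child's long reads are sorted into maternal and paternal bins using parent-specific sequences, can be illustrated with a small Python sketch. It is not the TrioCanu implementation: the helper names, the use of exact k-mer sets, the simple majority vote, and the omission of reverse complements and sequencing-error handling are simplifying assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental reads."""
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a child read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy usage with made-up sequences that differ at one base.
mat_only, pat_only = parent_specific_kmers(["ACGTACGA"], ["ACGTTCGA"], k=4)
print(bin_read("GGTACGA", mat_only, pat_only, k=4))
print(bin_read("CGTTCG", mat_only, pat_only, k=4))
```

Reads that land in the "unassigned" bin come from regions where the parents are identical, so they carry no phase information; each binned set would then be assembled separately.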