In Genome-wide Study, Long Reads Prove Critical for Structural Variant Discovery
Wednesday, April 15, 2015
In a paper just published in BMC Genomics, a team of scientists led by Baylor’s Human Genome Sequencing Center reports a thorough analysis of structural variation in a personal genome. What makes this study special is the large number of different technologies applied and the sheer volume of data gathered and analyzed for this single genome. The paper also includes the first known analysis of structural variation in a diploid human genome using SMRT® Sequencing, with 10x coverage from PacBio® long reads.
Lead authors Adam English and William Salerno and their collaborators at a number of institutions describe results obtained with Parliament, a structural variant calling pipeline they developed. (Check out the full paper: “Assessing structural variation in a personal genome—towards a human reference diploid genome.”)
Structural variants account for the majority of variable bases in a human genome, according to the authors, who note that it will be important to detect and characterize these elements to understand their clinical relevance. Despite their significance, these variants are not as well understood as single nucleotide variants and other small variants. Through this effort to establish new ways to find and analyze structural variants, the scientists determined that short-read sequencing technologies alone miss a substantial share of the structural variation in a genome.
Working with a well-characterized genome, the team combined array CGH data with genome sequence data from Illumina, SOLiD, and PacBio systems, as well as a genome map from BioNano Genomics, to build as comprehensive a data set as possible. That information was fed into Parliament, a pipeline for consensus structural variant calling that can be used with multiple data sets and detection approaches. The data sets were analyzed in various permutations within Parliament, which identified more than 31,000 loci representing possible structural variants. Of those, nearly 10,000 (spanning almost 60 Mb and nearly 2% of the reference genome) were supported by deep-dive genome analysis.
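To give a feel for what consensus calling across data sets involves, here is a minimal Python sketch that merges candidate calls from several sources into loci and records which sources support each one. It is not Parliament's actual algorithm: the `Call` and `Locus` types, the `merge_calls` function, the 50% reciprocal-overlap threshold, and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Call:
    """One candidate structural variant reported from one data source (hypothetical type)."""
    chrom: str
    start: int
    end: int
    source: str


@dataclass
class Locus:
    """A merged region plus the set of sources that reported a call there (hypothetical type)."""
    chrom: str
    start: int
    end: int
    sources: set = field(default_factory=set)


def reciprocal_overlap(locus, call):
    """Fraction of the shorter-covered interval: min(overlap/len(locus), overlap/len(call))."""
    if locus.chrom != call.chrom:
        return 0.0
    overlap = min(locus.end, call.end) - max(locus.start, call.start)
    if overlap <= 0:
        # Also covers zero-length calls, which a real pipeline would handle separately.
        return 0.0
    return min(overlap / (locus.end - locus.start), overlap / (call.end - call.start))


def merge_calls(calls, min_overlap=0.5):
    """Greedy single pass: a call joins the previous locus if it reciprocally overlaps it."""
    loci = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        if loci and reciprocal_overlap(loci[-1], call) >= min_overlap:
            last = loci[-1]
            last.start = min(last.start, call.start)
            last.end = max(last.end, call.end)
            last.sources.add(call.source)
        else:
            loci.append(Locus(call.chrom, call.start, call.end, {call.source}))
    return loci


# Made-up example calls; the coordinates are illustrative only.
example_calls = [
    Call("chr1", 100, 200, "illumina"),
    Call("chr1", 110, 210, "pacbio"),
    Call("chr1", 5000, 5600, "pacbio"),
    Call("chr2", 100, 200, "solid"),
]
example_loci = merge_calls(example_calls)
pacbio_only = [locus for locus in example_loci if locus.sources == {"pacbio"}]
```

In this toy input the two overlapping chr1 calls collapse into one locus supported by two sources, while the remaining calls stay as single-source loci. A real pipeline would also need to account for variant type, breakpoint uncertainty, and insertions, which have no extent on the reference.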
Of the 9,777 confirmed structural variants, the authors report that 3,801 were identified solely by PacBio long-read sequence data using PBHoney, part of the Parliament pipeline, “indicating the importance of read length when characterizing structural variation.” English et al. make the case for using multiple data sources to improve structural variant detection. “The addition of long-read data can more than triple the number of [structural variants] detectable in a personal genome,” they write.
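For a sense of scale, the two counts above can be turned into a proportion. The short Python calculation below uses only the numbers reported in this post; the variable names are ours.

```python
# Counts as reported in the post.
confirmed_variants = 9_777
pacbio_only_variants = 3_801

# Share of confirmed structural variants found solely in the PacBio long-read data.
pacbio_only_share = pacbio_only_variants / confirmed_variants
print(f"{pacbio_only_share:.1%}")  # prints 38.9%
```

This is simply the fraction of confirmed variants seen only in the long-read data; it is a separate quantity from the authors' statement about tripling the number of detectable variants.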
The team has made Parliament publicly available through cloud-based service provider DNAnexus. “Implementation of Parliament on local compute requires independent installation of multiple discovery tools and a local assembler, imposing a burden of systems administration and resource consumption,” English et al. write, explaining why they chose DNAnexus to handle the computational side of this project. The cloud provider is now hosting the workflow established by the team — including a pipeline that takes BAM files and generates structural variant calls — as well as data generated in this project.
The authors hope their work helps establish a gold-standard catalog of human structural variation. “The present work identifies upper (4.5%) and lower (1.8%) estimates of the extent of structural variation in a personal genome and characterizes the impact of various resequencing methods,” they write. They also note that “as with [single nucleotide variants], many [structural variants] in a personal genome represent rare or private variants not observed in databases,” highlighting the need to sequence many individuals to obtain a deeper understanding of the extent and diversity of structural variants in the human population and their link to disease.