Toward a Gold Standard for Human Structural Variation
Thursday, February 2, 2017
Scientists from the University of Washington and McDonnell Genome Institute recently reported in Genome Research the results of an in-depth assessment of structural variation in the human genome using SMRT Sequencing technology. They found far more variation than expected and suggest using this approach to establish a comprehensive database of structural variants that would aid future studies.
“Discovery and genotyping of structural variation from long-read haploid genome sequence data” comes from lead author John Huddleston, senior author Evan Eichler, and collaborators. The team sequenced two haploid human cell lines (CHM1 and CHM13) with SMRT Sequencing to greater than 60-fold coverage each. Then, using an assembly-based approach called SMRT-SV, they mined the data for structural variants ranging in size from 2 bp to 28 kb. “While our understanding of single-nucleotide variants (SNVs) is beginning to approach nearly complete sensitivity for the euchromatic portion of the genome, structural variants or SVs … have fared far worse because of their stronger association with repetitive DNA,” the authors report. “We sought to build a verifiable gold standard for human genetic variation by first eliminating the complexity of diploidy and then applying an alternate sequencing technology that improves sensitivity over repetitive regions of the human genome.”
The team conducted a thorough investigation of insertions, deletions, and other types of structural variation. They discovered significantly more variation than anticipated, detecting five times as many indels (>7 bp) and structural variants (<1 kb) as short-read methods could find. “The theoretical amount of genetic variation in a single human diploid genome far exceeds expectations established by previous whole-genome studies,” they write. “Although this represents only a fraction of variant sites between two haplotypes, this missing variation accounts for most of the variant base pairs between two human genomes.”
Much of the variation was associated with DNA that was repetitive, GC-rich, or low complexity, all characteristics known to challenge short-read sequencers. “Long-read sequence technology can access these regions because alignments are sufficiently anchored within the flanks,” Huddleston et al. report.
The SMRT-SV protocol resolved more than 460,000 structural variants, nearly 90% of which have been missed even in highly regarded initiatives such as the 1000 Genomes Project. The approach was validated by targeting specific SVs for follow-up analysis as well as by studying 30 other human genomes to confirm the presence of these variants. Results indicate that “the majority of missed variants we discovered are common variants in the human population,” the authors report.
To learn more about the quest for structural variation going back to the Human Genome Project, check out this recent interview with Evan Eichler from Mendelspod.