In the first podcast of a new series on the applications of long-read sequencing, Mendelspod host Theral Timpson interviewed Marc Salit, leader of the Genome Scale Measurements Group at the National Institute of Standards and Technology. Their conversation focused on how and why NIST is involved in establishing baseline measurements for the human genome.
Salit, along with Justin Zook and their team at NIST, is managing the Genome in a Bottle (GIAB) Consortium to develop the reference materials, data, and methods needed to assess whole human genome sequencing. Their goal is to establish a physical reference genome as a standard against which subsequent measurements can be compared, providing a foundation for the translation of whole genome sequencing into clinical practice.
As part of the GIAB Consortium’s efforts, we’re working with NIST (alongside many others from the PacBio community) to contribute both long-read sequencing data and analysis methods to help Salit and his team achieve this vision. Already NIST has SMRT® Sequencing data for the NA12878 genome as well as an Ashkenazi Jewish trio from the Personal Genome Project. They recently released this data in a bioRxiv pre-print entitled “Extensive sequencing of seven human genomes to characterize benchmark reference materials.”
Here are a few highlights from the conversation:
NIST’s progress to date
In Salit’s estimation, characterization of the first GIAB reference material (NIST RM 8398) is about two-thirds complete for small variants. “We can confidently call reference alleles, SNPs, and indels in about 77% of the genome,” he said. “We’re only starting to have confidence and methodology to understand how confident we should be about structural variations.” The remaining parts of the genome are more difficult to characterize because they contain long homopolymer regions, long duplicated regions, or highly repetitive regions that are challenging to access with short-read data. Another issue is larger, more complex genomic regions with enormous genetic variation across populations, which makes it difficult to determine with certainty whether the assembly is correct.
Data confidence
As part of the characterization process, the GIAB team is systematically integrating data from a variety of technologies and platforms and arbitrating among them. “We try to find evidence that backs up the call from each platform, then look for unambiguous calls where you’ve got strong supporting evidence that there are no technical artifacts or systematic sequencing errors,” Salit said. The aim is to use the best available evidence to make confident calls across the genome, particularly for complex and structural variants. Long-read, highly accurate sequencing data from PacBio is having a real impact in this effort. “It’s made a major difference with what part of the genome we can see,” Salit told Timpson. “Instead of looking at the world through a soda straw, we’re now looking at the world at least through a paper towel tube.”
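To make the arbitration idea concrete, here is a minimal sketch in Python of how calls from several platforms at a single site might be reconciled. It is an illustration only, not GIAB's actual integration pipeline: the function name `arbitrate`, the platform labels, the `flagged_platforms` set (standing in for evidence of a technical artifact), and the `min_support` threshold are all assumptions made for this example.

```python
from collections import Counter


def arbitrate(calls_by_platform, flagged_platforms=frozenset(), min_support=2):
    """Return a consensus genotype for one site, or None if no confident call.

    calls_by_platform: dict mapping a platform name to its genotype call,
        e.g. {"platform_a": "0/1"}.
    flagged_platforms: platforms whose call at this site is suspected to be
        a technical artifact (hypothetical flag) and is therefore ignored.
    min_support: minimum number of agreeing, unflagged platforms required.
    """
    # Keep only calls from platforms with no sign of an artifact at this site.
    trusted = [
        genotype
        for platform, genotype in calls_by_platform.items()
        if platform not in flagged_platforms
    ]
    counts = Counter(trusted)

    # A confident call needs every trusted platform to agree ...
    if len(counts) != 1:
        return None

    # ... and enough independent platforms backing that genotype.
    genotype, support = next(iter(counts.items()))
    return genotype if support >= min_support else None


site = {"platform_a": "0/1", "platform_b": "0/1", "platform_c": "1/1"}
confident = arbitrate(site, flagged_platforms={"platform_c"})
uncertain = arbitrate(site)
```

In this sketch a site receives a call only when the unflagged platforms agree and enough of them support it; otherwise it is left uncalled, which mirrors the idea of reporting only the portion of the genome that can be called confidently.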
Defining success
Salit and his team at NIST are working toward a future in which whole genome sequencing will be used as a clinical test to answer the tough questions around therapeutic decision-making. Their focus is on ensuring that those answers are demonstrably safe and efficacious for clinical applications, clearing the path for regulated applications of sequencing technology. For Salit, NIST holds unique power in its ability to convene the scientific community to address questions around basic measurement science: What are we measuring? Are we all measuring the same thing? How do we demonstrate that? Ultimately, GIAB is building an infrastructure “on top of which the century of biology will come into being,” Salit said. “We build the sewers of science, and when they’re working, nobody notices. When they’re not working, everybody notices.”
New podcasts in this Mendelspod series will be coming soon, and we’ll keep you posted with highlights as they become available.