Beyond the $1,000 Genome: An Interview with Mark Gerstein
Monday, October 5, 2015
Next-generation sequencing has many people excited about the prospect of the $1,000 genome. However, recent discoveries show that short-read sequencing technologies miss important genomic elements, driving scientists to look for an alternative approach. Mark Gerstein, co-director of the computational biology and bioinformatics program at Yale, argues that the true $1,000 human genome has yet to arrive. In a recent conversation with Mendelspod host Theral Timpson, he discussed some of the important, deep technical questions that must first be addressed. This is the second in a series of podcast interviews focused on long-read sequencing, and we have included some highlights from the conversation below.
Pseudogenes and the non-coding portion of the human genome
For Gerstein, “the majority of $1,000 genome sequencing focuses mostly on SNPs — on the 3 to 4 million single nucleotide variants that distinguish every person.” While valuable, these SNPs do not represent the full extent of variation. “The non-coding regions are our most glaring omission — there’s a lot more to know,” Gerstein said. “There’s so much of our genome and we’re focusing myopically on genes.”
On the topic of pseudogenes, once thought of as junk DNA, Gerstein said, “these pseudogenes also carry a much better record of our history. Since many are functional, large chunks can be transcribed and maybe even translated. In essence, transcribed pseudogenes function as non-coding RNA, carrying out a regulatory function — not as a protein-coding gene, but as a non-coding RNA.” We also now know that the human genome has about as many pseudogenes as genes, and sometimes even more since they are under much less constraint, Gerstein noted.
The increasing sophistication of sequencing technologies is making it possible to examine these more difficult regions. “Integrating Pacific Biosciences technology has a lot of promise in genomic sequencing, by allowing us to fill in regions and provide high-quality sequencing,” Gerstein said.
Big data and the challenge of privacy
Mass sequencing of human genomes is producing vast quantities of sequence data. Gerstein told Timpson that there are “three challenges in thinking about how to organize large amounts of genomic information: agreeing on our fundamental, biological understanding of the right structure of the genome; genome interpretation; and privacy, which dramatically complicates queries. This is a dominating issue and will likely circumscribe our progress.”
Gerstein broke down the privacy challenge into two distinct topics: “There’s the legal bit that has to do with ethics, legal, and regulatory structures,” he said, “and there’s the technical bit. Is it meaningful to encrypt millions of genomes? Can you query across them?” He made the point that genomics has long been the poster child for open data, but that privacy issues are now introducing new hurdles. “It runs into a wall when it comes to an individual patient’s genome or records, which is more about individual protections, which is thornier,” he said. “I hope we’ll get to a point where people think they should own their own data — not a company, a doctor, or a hospital.”
The Mendelspod interview series on this topic continues, and we will keep you posted with highlights as new podcasts come out.