Genome Research Paper: Resolve Complex Genomic Regions for a ‘Fraction of the Cost’ With SMRT Sequencing
Tuesday, January 21, 2014
A new Genome Research paper describes the application of Single Molecule, Real-Time (SMRT®) Sequencing to resolve repeat-heavy genomic regions in important reference genomes such as human and chimpanzee. In the process, the authors drew some important conclusions about cost, pooling, and coverage requirements for this type of work.
“Reconstructing complex regions of genomes using long-read sequencing technology” comes from lead author John Huddleston and senior author Evan Eichler at the University of Washington, along with collaborators at Washington University, the University of Bari, Bilkent University, and Pacific Biosciences.
In the paper, Eichler and his collaborators note the steep cost of finishing a BAC clone to high quality using Sanger sequencing. That problem has proliferated as short-read sequencing leaves more genomes in draft form, the authors write. “Although we can generate much more sequence, the short sequence read data and inability to scaffold across repetitive structures translates into more gaps, missing data, and more incomplete references assemblies,” according to the paper.
To find a more cost-effective alternative, they tested the PacBio® sequencing platform in complex genomic regions. In the first project, the team sequenced eight BAC clones representing a 1.3 Mbp region of chromosome 17q21.31 from a hydatidiform mole sample and assembled the results using HGAP and Quiver. The region is known for having high-identity segmental duplications as well as large structural polymorphisms. They report an average of 245x coverage per clone; each clone assembled into a single contig, and six of the eight clones required only a single SMRT Cell. After comparing the PacBio assembly with an existing high-quality Sanger assembly, the authors say, the new sequence showed 99.994% identity to Sanger. To validate the mismatches, “we targeted 44 differences using Illumina® sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases,” they add.
In the second project, Huddleston et al. performed similar work on a nearly 800 kbp region of the chimpanzee genome that has a significant number of duplications. The study, involving five BAC clones, again demonstrated the accuracy of the sequencing platform and assembly protocol. A validation procedure using BAC-end and fosmid-end sequences confirmed “the order, orientation, and sequence accuracy of the clone-based assembly of this complex region of the chimpanzee genome,” the scientists write.
In the paper, the team drew several conclusions from their efforts. One was that results in the first project could have been generated with almost exactly the same degree of accuracy using 100x coverage instead of more than 200x. This was confirmed using random downsampling of the data set. Separately, they demonstrated successful pooling of BACs, showing that pools of two or three samples could be properly separated post-sequencing to achieve high-quality assemblies of each clone.
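For readers unfamiliar with the idea, random downsampling to a target coverage can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' pipeline: the function names, the 200 kbp clone length, and the read lengths are all assumptions, and the reads are synthetic rather than taken from the paper.

```python
import random


def coverage(read_lengths, region_length):
    """Average fold coverage: total sequenced bases divided by region length."""
    return sum(read_lengths) / region_length


def downsample_to_coverage(read_lengths, region_length, target_coverage, seed=0):
    """Keep randomly chosen reads until the target coverage is reached.

    Hypothetical helper: reads are shuffled with a fixed seed and added
    one at a time until the kept bases reach target_coverage * region_length.
    """
    rng = random.Random(seed)
    shuffled = list(read_lengths)
    rng.shuffle(shuffled)

    needed_bases = target_coverage * region_length
    kept, total = [], 0
    for length in shuffled:
        if total >= needed_bases:
            break
        kept.append(length)
        total += length
    return kept


# Synthetic example data (assumed values, not from the paper):
# one 200 kbp clone and 12,000 reads of 1-8 kbp each.
data_rng = random.Random(42)
region_length = 200_000
reads = [data_rng.randint(1_000, 8_000) for _ in range(12_000)]

subset = downsample_to_coverage(reads, region_length, target_coverage=100)
print(f"full data set: {coverage(reads, region_length):.0f}x")
print(f"downsampled:   {coverage(subset, region_length):.0f}x")
```

In a real evaluation, each downsampled read set would then be reassembled and compared against the reference to see whether accuracy holds at the lower coverage.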
One of the goals of this study was to evaluate whether PacBio sequencing could bring back the quality of Sanger-finished genomes without the prohibitive cost. They conclude that SMRT Sequencing can indeed accomplish this task “for a fraction of the cost and time of traditional finishing approaches.” The authors report that sequencing a single BAC clone with Sanger costs $4,000 to $5,000, while the same task using PacBio costs approximately $625. That cost would decrease further using pooled BACs, they note.
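The size of that "fraction" follows directly from the figures quoted above, as the short calculation below shows. The pooled per-clone figures are purely hypothetical: they assume the roughly $625 cost is split evenly across the clones in a pool, which is a simplification the paper does not state.

```python
# Per-clone costs as reported in the article (US dollars).
SANGER_COST_LOW = 4_000
SANGER_COST_HIGH = 5_000
PACBIO_COST = 625


def fold_reduction(sanger_cost, pacbio_cost):
    """How many times cheaper the PacBio figure is than the Sanger figure."""
    return sanger_cost / pacbio_cost


def pooled_cost_per_clone(run_cost, clones_per_pool):
    """Hypothetical per-clone cost, ASSUMING the run cost splits evenly."""
    return run_cost / clones_per_pool


low = fold_reduction(SANGER_COST_LOW, PACBIO_COST)
high = fold_reduction(SANGER_COST_HIGH, PACBIO_COST)
print(f"PacBio is {low:.1f}x to {high:.1f}x cheaper per clone")

# Pool sizes of two and three match the pooling experiments described earlier.
for pool_size in (2, 3):
    cost = pooled_cost_per_clone(PACBIO_COST, pool_size)
    print(f"pool of {pool_size}: about ${cost:.0f} per clone (even-split assumption)")
```

On the reported numbers, the unpooled PacBio cost works out to roughly one-sixth to one-eighth of the Sanger cost.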
Finally, the authors suggest taking advantage of existing BAC-end sequence data to select clones that span gaps in important draft genomes and using SMRT Sequencing to increase the quality of those assemblies. “The approach we have described provides a strategy to resolve these more structurally complex regions during the final stages of assembly, ensuring that the 1000-2000 genes mapping therein become incorporated within future mammalian genome assemblies,” they write.