Two Worlds of Genome Assemblers
Friday, June 14, 2013
by Jonas Korlach, Chief Scientific Officer
Finished genomes were the focus of last month’s Sequencing, Finishing, Analysis in the Future (SFAF) meeting in Santa Fe, New Mexico. The meeting featured several presentations on this theme, including a talk by Adam Phillippy from the National Biodefense Analysis and Countermeasures Center that demonstrated the ability to generate high-quality, finished microbial genomes using just long-read PacBio data. Several papers describing the same principle have also appeared recently: the HGAP/Quiver Nature Methods paper, the FDA’s Salmonella Javiana outbreak genome publication, a blog entry by the University of Maryland using HGAP, and a preprint by Adam Phillippy and colleagues describing a similar genome assembly strategy and results.
These presentations and papers highlight the fact that SMRT® Sequencing, in conjunction with the appropriate bioinformatics tools, yields highly accurate genomes, exceeding 99.999% accuracy, despite a relatively high single-pass error rate. This is possible because final genome assemblies build sequence through consensus(1): each position is called from many overlapping reads, and because the errors in SMRT Sequencing are random rather than systematic, they rarely coincide at the same position, so very high consensus accuracy can be achieved.
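The effect of consensus on random errors can be illustrated with a toy calculation. The sketch below is not how HGAP or Quiver compute consensus; it assumes a deliberately simplified model in which every read covering a position is wrong independently with the same probability, and the consensus call fails only when more than half of the reads are wrong. The function name, the 15% per-read error rate, and the coverage values are assumptions chosen for illustration.

```python
from math import comb


def consensus_error_bound(per_read_error: float, coverage: int) -> float:
    """Probability that more than half of `coverage` independent reads are
    wrong at one position (a pessimistic bound for a majority-vote call,
    since it treats all wrong reads as agreeing with each other)."""
    p = per_read_error
    majority = coverage // 2 + 1
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(majority, coverage + 1)
    )


if __name__ == "__main__":
    # Hypothetical 15% per-read error rate at increasing (odd) coverage.
    for cov in (1, 11, 21, 31):
        print(cov, consensus_error_bound(0.15, cov))
```

Under these assumptions the bound shrinks rapidly as coverage grows, which is the intuition behind the statement above. Systematic errors would not behave this way, because they would recur at the same position in many reads.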
Long reads and consensus are also at the heart of the genome assemblers appropriate for our type of sequencing reads. These overlap-based assemblers, such as Celera® Assembler or MIRA, were originally developed during the era of Sanger sequencing and are robust to errors. The long reads provide ample information about which reads belong together in pair-wise alignments, thereby connecting them properly for a correct genome assembly. In contrast, short-read technologies have largely relied on de Bruijn graph-based assemblers: short reads are broken down further into k-mers, from which a graph is constructed and the assembly is derived.(2) As a result, de Bruijn graph assemblers are very sensitive to single-read errors, which is why there has been a focus on single-pass sequence read accuracy in recent years.(3) Overlap and de Bruijn assemblers therefore differ fundamentally in their approach. The right bioinformatic tools need to be matched to each type of sequencing data, and different parameters need to be evaluated to judge their performance.
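A small example may help make the contrast concrete. The sketch below is a toy illustration, not the algorithm used by Celera Assembler, MIRA, or any de Bruijn assembler: it counts how many new k-mers a single substitution introduces into a read, and it checks whether a suffix-prefix overlap between two reads is still found when a mismatch rate is tolerated. All function names, sequences, and thresholds are hypothetical, and only substitutions are modeled, whereas real overlappers align with insertions and deletions as well.

```python
from __future__ import annotations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def spurious_kmers(true_read: str, observed_read: str, k: int) -> set[str]:
    """K-mers present in the observed read but absent from the true sequence."""
    return kmers(observed_read, k) - kmers(true_read, k)


def suffix_prefix_overlap(a: str, b: str, min_len: int,
                          max_mismatch_rate: float) -> int:
    """Length of the longest suffix of `a` matching a prefix of `b` within the
    allowed mismatch rate (substitutions only), or 0 if none qualifies."""
    shortest = max(min_len, 1)
    for length in range(min(len(a), len(b)), shortest - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatch_rate * length:
            return length
    return 0


if __name__ == "__main__":
    truth = "ACGTTGCATGGCCATTAGC"
    with_error = "ACGTTGCATAGCCATTAGC"  # one substitution at index 9 (0-based)

    # Every k-mer spanning the erroneous base is new to the graph.
    print(sorted(spurious_kmers(truth, with_error, k=5)))

    # The same error inside an overlap: tolerated vs. exact matching.
    left, right = "ACGTTGCATGGCC", "GCATAGCCATTAGC"
    print(suffix_prefix_overlap(left, right, min_len=6, max_mismatch_rate=0.15))
    print(suffix_prefix_overlap(left, right, min_len=6, max_mismatch_rate=0.0))
```

In this toy case one wrong base produces a whole run of k-mers that do not exist in the true sequence, each of which would add a false node or edge to a k-mer graph, while the tolerant overlap check still connects the two reads.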
We are excited about the application of these new assembly strategies to large numbers of microbial genomes (e.g., in the context of the 100K Foodborne Pathogen Genome Project) to close the large gap that currently exists between draft genomes and finished genomes in GenBank. Finished microbial genomes are the foundation for functional genomics studies, comparative genomics, forensics, microbial outbreak source identification, and phylogenetic analysis, and are thereby crucial for understanding microbes and advancing the field of microbiology.(4) Sequencing microbial genomes de novo, i.e., without the need for a pre-existing reference genome, is important for capturing novel elements such as plasmids or phages. These are sometimes referred to as the accessory genome(5), and can make the crucial difference between a harmless, commensal bacterium and a serious, perhaps drug-resistant pathogen.
1. S. Junemann, F. J. Sedlazeck, K. Prior et al., Nat Biotechnol 31 (4), 294 (2013).
2. J. R. Miller, S. Koren, and G. Sutton, Genomics 95 (6), 315 (2010).
3. N. J. Loman, C. Constantinidou, J. Z. Chan et al., Nat Rev Microbiol 10 (9), 599 (2012).
4. C. M. Fraser, J. A. Eisen, K. E. Nelson et al., J Bacteriol 184 (23), 6403 (2002).
5. D. Croll and B. A. McDonald, PLoS Pathog 8 (4), e1002608 (2012).