Optimizing Eukaryotic De Novo Genome Assembly: Webinar Recording Available
Wednesday, July 9, 2014
Our webinar on eukaryotic genome assembly attracted a great crowd, and we’re now making the full recording available to the community. The session offered hands-on information and best practices for working with Single Molecule, Real-Time (SMRT®) Sequencing data. “Optimizing Eukaryotic Genome Assembly with Long-Read Sequencing” featured three excellent speakers: Michael Schatz and James Gurtowski from Cold Spring Harbor Laboratory and Sergey Koren from the National Biodefense Analysis and Countermeasures Center. It was hosted by our own CSO, Jonas Korlach.
Schatz kicked off the session with an overview of assemblers for PacBio® data, including recommendations for when to use each one, and a look at the challenges of short-read assemblies. He also set expectations for long-read data, noting that for genomes smaller than 100 Mb, users should expect a nearly perfect assembly from the automated workflow. Genomes up to 1 Gb should yield a high-quality assembly with a contig N50 of at least 1 Mb. Larger genomes will have shorter contig N50s and will require more computational power, he added.
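For readers less familiar with the contig N50 statistic mentioned above, here is a minimal Python sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `contig_n50` and the toy contig lengths are illustrative assumptions, not values from the webinar.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)   # longest contigs first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (made-up lengths, 9 Mb in total).
toy_contigs = [4_000_000, 2_500_000, 1_200_000, 800_000, 300_000, 200_000]
print(contig_n50(toy_contigs))
```

In this toy case the longest contig alone covers 4 Mb of the 9 Mb total, which is less than half, so the N50 is the length of the second contig. A higher N50 generally means the assembly is less fragmented, although it says nothing about correctness on its own.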
Next, Gurtowski gave an in-depth look at hybrid assemblies, in which shorter reads are used to correct errors in longer reads. He provided step-by-step instructions for using ECTools, a new suite of publicly available assembly tools developed in the Schatz lab. He noted that the pipeline was designed to be modular, so users can run the whole workflow or pick out just the elements most helpful to them. Finally, Gurtowski cautioned attendees that the best assembler for the pre-assembly step depends on the data, so he recommended trying several and comparing the results.
Koren presented data on chromosome-scale assembly, introducing the MinHash Alignment Process (MHAP), a new method he developed to dramatically reduce the processing power needed for genome assemblies. (Adam Phillippy also spoke about this tool at our recent user group meeting.) Using a Drosophila assembly as an example, Koren showed that traditional assemblers required 629,000 CPU hours, while MHAP completed the same assembly in just 1,086 CPU hours (a roughly 580-fold reduction) and produced a slightly higher-quality result. He also gave a live demo of the automated MHAP pipeline, showing how to tune parameters such as memory usage as you go.
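To give a feel for the MinHash idea that MHAP builds on, the sketch below compresses each read into a short list of minimum hash values over its k-mers and estimates how similar two reads are by comparing those lists instead of aligning the reads directly. This is a simplified illustration in Python and not the MHAP implementation: the function names, the k-mer size, the number of hash functions, and the toy reads are all assumptions chosen for readability.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=64):
    """Return the smallest hash value over the k-mers for each of num_hashes functions."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate k-mer set similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must be the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical toy reads: read_b overlaps the second half of read_a.
read_a = "ACGTTGCATGCCGATAGCTAGGCTTACGATCGGATCCTAGCATG"
read_b = "GCTTACGATCGGATCCTAGCATGTTAGGCCATATCGGCTAAGC"

similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
print(f"estimated k-mer Jaccard similarity: {similarity:.2f}")
```

Because each sketch has a fixed size regardless of read length, candidate overlaps can be screened by comparing small lists of integers, which is the general reason a MinHash-based approach can cut compute time so sharply. Real long-read data would also need parameters tuned for sequencing error, which this sketch ignores.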
After the presentations, there was a lively Q&A session, which is also captured in the webinar recording. Topics ranged from the impact of highly polymorphic regions on assembly quality to more technical questions, such as the use of unitigs or contigs with ECTools and how to combine PacBio data generated with different chemistries.
View the webinar recording