The Gapless Assembly: Scientists Describe Workflow for Producing Complete Eukaryote Genome
Thursday, August 20, 2015
Sunflowers with verticillium wilt caused by V. dahliae
In a new mBio publication, scientists from Wageningen University and KeyGene in The Netherlands report results from several strategies used to assemble the genome of a filamentous fungus, and describe the specific pipeline they recommend for sequencing and assembling eukaryotic genomes.
“Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome” comes from lead authors Luigi Faino and Michael Seidl, senior author Bart Thomma, and collaborators. Using Verticillium dahliae as a model, they compared short-read and long-read sequencing approaches and incorporated optical mapping data to identify the method that generated the highest-quality assembly for the 36 Mb genome. V. dahliae is a plant pathogen responsible for the damaging verticillium wilt disease in many crop species. The authors note that this fungus was an ideal fit for the project because of its extensive genomic rearrangements and enrichment for repetitive elements.
Starting with an exploration of hybrid strategies for assembly, they used a previously generated short-read assembly and employed optical mapping data, resulting in more than 4,500 contigs. This was followed by filling gaps in the assembly with SMRT® Sequencing data, which brought the total contig count down to about 500.
The researchers also tested single-step and two-step hybrid assemblies using both long and short reads, adding optical mapping data and using assemblers such as SPAdes. All of these approaches left a number of gaps in the assembly.
Next, they moved on to assemblies produced solely from PacBio® data, testing various levels of genome coverage and both the MHAP and HGAP assemblers. “All assemblies based on six or more SMRT Cells generated comparable assembly outputs, with a total assembly size of ~36.5 Mb composed of up to 49 contigs, an N50 that exceeded 2.9 Mb, and a largest contig exceeding 5.5 Mb in all cases,” the team reports. “All assemblies based on PacBio sequencing outperformed the hybrid assemblies as long as the sequencing depth exceeded 72x.” The authors noted that HGAP delivered a more accurate genome assembly due to the extra genome polishing step in the assembly protocol, whereas MHAP delivered a more contiguous genome assembly in instances of lower genome coverage.
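For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how N50 and sequencing depth are conventionally computed. It is not the authors' pipeline: the function names and the toy contig lengths are hypothetical and chosen only for illustration, and the 36.5 Mb genome size is taken from the assembly size reported in the quote.

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def sequencing_depth(total_sequenced_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_sequenced_bases / genome_size


# Hypothetical contig lengths in base pairs (NOT data from the paper).
toy_contigs = [5_500_000, 4_000_000, 3_000_000, 2_000_000, 1_000_000, 500_000]
toy_n50 = n50(toy_contigs)

# Bases needed to reach 72x on a ~36.5 Mb genome (simple arithmetic).
genome_size = 36_500_000
bases_for_72x = 72 * genome_size
depth = sequencing_depth(bases_for_72x, genome_size)
```

In this toy set the two longest contigs already cover more than half of the 16 Mb total, so the N50 is the length of the second contig. By the same arithmetic, 72x depth on a ~36.5 Mb genome corresponds to roughly 2.6 Gb of sequence.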
Lastly, optical map data was used to improve upon the PacBio-only assembly. “We were able to show that a combination of PacBio-generated long reads and optical mapping yields a gapless telomere-to-telomere genome assembly,” the scientists write, “allowing in-depth genome analyses to facilitate functional studies into an organism’s biology.”
The team then sequenced another V. dahliae strain using the PacBio-and-optical-map strategy and obtained another gapless assembly with eight telomere-to-telomere chromosomes, which they used to correct the orientation of several scaffolds in a previously generated Sanger assembly of that strain.
Armed with these assemblies, the scientists delved into a study of transposable and repetitive elements in the fungus, finding that long-terminal-repeat retrotransposons were the most common transposable element in both genomes. “Strikingly, in total, the repetitive elements in the V. dahliae genomes amount to 12%, which is 3 times higher than all previous estimates for these genomes,” they report. Identifying and understanding these elements is especially important for this plant pathogen, the authors add, because transposable elements and repeat-driven expansion have been critical factors in its virulence.
The team concludes that these findings show the utility of SMRT Sequencing and optical mapping for producing cost-effective, complete genome assemblies for complex eukaryotic organisms.