MaSuRCA Mega-Reads Assembly Technique for Haplotype-Resolved Genome Assembly of Hybrid PacBio and Illumina Data
Developments in DNA sequencing technology over the past several years have enabled a large number of scientists to obtain sequences for the genomes of their interest at a fairly low cost. Illumina sequencing has been the dominant whole-genome sequencing technology over the past few years because of its low cost. However, Illumina reads are short (up to 300 bp), so most draft genomes produced from Illumina data are highly fragmented, which limits their practical usefulness. Longer reads are needed for more contiguous genomes. Recently, PacBio has made significant advances in developing cost-effective long-read (>10,000 bp) sequencing technology, and its data, although several times more expensive than Illumina data, can be used to produce high-quality genomes. PacBio data can be used on its own for de novo assembly; however, because of its high error rate, high coverage of the genome is required, which raises the cost barrier. A cost-effective solution is to combine PacBio and Illumina data, leveraging the low error rate of the short Illumina reads and the length of the PacBio reads. We have developed the MaSuRCA mega-reads assembler for efficient assembly of hybrid data sets, and we demonstrate that it performs well compared to other published hybrid techniques. Another important benefit of long reads is their ability to link haplotype differences. The mega-reads approach corrects each PacBio read independently, and thus haplotype differences are preserved. By combining the accuracy of the Illumina data with the length of the PacBio reads, MaSuRCA mega-reads can therefore produce haplotype-resolved genome assemblies, in which each contig contains sequence from a single haplotype. We present preliminary results on haplotype-resolved genome assemblies of faux (proof-of-concept) and real data.