Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, an extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy. © 2017 Zimin et al.; Published by Cold Spring Harbor Laboratory Press.
Journal: Genome Research
DOI: 10.1101/gr.213405.116
Year: 2017
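
The abstract reports contiguity as an N50 contig size. As a reminder of what that statistic measures, the following is a minimal sketch of the usual definition: the contig length L such that contigs of length L or longer contain at least half of all assembled bases. The function name `n50` and the toy contig lengths are hypothetical and chosen only for illustration; this is not taken from the authors' software.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumes the common definition: sort contigs from longest to
    shortest and return the length of the contig at which the
    running total first reaches half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")

    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (not data from the paper).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 70, since 80 + 70 = 150 >= 290 / 2
```

Because N50 is weighted by bases rather than by contig count, a handful of very long contigs can raise it substantially, which is why it is commonly used to summarize assembly contiguity.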