Highly accurate read mapping of third generation sequencing reads for improved structural variation analysis
Characterizing genomic structural variations (SVs) is vital for understanding how genomes evolve. Furthermore, SVs are known to play a role in a wide range of diseases, including cancer, autism, and schizophrenia. Nevertheless, due to their complexity, they remain harder to detect and less well understood than single nucleotide variations. Recently, third-generation sequencing has proven to be an invaluable tool for detecting SVs. The markedly higher read length not only allows single reads to span an SV, it also enables reliable mapping to repetitive regions of the genome. These regions often contain SVs and are inaccessible to short-read mapping. However, current sequencing technologies such as PacBio show a raw read error rate of 10% or more, consisting mostly of insertions and deletions. Especially in repetitive regions, the high error rate causes current mapping methods to miss the exact borders of SVs, to split large deletions and insertions into several small ones, or in some cases, such as inversions, to fail to report them at all. Furthermore, for complex SVs it is not possible to find one end-to-end alignment for a given read. Deciding when to split a read into two or more separate alignments without knowledge of the underlying SV poses an even bigger challenge to current read mappers. Here we present NextGenMap-LR, a read mapper for long single-molecule PacBio reads that addresses these issues. NextGenMap-LR uses a fast k-mer search to quickly find anchor regions between parts of a read and the reference and evaluates them using a vectorized implementation of the Smith-Waterman (SW) algorithm. The resulting high-quality anchors are then used to determine whether a read spans an SV and has to be split or can be aligned contiguously. Finally, NextGenMap-LR uses a banded SW algorithm to compute the final alignment(s).
In this last step, to account for both sequencing errors and real genomic variation, we employ a non-affine gap model that penalizes gap extensions less for longer gaps than for shorter ones. Using simulated as well as verified human breast cancer SV data, we show that our approach significantly improves the mapping of long reads around SVs. The non-affine gap model is especially effective at identifying breakpoint positions more precisely, and the enhanced scoring scheme enables subsequent variation callers to identify SVs that would otherwise have been missed.
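
To illustrate the general idea behind a non-affine gap model, the following minimal Python sketch compares a conventional affine gap penalty with one whose per-base extension cost shrinks as the gap grows. It is not the scoring function used by NextGenMap-LR: the linear decay, the function names, and all parameter values (gap_open, ext_max, ext_min, decay) are hypothetical and chosen only for illustration.

```python
def extension_cost(k, ext_max=5.0, ext_min=1.0, decay=0.15):
    """Cost of the k-th gap base (k >= 1).

    Hypothetical scheme: the cost starts at ext_max and decreases
    linearly with gap length until it reaches the floor ext_min.
    """
    return max(ext_min, ext_max - decay * (k - 1))


def convex_gap_penalty(length, gap_open=5.0, **ext_params):
    """Total penalty of one gap under the decaying (non-affine) model."""
    if length <= 0:
        return 0.0
    extension = sum(extension_cost(k, **ext_params) for k in range(1, length + 1))
    return gap_open + extension


def affine_gap_penalty(length, gap_open=5.0, gap_extend=5.0):
    """Total penalty of one gap under a standard affine model."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # Compare one contiguous 100 bp gap with ten separate 10 bp gaps.
    for name, penalty in (("affine", affine_gap_penalty),
                          ("non-affine", convex_gap_penalty)):
        one_long = penalty(100)
        ten_short = 10 * penalty(10)
        print(f"{name:>10}: one 100 bp gap = {one_long:.2f}, "
              f"ten 10 bp gaps = {ten_short:.2f}")
```

Under the affine model, the two gap layouts differ only in the number of gap-open penalties. Under the decaying model, short gaps of the kind produced by sequencing errors stay comparatively expensive per base, while a single long gap, as expected for a real deletion or insertion, becomes much cheaper than the same number of gap bases scattered across many short gaps. This is the property that discourages an aligner from fragmenting a large indel.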