Software Tools Optimized for Long Reads Improve Detection of Complex Structural Variants
Thursday, August 24, 2017
[Update April 30, 2018: This paper is now published in Nature Methods.]
Sniffles and NGMLR, structural variant detection and alignment algorithms developed in the Schatz lab for long-read sequence data, are already familiar to many in the PacBio community. Now, a preprint is available so users can see how these open-source tools perform in a variety of conditions.
“Accurate detection of complex structural variations using single molecule sequencing” comes from lead author Fritz Sedlazeck at Baylor College of Medicine, senior author Michael Schatz at Johns Hopkins University, and collaborators. The team notes that long-read sequencing has introduced a much more comprehensive means of discovering structural variants, many of which are missed by short-read sequence data. To take advantage of that capability, the scientists developed NGMLR, “a fast and accurate aligner for long reads,” and Sniffles, which “successively scans within and between the alignments to identify all types of [structural variants],” according to the paper. Sniffles is unique in its ability to routinely detect nested variants.
The scientists describe evaluating the performance of these tools for structural variant discovery using data from several different sequencing platforms. The tools were tested on data from a breast cancer genome, healthy human genomes, and Arabidopsis. Using a simulated human data set, the team found that NGMLR and Sniffles outperformed other algorithms such as BWA-MEM and PBHoney, detecting nearly 95% of structural variants with no false discoveries. While more than 94% of variants called by PacBio were confirmed by other platforms, the scientists report that “Oxford Nanopore had substantially worse concordance. … This systematic bias for deletions in the Oxford Nanopore data is most likely an error in the base calling.”
Sedlazeck et al. also found a concerning trend in structural variant calls when using short-read data. “Using the short-read approach we detect, on average, 27 times more translocation events compared to using Sniffles within presumably healthy human data sets,” they note. An investigation into this phenomenon determined that mis-mapping of short reads in low-complexity regions leads to insertions being misidentified as translocations. “Overall, we could rule out 1,869 (83.18%) of the Illumina-based translocation calls as false,” they report.
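To put the quoted figure in context, the short Python sketch below back-calculates the approximate number of Illumina-based translocation calls implied by "1,869 (83.18%)". The total is not stated in this post, so the result is an inference from the two quoted numbers, and the function and variable names are illustrative only.

```python
def implied_total(count: int, percent: float) -> int:
    """Return the whole-number total for which `count` is `percent` percent."""
    return round(count / (percent / 100.0))


# Figures quoted in the post.
false_calls = 1869
false_percent = 83.18

# Inferred, not reported: the total number of translocation calls examined.
total_calls = implied_total(false_calls, false_percent)

# Calls that could not be ruled out as false under this inference.
remaining_calls = total_calls - false_calls

print(f"Implied total calls: {total_calls}")
print(f"Calls not ruled out: {remaining_calls}")
```

Under this reading, roughly one in six of the short-read translocation calls was not ruled out as false.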
Finally, the scientists assessed how much coverage is necessary to see the full picture of structural variation. For a healthy human genome, 15-fold coverage of SMRT Sequencing “has a precision of ~80% and recall of 69.64%,” they write. Boosting that to 30-fold coverage achieved similar completeness for the much more complex cancer genome. “This translates to a potential price reduction of several tens of thousands of dollars per sample,” they add. “These requirements will be reduced even more in the years to come as the throughput and read length increase and sequencing error rates decrease.”
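For readers less familiar with these metrics, the sketch below shows the standard definitions of precision and recall in terms of true positives, false positives, and false negatives, and combines the two quoted values into an F1 score. The F1 score is not reported in this post; it is computed here only as an illustration, treating the "~80%" precision as exactly 0.80.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported variants that are real."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that were reported."""
    return true_pos / (true_pos + false_neg)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Values quoted for 15-fold coverage of a healthy human genome.
# The precision is given only approximately ("~80%"), so 0.80 is an assumption.
quoted_precision = 0.80
quoted_recall = 0.6964

# Illustrative summary metric, not a figure from the paper.
combined = f1_score(quoted_precision, quoted_recall)
print(f"F1 at 15-fold coverage: {combined:.3f}")
```

A single combined number like this makes it easier to compare coverage levels, although the paper itself reports precision and recall separately.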
“The versatility of these methods enables an unprecedented view into structural variations in the human genome and other genomes from long read single molecule sequencing data,” the scientists write. They predict that these and related improvements “will usher in a new era of high quality genome sequences for a broad range of research and clinical applications, and lead to new insights into polymorphic variation, pathogenic conditions, and the forces of evolution.”