An improved circular consensus algorithm with an application to detection of HIV-1 Drug-Resistance Associated Mutations (DRAMs)
Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods and continue to depend on Sanger sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm for generating circular consensus sequences (CCS) from individual SMRT Sequencing reads. With this algorithm, dubbed CCS2, it is possible to reach arbitrarily high quality across longer insert lengths at lower cost and higher throughput than Sanger sequencing. We apply CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation, which is clinically important and challenging to call because of its regional sequence context. The mutation was introduced by site-directed mutagenesis as an A>C substitution at the third nucleotide position of codon 103 of the RT gene on a pNL4-3 backbone. Regions spanning ~1,300 bp were PCR-amplified from both the non-mutated and mutant (K103N) plasmids and were sequenced individually and as a 50:50 mixture. Sequencing data were analyzed using CCS2, which uses a fully generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to arbitrarily high accuracy. This result, previously demonstrated for multi-molecule consensus sequences with the Quiver algorithm, is made possible by incorporating per-Zero Mode Waveguide (ZMW) characteristics, thus accounting for the intrinsic variation in the sequencing process that is unique to each ZMW. With CCS2, we achieve a per-read empirical quality of QV30 at 19X coverage. This yields ~5,000 consensus sequences of 1.3 kb each, with a collective empirical quality of ~QV40. Additionally, we demonstrate a 0% miscall rate in both unmixed samples and estimate a 48:52 frequency split for the K103N mutation in the mixed sample, consistent with data produced by orthogonal platforms.
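The quality values quoted above are on the Phred scale, where QV = -10 log10(error probability), so QV30 corresponds to roughly one error per 1,000 bases and QV40 to roughly one per 10,000. The following Python sketch illustrates that conversion and one simple way a mixture frequency could be estimated from per-read calls at a single site. It is not part of CCS2: the function names are hypothetical, the Wilson score interval is an assumed choice of confidence interval, and the read counts in the usage example are invented for illustration rather than taken from this study.

```python
import math


def qv_to_error_prob(qv: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


def error_prob_to_qv(p_error: float) -> float:
    """Convert an error probability in (0, 1] to a Phred-scaled quality value."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


def variant_frequency(variant_reads: int, total_reads: int, z: float = 1.96):
    """Return (frequency, lower, upper) for a variant at one site.

    The bounds are a Wilson score interval; z = 1.96 gives an
    approximate 95% interval. This is an illustrative choice, not
    the method used in the abstract.
    """
    if total_reads <= 0 or not 0 <= variant_reads <= total_reads:
        raise ValueError("need 0 <= variant_reads <= total_reads and total_reads > 0")
    n = total_reads
    p = variant_reads / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, center - half, center + half


if __name__ == "__main__":
    # Per-base error probabilities implied by the two quality levels.
    print(qv_to_error_prob(30), qv_to_error_prob(40))

    # Hypothetical counts: 2,400 of 5,000 consensus reads carry one allele.
    freq, low, high = variant_frequency(2400, 5000)
    print(f"frequency={freq:.3f}, interval=({low:.3f}, {high:.3f})")
```

With several thousand consensus reads per sample, the sampling interval on an estimated frequency is narrow, so per-read accuracy rather than read count tends to be the limiting factor for calling a minor variant.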