1Pacific Biosciences (PacBio), Menlo Park, United States; 2Baylor College of Medicine, Human Genome Sequencing Center, Houston, United States
A heteroduplex is a double-stranded molecule whose two strands are not fully complementary; it can form during PCR when strands from different templates anneal. These mixed-template artifacts produce misleading results in downstream analysis, e.g., false haplotypes during diplotyping. Unlike short-read technologies, PacBio Single-Molecule Real-Time sequencing produces strand-level base calls, so heteroduplex signatures can be directly observed and corrected using the stranded subread data. Our new method is integrated into the circular consensus sequence algorithm, which generates accurate HiFi data from subreads.
PacBio subreads are transformed into high-accuracy HiFi reads by the circular consensus sequence (CCS) algorithm. During CCS, an intermediate draft sequence is generated, and subreads are mapped and aligned to the draft. The heteroduplex algorithm (hd-finder) takes the subread alignments and generates a read pileup from which variant sites are identified. At each site, the bases are sorted and counted by strand. The resulting 2×2 table of counts is subjected to Fisher’s exact test. The fraction of significant sites across the draft is used to determine whether a read contains a heteroduplex. Heteroduplex-flagged reads are split by strand and reprocessed, resulting in two HiFi reads, one for each strand.
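The per-site strand test described above can be illustrated with a minimal Python sketch. This is not the hd-finder implementation: the pileup representation (a list of columns, each a list of `(strand, base)` pairs), the function names, the choice of the two most frequent bases per site, and the thresholds `alpha` and `min_fraction` are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def strand_base_table(column):
    """Count the two most frequent bases by strand; None if only one base is seen."""
    base_counts = Counter(base for _, base in column)
    if len(base_counts) < 2:
        return None
    (major, _), (minor, _) = base_counts.most_common(2)
    by_pair = Counter(column)
    return (by_pair[("+", major)], by_pair[("+", minor)],
            by_pair[("-", major)], by_pair[("-", minor)])


def is_heteroduplex(pileup, alpha=0.01, min_fraction=0.001):
    """Flag a read when enough sites show strand-dependent base calls.

    alpha and min_fraction are hypothetical thresholds, not hd-finder defaults.
    """
    significant = 0
    for column in pileup:
        table = strand_base_table(column)
        if table is not None and fisher_exact_two_sided(*table) < alpha:
            significant += 1
    return len(pileup) > 0 and significant / len(pileup) >= min_fraction


# Toy read: 99 concordant sites and one site where the strands disagree.
clean = [("+", "A")] * 10 + [("-", "A")] * 10
split = [("+", "A")] * 10 + [("-", "G")] * 10
print(is_heteroduplex([clean] * 99 + [split]))
print(is_heteroduplex([clean] * 100))
```

In this toy input, the single discordant site has forward-strand subreads supporting one base and reverse-strand subreads supporting another, which is the strand-dependent signature the test is meant to detect. Random sequencing errors that are spread evenly across both strands do not produce a significant table.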
Using a heteroduplex-enriched amplicon library, we demonstrate that the accuracy of the hd-finder algorithm is >94%. We also show that applying hd-finder to amplified datasets improves the quality of downstream analysis of important human genes.
The heteroduplex algorithm is a powerful new method for improving HiFi data for amplicon targets. The method has been released (v6.3) and is documented at https://ccs.how/faq/mode-heteroduplex-filtering.html.