Introduction: Long-read sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw-read error rates (10-15%), it has remained difficult to call small variants from long reads. Recent improvements in library preparation and sequencing chemistry have increased length, accuracy, and throughput of PacBio circular consensus sequencing (CCS) reads, resulting in 15-20kb reads with average read quality above 99%. Materials and Methods: We sequenced a library from human reference sample HG002 to 18-fold coverage on the PacBio Sequel II with two SMRT Cells 8M. The CCS algorithm was used to generate highly accurate (average 99.9%) 12.9kb reads, which were mapped to the hg19 reference with pbmm2. We detected small variants using Google DeepVariant with a model trained for CCS and phased the variants using WhatsHap. Structural variants were detected with pbsv. Variant calls were evaluated against Genome in a Bottle (GIAB) benchmarks. Results: With these reads, DeepVariant achieves SNP and Indel F1 scores of 99.70% and 96.59% against the GIAB truth set, and pbsv achieves 97.72% recall on structural variants longer than 50bp. Using WhatsHap, small variants were phased into haplotype blocks with 145kb N50. The improved mappability of long reads allows us to align to and detect variants in medically relevant genes such as CYP2D6 and PMS2 that have proven “difficult-to-map” with short reads. Conclusions: These highly accurate long reads combine the mappability and ability to detect structural variants of long reads with the accuracy and ability to detect small variants of short reads.
June 1, 2021 |