Detection and phasing of small variants in Genome in a Bottle samples with highly accurate long reads
Introduction: Long-read PacBio SMRT Sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw read error rates of 10-15%, it has remained difficult to call small variants from long reads. Recent improvements in library preparation, sequencing chemistry, and instrument yield have increased length, accuracy, and throughput of PacBio Circular Consensus (CCS) reads, resulting in 10-20 kb “HiFi” reads with mean read quality above 99%.

Materials and Methods: We sequenced 11 kb size-selected libraries from the Genome in a Bottle (GIAB) human reference samples HG001, HG002, and HG005 to approximately 30-fold coverage on the Sequel II System with six SMRT Cells 8M each. The CCS algorithm was used to generate highly accurate (average 99.8%) reads of mean length 10-11 kb, which were then mapped to the hs37d5 reference with pbmm2. We detected small variants using Google DeepVariant and compared these variant calls to GIAB benchmarks. Small variants were then phased with WhatsHap.

Results: With these long, highly accurate CCS reads, DeepVariant achieves high SNP and indel accuracy against the GIAB benchmark truth set for all three reference samples. Using WhatsHap, small variants were phased into haplotype blocks with N50 from 82 to 146 kb. The improved mappability of long reads allows detection of variants in many medically relevant genes, such as CYP2D6 and PMS2, that have proven ‘difficult-to-map’ with short reads. We show that small variant precision and recall remain high down to 15-fold coverage.

Conclusions: These highly accurate long reads combine the mappability of noisy long reads with the accuracy and small variant detection utility of short reads, which will allow the detection and phasing of variants in regions that have proven recalcitrant to short-read sequencing and variant detection.
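The summary statistics reported in the Results (haplotype block N50, precision, and recall) can be illustrated with a minimal Python sketch. The function names `block_n50` and `precision_recall` are hypothetical helpers, not part of WhatsHap, DeepVariant, or any GIAB tool, and the input numbers are invented for illustration rather than taken from this study. The N50 here follows the common definition (the block length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total); a given tool may define its reported N50 somewhat differently.

```python
from typing import Iterable, Tuple


def block_n50(block_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of haplotype block lengths (in bp).

    Blocks are sorted from longest to shortest; N50 is the length of the
    block at which the running total first reaches half of the summed length.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one block length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return lengths[-1]  # not reached for positive lengths; kept for safety


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative variant call counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("counts must allow a non-zero denominator")
    precision = tp / (tp + fp)  # fraction of calls that match the benchmark
    recall = tp / (tp + fn)     # fraction of benchmark variants recovered
    return precision, recall


if __name__ == "__main__":
    # Illustrative values only, not results from the abstract.
    example_blocks = [10_000, 20_000, 30_000, 40_000]
    print("N50:", block_n50(example_blocks))
    print("precision, recall:", precision_recall(tp=90, fp=10, fn=30))
```

In practice, the counts passed to `precision_recall` would come from comparing a call set to a benchmark with a dedicated comparison tool, and the block lengths from the phased output; the sketch only shows how the reported figures are defined.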