Video Poster: Accurate, comprehensive variant calling in difficult-to-map genes using HiFi reads
Introduction: Around 5% (1,168) of protein-coding genes in the human genome contain an exon that is difficult to map with typical next-generation sequencing (NGS) read lengths due to homologous pseudogenes or segmental duplications. Among the difficult-to-map genes are 193 with known medical relevance, including CYP2D6, GBA, SMN1/2, and VWF. Long-read DNA sequencing provides increased mappability, accessing many of the difficult-to-map regions by connecting the homologous exon to neighboring unique sequence. Until recently, the read-level accuracy of long-read sequencing made it challenging to call small variants accurately. The recently developed HiFi reads from the PacBio Sequel II System provide both long read length (15–25 kb) for mappability and high read quality (>99%) for accurate variant calling, expanding the regions of the genome that can be characterized with high precision and recall.
Materials and Methods: Human reference sample HG002 was sequenced to 35-fold HiFi read coverage on the PacBio Sequel II System. Matched 35-fold coverage with NGS reads was obtained on the Illumina NovaSeq 6000. Reads were mapped to the GRCh38 reference genome using pbmm2 for HiFi reads and BWA for NGS reads. Small variants were called using DeepVariant. The variant callsets were compared to each other and to the Genome in a Bottle (GIAB) v4.1 benchmark within exons previously reported to be problematic for NGS.
Results: For difficult-to-map exons within the GIAB benchmark, HiFi reads detect 1,269 true benchmark variants, 21% more than are detected with NGS reads (1,053). Small variant precision in difficult-to-map exons is 97.7% for HiFi reads, markedly higher than 92.0% for NGS reads. Outside the benchmark, HiFi reads detect 241 small variants missed by NGS reads across 42 difficult-to-map exons of medically relevant genes, including 14 variants in C4A, 5 in SMN1, and 2 in STRC.
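The 21% figure in the Results can be checked from the two reported counts. The following is a minimal sketch of that arithmetic, not the authors' analysis code: the function name `relative_increase` and the variable names are illustrative assumptions, and only the counts 1,269 and 1,053 come from the abstract.

```python
def relative_increase(new: int, baseline: int) -> float:
    """Return the percent increase of `new` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (new - baseline) / baseline


# True benchmark variants detected in difficult-to-map exons,
# as reported in the Results (variable names are illustrative).
hifi_true_variants = 1269
ngs_true_variants = 1053

# Additional benchmark variants recovered by HiFi reads.
extra_variants = hifi_true_variants - ngs_true_variants

# Relative gain of HiFi over NGS, in percent.
gain_percent = relative_increase(hifi_true_variants, ngs_true_variants)

print(f"Additional true variants with HiFi: {extra_variants}")
print(f"Relative increase over NGS: {gain_percent:.1f}%")
```

Under these counts the difference is 216 variants, a relative increase of roughly 20.5%, consistent with the reported 21% after rounding to a whole percent.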
Conclusion: HiFi reads have both high mappability and high read quality, which enables accurate small variant calling in difficult-to-map genes that are challenging for NGS. We predict that large-scale use of HiFi reads in disease cohort studies will discover additional disease genes and variants that have remained beyond the reach of NGS.