Introduction: There are many clinically important genes in “dark” regions of the human genome. These regions are called dark because short-read sequencing or mapping difficulties leave them with little NGS coverage. Low sequencing yield can arise in these regions from repeat elements or biased base composition, while inaccurate mapping is attributable to segmental duplications. Long-read sequencing coupled with an optimized, robust enrichment method has the potential to illuminate these dark regions.
Materials and Methods: Using PacBio highly accurate long-read (HiFi) Sequencing coupled with a long-range PCR targeted enrichment method, we investigated two important dark-region genes that are challenging to type accurately with short-read sequencing because of their associated pseudogenes: CYP21A2, responsible for congenital adrenal hyperplasia, and GBA, responsible for Gaucher disease. For each gene, our aim was to cover the regions of pathogenic mutations in a single contiguous sequence or set of sequences that can be assayed in a single reaction. CYP21A2 and its associated pseudogene CYP21A1P were co-amplified in a single long-range PCR, generating 10.2 kb and 8.9 kb amplicons, respectively. Similarly, GBA and its associated pseudogene GBAP1 were co-amplified in a single long-range PCR, generating 12.6 kb and 16.0 kb amplicons, respectively. Seven Coriell samples for the CYP21A2 target region and 13 Coriell samples for the GBA target region, all containing known pathogenic mutations, were studied in replicate. SMRTbell libraries were generated from pooled amplicons for each target gene and sequenced on a PacBio Sequel II System. Accounting for replicates, each library contained a multiplex of 24 samples. A new PacBio sequence clustering algorithm, pbAA, designed for rapid analysis of HiFi reads from amplicons, was used for variant typing.
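As a rough illustration of the amplicon design described above, the sketch below bins a read by its length against the expected gene and pseudogene amplicon sizes. It is not the pbAA algorithm and not the method used in this study; the function name, the dictionary, and the 5% tolerance are assumptions introduced only for this example. Structural variants such as fusions or deletions can change amplicon length, so length alone cannot type a sample and could serve at most as a first-pass sanity check.

```python
from typing import Optional, Sequence

# Expected amplicon sizes in base pairs, taken from the kb values in the abstract.
EXPECTED_AMPLICONS = {
    "CYP21A2": 10_200,
    "CYP21A1P": 8_900,
    "GBA": 12_600,
    "GBAP1": 16_000,
}


def assign_by_length(
    read_length: int,
    targets: Sequence[str],
    tolerance: float = 0.05,  # hypothetical relative tolerance, not from the source
) -> Optional[str]:
    """Return the target whose expected size is closest to read_length,
    or None if no target lies within the relative tolerance."""
    best_name = None
    best_diff = None
    for name in targets:
        expected = EXPECTED_AMPLICONS[name]
        diff = abs(read_length - expected)
        within = diff <= tolerance * expected
        if within and (best_diff is None or diff < best_diff):
            best_name, best_diff = name, diff
    return best_name


if __name__ == "__main__":
    # Illustrative read lengths only; these are not data from the study.
    for length in (10_150, 8_950, 9_500):
        print(length, assign_by_length(length, ["CYP21A2", "CYP21A1P"]))
```

A read that falls outside both size windows returns None, which in practice would flag it for sequence-based analysis rather than assign it to either target.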
Results: All pathogenic CYP21A2 and GBA variants were accurately called in the test samples. These variants included whole-gene deletions, gene duplication, gene fusions, and recombinant exons. Additionally, phasing of complex heterozygotes was achieved.
Conclusion: We demonstrate that long-read HiFi Sequencing provides new opportunities for sequencing clinically relevant but previously dark regions of the human genome that are underrepresented in short-read sequencing. Accurate long reads provide important phasing information, identify structural variations, and avoid potential confusion with pseudogenes. SMRT Sequencing of these regions enables a better understanding of the relationship between genetic factors and personal health and has the potential to ultimately help guide health-related decisions.