AGBT 2024 | 2024
Egor Dolzhenko¹, Graham S Erwin², Katherine Wang², Zev Kronenberg¹, William J Rowell¹, Anna C Ferrari³, Garrison Pease³, Daniel Schwartz³, Benjamin Gartrell³, Ahmed Aboumohamed³, Alex Sankin³, Pedro Maria³, Kara Watts³, John M Greally⁴, Patrick Wilkinson⁵, Yashoda Rajpurohit⁵, John Loffredo⁵, Denis Smirnov⁵, Manuel A Sepulveda⁵, Charles G Drake⁵, Alex Robertson¹, Michael P Snyder², Michael A Eberle¹
1. PacBio, Menlo Park, CA, USA; 2. Stanford University, CA, USA; 3. Montefiore-Einstein Cancer Center, NY, USA; 4. Einstein Epigenomics Center, NY, USA; 5. Janssen Research and Development LLC, PA, USA
The human genome contains thousands of repeat-rich polymorphic regions whose structure has not been systematically described. These regions produce large collections of variant calls sometimes called variation clusters. Variation clusters are typically excluded from tertiary analysis because they are difficult to interpret and catalog. One example is the 3.5 kbp region in an intron of the KCNMB2 gene, which contains over 30 constituent simple repeats that jointly create many insertions, deletions, and mismatches in alignments of reads over this region. The corresponding variant calls are often incorrectly prioritized as potentially pathogenic, requiring significant resources to curate and rule out. To address these issues, we propose a novel computational framework to systematically detect, annotate, and catalog variation clusters. A distinguishing characteristic of our approach relative to traditional variant calling methods is the ability to annotate and resolve entire regions of high sequence polymorphism as single units instead of fragmenting them into variation clusters. These regions can subsequently be genotyped using the recently developed tandem repeat genotyping tool (TRGT). We show that our method can accurately detect reference coordinates and resolve structures of KCNMB2, MUC1, CEL, INS, and 20 other medically relevant variable number tandem repeats. Using real and simulated data, we also show that our method can locate and call pathogenic expansions of 50 disease-causing repeats and nine polyalanine repeats composed of highly variable motif sequences. To demonstrate the usefulness of our method for cancer genome studies, we applied it to normal, polyp, and adenocarcinoma PacBio HiFi samples originating from the same individual and identified a tandem repeat in the 5′ UTR of LIMD1, a reported tumor suppressor gene, that progressively expands in length from normal to polyp to adenocarcinoma samples.
To further highlight our ability to resolve variation, we characterized differences in repeat composition and methylation between three prostate tumors and their normal counterparts, and also in a panel of 100 unrelated genomes. To make all these analyses accessible to other genome researchers, we are releasing a learning resource with tutorials describing how to catalog variation in the polymorphic regions of the human genome in publicly available PacBio HiFi samples.
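The idea of treating a dense run of variant calls as one region rather than as many separate calls can be illustrated with a simple interval merge. The sketch below is not the framework described in this abstract, whose algorithm is not specified here; it is a minimal illustration under assumed inputs. The function name `merge_into_regions`, the `max_gap` and `min_calls` thresholds, and the `(chrom, start, end)` tuple layout are all hypothetical.

```python
from typing import List, Tuple

# Assumed input layout: (chromosome, start, end), 0-based half-open coordinates.
Variant = Tuple[str, int, int]
# Output layout: (chromosome, start, end, number of calls merged into the region).
Region = Tuple[str, int, int, int]


def merge_into_regions(
    variants: List[Variant],
    max_gap: int = 50,    # hypothetical threshold: max distance between neighboring calls
    min_calls: int = 3,   # hypothetical threshold: min calls for a region to be reported
) -> List[Region]:
    """Group nearby variant calls into candidate regions (illustrative only)."""
    regions: List[Region] = []
    current = None  # mutable [chrom, start, end, count] for the region being built

    for chrom, start, end in sorted(variants):
        same_chrom = current is not None and chrom == current[0]
        if same_chrom and start - current[2] <= max_gap:
            # Close enough to the open region: extend it and count the call.
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            # Too far away (or a new chromosome): flush the open region if dense enough.
            if current is not None and current[3] >= min_calls:
                regions.append(tuple(current))
            current = [chrom, start, end, 1]

    if current is not None and current[3] >= min_calls:
        regions.append(tuple(current))
    return regions


if __name__ == "__main__":
    calls = [
        ("chr3", 100, 101),
        ("chr3", 120, 125),
        ("chr3", 150, 151),
        ("chr3", 5000, 5001),  # isolated call, not reported as a region
        ("chr4", 10, 11),      # isolated call on another chromosome
    ]
    for region in merge_into_regions(calls):
        print(region)
```

In this toy setup, isolated calls are left out and only dense runs are reported as single units. A real analysis would need to consider repeat annotation and read-level evidence, which this sketch does not attempt.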