Telomeres and centromeres have long vexed genomic scientists. In the early days of genome sequencing, many researchers took it for granted that assembling these highly repetitive regions was essentially impossible.
That’s why a new preprint posted to bioRxiv is so exciting. Scientists from Weill Cornell Medicine and Colorado State University describe the use of PacBio long-read whole genome sequencing to analyze and assemble telomeres, characterizing the heterogeneity of these elements across three human genomes from the Genome in a Bottle collection (HG001, HG002, HG005).
“Haplotype Diversity and Sequence Heterogeneity of Human Telomeres” comes from lead authors Kirill Grigorev (@LankyCyril) and Jonathan Foox (@jfoox), senior author Chris Mason (@mason_lab), and collaborators. They took on this project to overcome existing challenges with assembling telomeres and to establish a better protocol that others could replicate.
“Given their length and repetitive nature, telomeric regions are not easily reconstructed from short read sequencing, making telomere sequence resolution a very costly and generally intractable problem,” the authors write. “We describe a framework for extracting telomeric reads from single-molecule sequencing experiments, describing their sequence variation and motifs, and for haplotype inference.”
Short reads, typically no more than a few hundred bases long, can sequence DNA in telomeric regions, but during alignment they often cannot be placed unambiguously within highly repetitive stretches, so several repeats tend to collapse into one. Highly accurate long PacBio CCS reads, known as HiFi reads and produced by SMRT Sequencing, can span tens of thousands of base pairs in a single stretch. This greatly reduces the alignment challenge, facilitating the accurate assembly of even the most repetitive regions in the genome.
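To make the alignment problem concrete, here is a minimal Python sketch that uses a toy reference and exact string matching in place of a real aligner. The flank sequence, the repeat count, the read lengths, and the `placements` helper are all illustrative assumptions, not anything taken from the preprint.

```python
def placements(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Toy chromosome end: a made-up unique flank followed by 50 canonical repeats.
unique_flank = "ACGTTGCATCGGATCCTAGCAT"
reference = unique_flank + "TTAGGG" * 50

# A "short read" that lies entirely inside the repeat array (18 bases).
short_read = "TTAGGG" * 3

# A "long read" that starts in the unique flank and runs into the repeats.
long_read = reference[10:210]

short_hits = placements(reference, short_read)
long_hits = placements(reference, long_read)

print("short read candidate positions:", len(short_hits))
print("long read candidate positions:", len(long_hits))
```

In this toy setup the short read fits equally well at many positions in the repeat array, while the long read is pinned to a single position by the unique sequence it carries with it. Real aligners score inexact matches and real telomeres are far longer, but the ambiguity arises in the same way.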
“We find that long telomeric stretches can be accurately captured with long-read sequencing,” the scientists report. In the preprint, they describe the ability to observe sequence heterogeneity, discover novel and known non-canonical motifs, and create motif composition maps. Their framework, known as edgeCase, was validated with PacBio sequencing data sets from the Genome in a Bottle consortium.
While the team’s results confirmed that TTAGGG, the canonical repeat associated with telomeric regions, is the dominant motif, there was “a surprising diversity of repeat variations” including known and novel variants. This diversity had previously gone undetected because of “the necessary bias towards the canonical motif during the selection of short reads,” the scientists suggest. “Telomeric regions with higher content of non-canonical repeats are less likely to be identified through the use of short reads, and instead, long reads appear to be more suitable for this purpose,” they add.
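The selection bias described here can be illustrated with a small Python sketch. This is not edgeCase or the authors' method. The `canonical_fraction`, `motif_composition`, and `passes_canonical_filter` helpers, the 0.5 threshold, and the toy reads (including TTGGGG as a stand-in non-canonical repeat) are assumptions made purely for illustration.

```python
from collections import Counter

CANONICAL = "TTAGGG"


def canonical_fraction(read, motif=CANONICAL):
    """Fraction of the read covered by non-overlapping copies of `motif`."""
    if not read:
        return 0.0
    return read.count(motif) * len(motif) / len(read)


def motif_composition(read, k=6):
    """Count k-mers in consecutive windows.

    Assumes the read starts in phase with the repeat unit, which real
    reads generally do not; this keeps the toy example short.
    """
    return Counter(read[i:i + k] for i in range(0, len(read) - k + 1, k))


def passes_canonical_filter(read, threshold=0.5):
    """Hypothetical read selection rule keyed to canonical motif content."""
    return canonical_fraction(read) >= threshold


# Two toy telomeric reads of equal length (120 bases).
canonical_read = CANONICAL * 20
variant_rich_read = (CANONICAL * 2 + "TTGGGG" * 3) * 4

for name, read in [("canonical", canonical_read),
                   ("variant-rich", variant_rich_read)]:
    print(name,
          round(canonical_fraction(read), 2),
          passes_canonical_filter(read),
          dict(motif_composition(read)))
```

Both toy reads are telomeric, yet a filter keyed to the canonical motif keeps only the first and discards the variant-rich one. Tallying motifs along a read, as `motif_composition` does in a very simplified way, is what allows such variants to show up at all.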
The team concludes: “The identified variations in long range contexts enable clustering of SMRT reads into distinct haplotypes at ends of chromosomes, and thus provide a new means of diplotype mapping and reveal the existence and motif composition of such diplotypes on a multi-Kbp scale.”
March 4, 2020 | Human genetics research