April 21, 2020

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.


April 21, 2020

Intragenomic heterogeneity of intergenic ribosomal DNA spacers in Cucurbita moschata is determined by DNA minisatellites with variable potential to form non-canonical DNA conformations.

The intergenic spacer (IGS) of rDNA is frequently built of long blocks of tandem repeats. To estimate the intragenomic variability of such knotty regions, we employed PacBio sequencing of the Cucurbita moschata genome, in which thousands of rDNA copies are distributed across a number of loci. The rRNA coding regions are highly conserved, indicating intensive interlocus homogenization and/or high selection pressure. However, the IGS exhibits high intragenomic structural diversity. Two repeated blocks, R1 (300-1250 bp) and R2 (290-643 bp), account for most of the IGS variation. They exhibit minisatellite-like features built of multiple periodically spaced short GC-rich sequence motifs with the potential to adopt non-canonical DNA conformations, G-quadruplex-folded and left-handed Z-DNA. The mutual arrangement of these motifs can be used to classify IGS variants into five structural families. Subtle polymorphisms exist within each family due to a variable number of repeats, suggesting the coexistence of an enormous number of IGS variants. The substantial length and structural heterogeneity of IGS minisatellites suggests that the tempo of their divergence exceeds the tempo of the homogenization of rDNA arrays. As frequently occurring among plants, we hypothesize that their instability may influence transcription regulation and/or destabilize rDNA units, possibly spreading them across the genome. © The Author(s) 2019. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

