Though the advent of long-read sequencing technologies has led to a leap in contiguity of de novo genome assemblies, current reference genomes of higher organisms still do not provide unbroken sequences of complete chromosomes. Despite reads in excess of 30 000 base pairs, there are still repetitive structures that cannot be resolved by current state-of-the-art assemblers. The most challenging of these structures are tandemly arrayed repeats, which occur in the genomes of all eukaryotes. Untangling tandem repeat clusters is exceptionally difficult, since the rare differences between repeat copies are obscured by the high error rate of long reads. Solving this problem would constitute a major step towards computing fully assembled genomes. Here, we demonstrate by example of the Drosophila Histone Complex that via machine learning algorithms, it is possible to exploit the underlying distinguishing patterns of single nucleotide variants of repeats from very noisy data to resolve a large and highly conserved repeat cluster. The ideas explored in this paper are a first step towards the automated assembly of complex repeat structures and promise to be applicable to a wide range of eukaryotic genomes. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
Journal: Nucleic acids research