The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease indicate that these regions represent a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.
Eukaryotic genomes are replete with repeated sequences in the form of transposable elements (TEs) dispersed across the genome or as satellite arrays, large stretches of tandemly repeated sequences. Many satellites clearly originated as TEs, but it is unclear how mobile genetic parasites can transform into megabase-sized tandem arrays. Comprehensive population genomic sampling is needed to determine the frequency and generative mechanisms of tandem TEs, at all stages from their initial formation to their subsequent expansion and maintenance as satellites. The best available population resources, short-read DNA sequences, are often considered to be of limited utility for analyzing repetitive DNA due to the challenge of mapping individual repeats to unique genomic locations. Here we develop a new pipeline called ConTExt that demonstrates that paired-end Illumina data can be successfully leveraged to identify a wide range of structural variation within repetitive sequence, including tandem elements. By analyzing 85 genomes from five populations of Drosophila melanogaster, we discover that TEs commonly form tandem dimers. Our results further suggest that insertion site preference is the major mechanism by which dimers arise and that, consequently, dimers form rapidly during periods of active transposition. This abundance of TE dimers has the potential to provide source material for future expansion into satellite arrays, and we discover one such copy number expansion of the DNA transposon hobo to approximately 16 tandem copies in a single line. The very process that defines TEs, transposition, thus regularly generates sequences from which new satellites can arise. © 2018 McGurk and Barbash; Published by Cold Spring Harbor Laboratory Press.
Y chromosomes control essential male functions in many species, including sex determination and fertility. However, because of obstacles posed by repeat-rich heterochromatin, knowledge of Y chromosome sequences is limited to a handful of model organisms, constraining our understanding of Y biology across the tree of life. Here, we leverage long single-molecule sequencing to determine the content and structure of the nonrecombining Y chromosome of the primary African malaria mosquito, Anopheles gambiae. We find that the An. gambiae Y consists almost entirely of a few massively amplified, tandemly arrayed repeats, some of which can recombine with similar repeats on the X chromosome. Sex-specific genome resequencing in a recent species radiation, the An. gambiae complex, revealed rapid sequence turnover within An. gambiae and among species. Exploiting 52 sex-specific An. gambiae RNA-Seq datasets representing all developmental stages, we identified a small repertoire of Y-linked genes that lack X gametologs and are not Y-linked in any other species except An. gambiae, with the notable exception of YG2, a candidate male-determining gene. YG2 is the only gene conserved and exclusive to the Y in all species examined, yet sequence similarity to YG2 is not detectable in the genome of a more distant mosquito relative, suggesting rapid evolution of Y chromosome genes in this highly dynamic genus of malaria vectors. The extensive characterization of the An. gambiae Y provides a long-awaited foundation for studying male mosquito biology, and will inform novel mosquito control strategies based on the manipulation of Y chromosomes.
Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster.
Highly repetitive satellite DNA (satDNA) repeats are found in most eukaryotic genomes. SatDNAs are rapidly evolving and have roles in genome stability and chromosome segregation. Their repetitive nature poses a challenge for genome assembly and makes progress on the detailed study of satDNA structure difficult. Here, we use single-molecule sequencing long reads from Pacific Biosciences (PacBio) to determine the detailed structure of all major autosomal complex satDNA loci in Drosophila melanogaster, with a particular focus on the 260-bp and Responder satellites. We determine the optimal de novo assembly methods and parameter combinations required to produce a high-quality assembly of these previously unassembled satDNA loci and validate this assembly using molecular and computational approaches. We determined that the computationally intensive PBcR-BLASR assembly pipeline yielded better assemblies than the faster and more efficient pipelines based on the MHAP hashing algorithm, and that it is essential to validate assemblies of repetitive loci. The assemblies reveal that satDNA repeats are organized into large arrays interrupted by transposable elements. The repeats in the center of the array tend to be homogenized in sequence, suggesting that gene conversion and unequal crossovers lead to repeat homogenization through concerted evolution, although the degree of unequal crossing over may differ among complex satellite loci. We find evidence for higher-order structure within satDNA arrays that suggests recent structural rearrangements. These assemblies provide a platform for the evolutionary and functional genomics of satDNAs in pericentric heterochromatin. © 2017 Khost et al.; Published by Cold Spring Harbor Laboratory Press.
Tandemly repeated sequences represent a unique class of eukaryotic DNA. Their content in the genomes of higher eukaryotes amounts to tens of percent. However, the evolution of this class of sequences is poorly studied. In this paper, 62 families of Mus musculus tandem repeats are analyzed by bioinformatic methods, and 7 of them are analyzed by fluorescence in situ hybridization. We show that the same sets of tandem repeats co-occur only in closely related species of mice. Even in such species, however, we observe differences in chromosomal localization and in the number of individual tandem repeats. With increasing evolutionary distance, only some of the tandem repeat families remain common to different species. Combining bioinformatics and molecular biology techniques is therefore very promising for further studies of the evolution of tandem repeats.
Genomic studies rely on accurate chromosome assemblies to explore sequence-based models of cell biology, evolution and biomedical disease. However, even the extensively studied human genome has not yet reached a complete, ‘telomere-to-telomere’, chromosome assembly. The largest assembly gaps remain in centromeric regions and acrocentric short arms, sites known to contain megabase-sized arrays of tandem repeats, or satellite DNAs. This review aims to briefly address the progress and challenges of generating correct assemblies of satellite DNA arrays. Although the focus is placed on the human genome, many concepts presented here are applicable to other genomes.
Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Unlike reads shorter than the underlying repeat structure, which rely on indirect inference methods such as assembly, long reads allow direct inference of satellite higher-order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly. We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite-containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere higher-order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a way to resolve repeat structure that could readily be applied, using available consensus sequences, to other satellite families in genomes without high-quality reference assemblies. Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
The content of repetitive DNA in avian genomes is considerably lower than in other investigated vertebrates. The first descriptions of tandem repeats were based on the results of routine biochemical and molecular biological experiments. Both satellite DNA and interspersed repetitive elements were annotated using a library-based approach and de novo repeat identification in the assembled genome. The development of deep-sequencing methods provides high-quality datasets without preassembly, allowing repetitive elements to be annotated from the unassembled part of genomes. In this work, we search the chicken assembly and annotate high-copy-number tandem repeats from unassembled short raw reads. The tandem repeat (GGAAA)n was identified and found to be the second most abundant in the chicken genome, after the telomeric repeat (TTAGGG)n. Furthermore, the (GGAAA)n repeat forms expanded arrays on both arms of the chicken W chromosome. Our results highlight the complexity of repetitive sequences and update data on the organization of the W sex chromosome in chicken.
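As a rough illustration of how a tandem repeat such as (GGAAA)n might be screened for directly in unassembled short reads, the sketch below counts the longest run of perfect motif copies on either strand of each read. It is a minimal example under stated assumptions, not the procedure used in the study: the function names, the `min_copies` threshold and the toy reads are all hypothetical, and only exact copies are counted.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_tandem_run(read, motif):
    """Largest number of consecutive perfect motif copies on either strand."""
    best = 0
    k = len(motif)
    for m in (motif, revcomp(motif)):
        start = read.find(m)
        while start != -1:
            # Count back-to-back copies beginning at this occurrence.
            copies = 0
            pos = start
            while read.startswith(m, pos):
                copies += 1
                pos += k
            best = max(best, copies)
            start = read.find(m, start + 1)
    return best


def fraction_of_reads_with_array(reads, motif, min_copies=3):
    """Fraction of reads carrying at least min_copies tandem copies (hypothetical cutoff)."""
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if longest_tandem_run(r.upper(), motif) >= min_copies)
    return hits / len(reads)


# Toy reads, invented for illustration only.
reads = ["TTGGAAAGGAAAGGAAACC", "ACGTACGTACGT", "TTTCCTTTCCTTTCCTTTCC"]
runs = [longest_tandem_run(r, "GGAAA") for r in reads]
```

Comparing this fraction between motifs gives only a crude relative abundance. A real analysis would also need to tolerate sequencing errors and diverged repeat copies.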
A substantial portion of the genomes of most multicellular eukaryotes consists of large arrays of tandemly repeated sequence, collectively called satellite DNA. The processes generating and maintaining different satellite DNA abundances across lineages are important to understand as satellites have been linked to chromosome mis-segregation, disease phenotypes, and reproductive isolation between species. While much theory has been developed to describe satellite evolution, empirical tests of these models have fallen short because of the challenges in assessing satellite repeat regions of the genome. Advances in computational tools and sequencing technologies now enable identification and quantification of satellite sequences genome-wide. Here, we describe some of these tools and how their applications are furthering our knowledge of satellite evolution and function. Copyright © 2018 Elsevier Ltd. All rights reserved.
Mitochondrial genomes of two diplectanids (Platyhelminthes: Monogenea) expose paraphyly of the order Dactylogyridea and extensive tRNA gene rearrangements.
Recent mitochondrial phylogenomics studies have reported a sister-group relationship of the orders Capsalidea and Dactylogyridea, which is inconsistent with previous morphology- and molecular-based phylogenies. As Dactylogyridea mitochondrial genomes (mitogenomes) are currently represented by only one family, to improve the phylogenetic resolution, we sequenced and characterized two dactylogyridean parasites, Lamellodiscus spari and Lepidotrema longipenis, belonging to a non-represented family Diplectanidae. The L. longipenis mitogenome (15,433 bp) contains the standard 36 flatworm mitochondrial genes (atp8 is absent), whereas we failed to detect trnS1, trnC and trnG in L. spari (14,614 bp). Both mitogenomes exhibit unique gene orders (among the Monogenea), with a number of tRNA rearrangements. Both long non-coding regions contain a number of different (partially overlapping) repeat sequences. Intriguingly, these include putative tRNA pseudogenes in a tandem array (17 trnV pseudogenes in L. longipenis, 13 trnY pseudogenes in L. spari). Combined nucleotide diversity, non-synonymous/synonymous substitutions ratio and average sequence identity analyses consistently showed that nad2, nad5 and nad4 were the most variable PCGs, whereas cox1, cox2 and cytb were the most conserved. Phylogenomic analysis showed that the newly sequenced species of the family Diplectanidae formed a sister-group with the Dactylogyridae + Capsalidae clade. Thus Dactylogyridea (represented by the Diplectanidae and Dactylogyridae) was rendered paraphyletic (with high statistical support) by the nested Capsalidea (represented by the Capsalidae) clade. Our results show that nad2, nad5 and nad4 (fast-evolving) would be better candidates than cox1 (slow-evolving) for species identification and population genetics studies in the Diplectanidae. The unique gene order pattern further suggests discontinuous evolution of mitogenomic gene order arrangement in the Class Monogenea. This first report of paraphyly of the Dactylogyridea highlights the need to generate more molecular data for monogenean parasites, in order to be able to clarify their relationships using large datasets, as single-gene markers appear to provide a phylogenetic resolution which is too low for the task.
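The nucleotide diversity comparison mentioned above can be illustrated with a small sketch that computes the average proportion of pairwise differences per site across aligned sequences of one gene. This is a generic textbook-style calculation under stated assumptions, not the software or settings used in the study: the function name, the gap handling and the toy alignment are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise proportion of differing sites among aligned sequences."""
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    pairs = 0
    for a, b in combinations(aligned, 2):
        # Assumption: columns with a gap in either sequence are skipped.
        cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if not cols:
            continue
        total += sum(x != y for x, y in cols) / len(cols)
        pairs += 1
    return total / pairs if pairs else 0.0


# Toy alignment, invented for illustration only.
pi_toy = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```

Running the same function on each protein-coding gene alignment, or over sliding windows, would give a per-gene ranking of the kind that separates variable genes from conserved ones.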
Pilot satellitome analysis of the model plant, Physcomitrella patens, revealed a transcribed and high-copy IGS-related tandem repeat.
Satellite DNA (satDNA) constitutes a substantial part of eukaryotic genomes. In the last decade, it has been shown that satDNA is not an inert part of the genome and its function extends beyond the nuclear membrane. However, the number of model plant species suitable for studying the novel horizons of satDNA functionality is low. Here, we explored the satellitome of the model “basal” plant, Physcomitrella patens (Hedwig, 1801) Bruch & Schimper, 1849 (moss), which has a number of advantages for deep functional and evolutionary research. Using a newly developed pyTanFinder pipeline (https://github.com/Kirovez/pyTanFinder) coupled with fluorescence in situ hybridization (FISH), we identified five high copy number tandem repeats (TRs) occupying a long DNA array in the moss genome. The nuclear organization study revealed that two TRs had distinct locations in the moss genome, concentrating in the heterochromatin and knob-rDNA-like chromatin bodies. Further genomic, epigenetic and transcriptomic analysis showed that one TR, named PpNATR76, was located in the intergenic spacer (IGS) region and transcribed into long non-coding RNAs (lncRNAs). Several specific features of PpNATR76 lncRNAs make them very similar to the recently discovered human lncRNAs, raising a number of questions for future studies. This work provides new resources for functional studies of the satellitome in plants using the model organism P. patens, and describes a list of tandem repeats for further analysis.