During cell division, spindle fibers attach to chromosomes at centromeres. The DNA sequence at regional centromeres is fast evolving with no conserved genetic signature for centromere identity. Instead CENH3, a centromere-specific histone H3 variant, is the epigenetic signature that specifies centromere location across both plant and animal kingdoms. Paradoxically, CENH3 is also adaptively evolving. An ongoing question is whether CENH3 evolution is driven by a functional relationship with the underlying DNA sequence. Here, we demonstrate that despite extensive protein sequence divergence, CENH3 histones from distant species assemble centromeres on the same underlying DNA sequence. We first characterized the organization and diversity of centromere repeats in wild-type Arabidopsis thaliana We show that A. thaliana CENH3-containing nucleosomes exhibit a strong preference for a unique subset of centromeric repeats. These sequences are largely missing from the genome assemblies and represent the youngest and most homogeneous class of repeats. Next, we tested the evolutionary specificity of this interaction in a background in which the native A. thaliana CENH3 is replaced with CENH3s from distant species. Strikingly, we find that CENH3 from Lepidium oleraceum and Zea mays, although specifying epigenetically weaker centromeres that result in genome elimination upon outcrossing, show a binding pattern on A. thaliana centromere repeats that is indistinguishable from the native CENH3. Our results demonstrate positional stability of a highly diverged CENH3 on independently evolved repeats, suggesting that the sequence specificity of centromeres is determined by a mechanism independent of CENH3.© 2017 Maheshwari et al.; Published by Cold Spring Harbor Laboratory Press.
Polyploidy is an example of instantaneous speciation when it involves the formation of a new cytotype that is incompatible with the parental species. Because new polyploid individuals are likely to be rare, establishment of a new species is unlikely unless polyploids are able to reproduce through self-fertilization (selfing), or asexually. Conversely, selfing (or asexuality) makes it possible for polyploid species to originate from a single individual-a bona fide speciation event. The extent to which this happens is not known. Here, we consider the origin of Arabidopsis suecica, a selfing allopolyploid between Arabidopsis thaliana and Arabidopsis arenosa, which has hitherto been considered to be an example of a unique origin. Based on whole-genome re-sequencing of 15 natural A. suecica accessions, we identify ubiquitous shared polymorphism with the parental species, and hence conclusively reject a unique origin in favor of multiple founding individuals. We further estimate that the species originated after the last glacial maximum in Eastern Europe or central Eurasia (rather than Sweden, as the name might suggest). Finally, annotation of the self-incompatibility loci in A. suecica revealed that both loci carry non-functional alleles. The locus inherited from the selfing A. thaliana is fixed for an ancestral non-functional allele, whereas the locus inherited from the outcrossing A. arenosa is fixed for a novel loss-of-function allele. Furthermore, the allele inherited from A. thaliana is predicted to transcriptionally silence the allele inherited from A. arenosa, suggesting that loss of self-incompatibility may have been instantaneous.© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
The next generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, the assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology of Illumina produces reads that are too short to bridge many repeats, setting limits on what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive and have a much higher error rate than Illumina’s – around 10-15% – mostly due to indels. New algorithms are very much needed to take advantage of the long reads while mitigating the effect of high error rate and lowering the required coverage.An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. High error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new pairwise read aligner, or overlapper, HISEA (Hierarchical SEed Aligner) for SMRT sequencing data. HISEA uses a novel two-step k-mer search, employing consistent clustering, k-mer filtering, and read alignment extension.We compare HISEA against several state-of-the-art programs – BLASR, DALIGNER, GraphMap, MHAP, and Minimap – on real datasets from five organisms. We compare their sensitivity, precision, specificity, F1-score, as well as time and memory usage. We also introduce a new, more precise, evaluation method. Finally, we compare the two leading programs, MHAP and HISEA, for their genome assembly performance in the Canu pipeline.Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the current best. The currently best assembler for SMRT data is the Canu program which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner in the Canu pipeline and benchmarked it against the best pipeline for multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP for both coverage levels. Moreover, Canu+HISEA assemblies for 30x coverage are comparable with Canu+MHAP assemblies for 50x coverage, while being faster and cheaper.The HISEA algorithm produces alignments with highest sensitivity compared with the current state-of-the-art algorithms. Integrated in the Canu pipeline, currently the best for assembling PacBio data, it produces better assemblies than Canu+MHAP.
Single Molecule Real-Time (SMRT) sequencing has been widely applied in cutting-edge genomic studies. However, it is still an expensive task to align the noisy long SMRT reads to reference genome by state-of-the-art aligners, which is becoming a bot-tleneck in applications with SMRT sequencing. Novel approach is on demand for improving the efficiency and effectiveness of SMRT read alignment.We propose Regional Hashing-based Alignment Tool (rHAT), a seed-and-extension-based read alignment approach specifically designed for noisy long reads. rHAT indexes reference genome by regional hash table (RHT), a hash table-based index which describes the short tokens within local windows of reference genome. In the seeding phase, rHAT utilizes RHT for efficiently calculating the occurrences of short token matches between partial read and local genomic windows to find highly possible candidate sites. In the extension phase, a sparse dynamic programming-based heuristic approach is used for reducing the cost of aligning read to the candidate sites. By benchmarking on the real and simulated datasets from various prokaryote and eukaryote genomes, we demonstrated that rHAT can effectively align SMRT reads with outstanding throughput. rHAT is implemented in C++; the source code is available at https://github.com/derekguan/rHAT CONTACT: firstname.lastname@example.org. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com.
Environmentally sensitive plant gene families like NBS-LRRs, receptor kinases, defensins and others, are known to be highly variable. However, most existing strategies for discovering and describing structural variation in complex gene families provide incomplete and imperfect results. The move to de novo genome assemblies for multiple accessions or individuals within a species is enabling more comprehensive and accurate insights about gene family variation. Earlier array-based genome hybridization and sequence-based read mapping methods were limited by their reliance on a reference genome and by misplacement of paralogous sequences. Variant discovery based on de novo genome assemblies overcome the problems arising from a reference genome and reduce sequence misplacement. As de novo genome sequencing moves to the use of longer reads, artifacts will be minimized, intact tandem gene clusters will be constructed accurately, and insights into rapid evolution will become feasible. Copyright © 2016 Elsevier Ltd. All rights reserved.
Population scale mapping of transposable element diversity reveals links to gene regulation and epigenomic variation.
Variation in the presence or absence of transposable elements (TEs) is a major source of genetic variation between individuals. Here, we identified 23,095 TE presence/absence variants between 216 Arabidopsis accessions. Most TE variants were rare, and we find these rare variants associated with local extremes of gene expression and DNA methylation levels within the population. Of the common alleles identified, two thirds were not in linkage disequilibrium with nearby SNPs, implicating these variants as a source of novel genetic diversity. Many common TE variants were associated with significantly altered expression of nearby genes, and a major fraction of inter-accession DNA methylation differences were associated with nearby TE insertions. Overall, this demonstrates that TE variants are a rich source of genetic diversity that likely plays an important role in facilitating epigenomic and transcriptional differences between individuals, and indicates a strong genetic basis for epigenetic variation.
Arabidopsis thaliana serves as a model organism for the study of fundamental physiological, cellular, and molecular processes. It has also greatly advanced our understanding of intraspecific genome variation. We present a detailed map of variation in 1,135 high-quality re-sequenced natural inbred lines representing the native Eurasian and North African range and recently colonized North America. We identify relict populations that continue to inhabit ancestral habitats, primarily in the Iberian Peninsula. They have mixed with a lineage that has spread to northern latitudes from an unknown glacial refugium and is now found in a much broader spectrum of habitats. Insights into the history of the species and the fine-scale distribution of genetic diversity provide the basis for full exploitation of A. thaliana natural variation through integration of genomes and epigenomes with molecular and non-molecular phenotypes. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15-40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm Sparc to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path which approximates the most likely genome sequence is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time. Availability. The source code is available for download at https://github.com/yechengxi/Sparc.
Meiotic crossover frequency varies extensively along chromosomes and is typically concentrated in hotspots. As recombination increases genetic diversity, hotspots are predicted to occur at immunity genes, where variation may be beneficial. A major component of plant immunity is recognition of pathogen Avirulence (Avr) effectors by resistance (R) genes that encode NBS-LRR domain proteins. Therefore, we sought to test whether NBS-LRR genes would overlap with meiotic crossover hotspots using experimental genetics in Arabidopsis thaliana. NBS-LRR genes tend to physically cluster in plant genomes; for example, in Arabidopsis most are located in large clusters on the south arms of chromosomes 1 and 5. We experimentally mapped 1,439 crossovers within these clusters and observed NBS-LRR gene associated hotspots, which were also detected as historical hotspots via analysis of linkage disequilibrium. However, we also observed NBS-LRR gene coldspots, which in some cases correlate with structural heterozygosity. To study recombination at the fine-scale we used high-throughput sequencing to analyze ~1,000 crossovers within the RESISTANCE TO ALBUGO CANDIDA1 (RAC1) R gene hotspot. This revealed elevated intragenic crossovers, overlapping nucleosome-occupied exons that encode the TIR, NBS and LRR domains. The highest RAC1 recombination frequency was promoter-proximal and overlapped CTT-repeat DNA sequence motifs, which have previously been associated with plant crossover hotspots. Additionally, we show a significant influence of natural genetic variation on NBS-LRR cluster recombination rates, using crosses between Arabidopsis ecotypes. In conclusion, we show that a subset of NBS-LRR genes are strong hotspots, whereas others are coldspots. This reveals a complex recombination landscape in Arabidopsis NBS-LRR genes, which we propose results from varying coevolutionary pressures exerted by host-pathogen relationships, and is influenced by structural heterozygosity.
Approximately seven hundred 45S rRNA genes (rDNA) in the Arabidopsis thaliana genome are organised in two 4 Mbp-long arrays of tandem repeats arranged in head-to-tail fashion separated by an intergenic spacer (IGS). These arrays make up 5?% of the A. thaliana genome. IGS are rapidly evolving sequences and frequent rearrangements inside the rDNA loci have generated considerable interspecific and even intra-individual variability which allows to distinguish among otherwise highly conserved rRNA genes. The IGS has not been comprehensively described despite its potential importance in regulation of rDNA transcription and replication. Here we describe the detailed sequence variation in the complete IGS of A. thaliana WT plants and provide the reference/consensus IGS sequence, as well as genomic DNA analysis. We further investigate mutants dysfunctional in chromatin assembly factor-1 (CAF-1) (fas1 and fas2 mutants), which are known to have a reduced number of rDNA copies, and plant lines with restored CAF-1 function (segregated from a fas1xfas2 genetic background) showing major rDNA rearrangements. The systematic rDNA loss in CAF-1 mutants leads to the decreased variability of the IGS and to the occurrence of distinct IGS variants. We present for the first time a comprehensive and representative set of complete IGS sequences, obtained by conventional cloning and by Pacific Biosciences sequencing. Our data expands the knowledge of the A. thaliana IGS sequence arrangement and variability, which has not been available in full and in detail until now. This is also the first study combining IGS sequencing data with RFLP analysis of genomic DNA.
Assemblytics is a web app for detecting and analyzing variants from a de novo genome assembly aligned to a reference genome. It incorporates a unique anchor filtering approach to increase robustness to repetitive elements, and identifies six classes of variants based on their distinct alignment signatures. Assemblytics can be applied both to comparing aberrant genomes, such as human cancers, to a reference, or to identify differences between related species. Multiple interactive visualizations enable in-depth explorations of the genomic distributions of variants.http://assemblytics.com, https://github.com/marianattestad/assemblytics CONTACT: firstname.lastname@example.orgSupplementary information: Supplementary data are available at Bioinformatics online.© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: email@example.com.
Chloroplast genome sequence of Arabidopsis thaliana accession Landsberg erecta, assembled from single-molecule, real-time sequencing data.
A publicly available data set from Pacific Biosciences was used to create an assembly of the chloroplast genome sequence of the Arabidopsis thaliana genotype Landsberg erecta The assembly is solely based on single-molecule, real-time sequencing data and hence provides high resolution of the two inverted repeat regions typically contained in chloroplast genomes. Copyright © 2016 Stadermann et al.
Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limits their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging textquotedbllefthybridtextquotedblright assemblies that use long reads for scaffolding and short reads for accuracy. To this end, we describe a novel application of a multi-string Burrows-Wheeler transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We show that our method efficiently produces significantly higher quality corrected sequence than existing hybrid error-correction methods. We demonstrate the effectiveness of our method compared to state-of-the-art hybrid and long-read only de novo assembly methods.
Arabidopsis thaliana remains the foremost model system for plant genetics and genomics, and researchers rely on the accuracy of its genomic resources. The first completely sequenced angiosperm mitochondrial genome was obtained from Arabidopsis C24 (Unseld et al., 1997), and more recent efforts have produced additional Arabidopsis reference genomes, including one for Col-0, the most widely used ecotype in plant genetic research (Davila et al., 2011). These studies were based on older DNA sequencing methods, making them subject to errors associated with lower levels of sequencing coverage or the extremely short read lengths produced by early-generation Illumina technologies. Indeed, although the more recently published Arabidopsis mitochondrial reference genome sequences made substantial progress in improving upon earlier versions, they still have high error rates. By comparing publicly available Illumina sequence data to the Arabidopsis Col-0 reference genome, we found that it contains a sequence error every 2.4 kb on average, including 57 single-nucleotide polymorphisms (SNPs), 96 indels (up to 901 bp in size), and a large repeat-mediated rearrangement. Most of these errors appear to have been carried over from the original Arabidopsis mitochondrial genome sequence by reference-based assembly approaches, which has misled subsequent studies of plant mitochondrial mutation and molecular evolution by giving the false impression that the errors are naturally occurring variants present in multiple ecotypes. Building on the progress made by previous researchers, we provide a corrected reference sequence that we hope will serve as a useful community resource for future investigations in the field of plant mitochondrial genetics.