Repetitive sequence Archives - Page 6 of 6

September 22, 2019

A complete Cannabis chromosome assembly and adaptive admixture for elevated cannabidiol (CBD) content

Cannabis has been cultivated for millennia with distinct cultivars providing either fiber and grain or tetrahydrocannabinol. Recent demand for cannabidiol rather than tetrahydrocannabinol has favored the breeding of admixed cultivars with extremely high cannabidiol content. Despite several draft Cannabis genomes, the genomic structure of cannabinoid synthase loci has remained elusive. A genetic map derived from a tetrahydrocannabinol/cannabidiol segregating population and a complete chromosome assembly from a high-cannabidiol cultivar together resolve the linkage of cannabidiolic and tetrahydrocannabinolic acid synthase gene clusters which are associated with transposable elements. High-cannabidiol cultivars appear to have been generated by integrating hemp-type cannabidiolic acid synthase gene clusters into a background of marijuana-type cannabis. Quantitative trait locus mapping suggests that overall drug potency, however, is associated with other genomic regions needing additional study.

September 22, 2019

Physiological genomics of dietary adaptation in a marine herbivorous fish

Adopting a new diet is a significant evolutionary change and can profoundly affect an animal's physiology, biochemistry, ecology, and its genome. To study this evolutionary transition, we investigated the physiology and genomics of digestion of a derived herbivorous fish, the monkeyface prickleback (Cebidichthys violaceus). We sequenced and assembled its genome and digestive transcriptome and revealed the molecular changes related to important dietary enzymes, finding abundant evidence for adaptation at the molecular level. In this species, two gene families experienced expansion in copy number and adaptive amino acid substitutions. These families, amylase and bile salt activated lipase, are involved in the digestion of carbohydrates and lipids, respectively. Both show elevated levels of gene expression and increased enzyme activity. Because carbohydrates are abundant in the prickleback's diet and lipids are rare, these findings suggest that such dietary specialization involves both exploiting abundant resources and scavenging rare ones, especially essential nutrients, like essential fatty acids.

September 22, 2019

TranSurVeyor: an improved database-free algorithm for finding non-reference transpositions in high-throughput sequencing data.

Transpositions transfer DNA segments between different loci within a genome; in particular, when a transposition is found in a sample but not in a reference genome, it is called a non-reference transposition. They are important structural variations that have clinical impact. Transpositions can be called by analyzing second-generation high-throughput sequencing datasets. Current methods follow either a database-based or a database-free approach. Database-based methods require a database of transposable elements. Some of them have good specificity; however, this approach cannot detect novel transpositions, and it requires a good database of transposable elements, which is not yet available for many species. Database-free methods perform de novo calling of transpositions, but their accuracy is low. We observe that this is due to the misalignment of the reads; since reads are short and the human genome has many repeats, false alignments create false positive predictions while missing alignments reduce the true positive rate. This paper proposes new techniques to improve database-free non-reference transposition calling: first, we propose a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats; second, we propose an SNV-aware filter that removes some incorrectly aligned reads. By combining these two techniques with other techniques such as clustering and a positive-to-negative ratio filter, our proposed transposition caller TranSurVeyor shows at least a 3.1-fold improvement in F1-score over existing database-free methods. More importantly, even though TranSurVeyor does not use databases of prior information, its performance is at least as good as existing database-based methods such as MELT, Mobster and Retroseq. We also illustrate that TranSurVeyor can discover transpositions that are not known in the current database.
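For readers unfamiliar with the metric, the F1-score mentioned above is the harmonic mean of precision and recall. The sketch below shows the standard definition only; the function name and the example counts are illustrative assumptions, not figures from the TranSurVeyor evaluation.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1-score: harmonic mean of precision and recall.

    Written in the equivalent closed form 2TP / (2TP + FP + FN).
    """
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No calls and no true events: the score is undefined, report 0.0 here.
        return 0.0
    return 2 * true_positives / denominator


# Hypothetical caller: 80 correct calls, 20 false calls, 20 missed events.
example_f1 = f1_score(80, 20, 20)
```

Because false alignments add false positives and missing alignments add false negatives, both failure modes described in the abstract lower this single number.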

September 22, 2019

The chromosome-level quality genome provides insights into the evolution of the biosynthesis genes for aroma compounds of Osmanthus fragrans.

Sweet osmanthus (Osmanthus fragrans) is a very popular ornamental tree species throughout Southeast Asia and the USA, particularly for its extremely fragrant aroma. We constructed a chromosome-level reference genome of O. fragrans to assist in studies of the evolution, genetic diversity, and molecular mechanism of aroma development. A total of over 118 Gb of polished reads was produced from HiSeq (45.1 Gb) and PacBio Sequel (73.35 Gb), giving 100× depth coverage for long reads. The combination of Illumina short reads, PacBio long reads, and Hi-C data produced the final chromosome-quality genome of O. fragrans with a genome size of 727 Mb and a heterozygosity of 1.45 %. The genome was annotated using de novo and homology comparison and further refined with transcriptome data. The genome of O. fragrans was predicted to have 45,542 genes, of which 95.68 % were functionally annotated. Genome annotation identified 49.35 % of the genome as repetitive sequences, with long terminal repeats (LTR) being the most abundant (28.94 %). Genome evolution analysis indicated evidence of a whole-genome duplication 15 million years ago, which contributed to the current content of 45,242 genes. Metabolic analysis revealed that linalool, a monoterpene, is the main aroma compound. Based on the genome and transcriptome, we further demonstrated the direct connection between terpene synthases (TPSs) and the rich aromatic molecules in O. fragrans. We identified three new flower-specific TPS genes, of which the expression coincided with the production of linalool. Our results suggest that the high number of TPS genes and the expression of flower tissue- and stage-specific TPS genes might drive the strong unique aroma production of O. fragrans.
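The quoted long-read depth can be checked with simple arithmetic: sequenced bases divided by genome size. The snippet below is only a worked check using the numbers from the abstract; the function name is an assumption made for illustration.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Average depth of coverage = sequenced bases / genome size."""
    total_bases_mb = total_bases_gb * 1000  # 1 Gb = 1000 Mb
    return total_bases_mb / genome_size_mb


# Figures taken from the abstract: 73.35 Gb of PacBio reads, 727 Mb genome.
long_read_depth = sequencing_depth(73.35, 727)
# Same calculation for the 45.1 Gb of HiSeq reads.
short_read_depth = sequencing_depth(45.1, 727)
```

With these inputs the long-read depth comes out at roughly 100×, consistent with the figure stated in the abstract, and the short-read depth at roughly 62×.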

September 22, 2019

Correcting palindromes in long reads after whole-genome amplification.

Next-generation sequencing requires sufficient DNA to be available. If limited, whole-genome amplification is applied to generate additional amounts of DNA. Such amplification often results in many chimeric DNA fragments, in particular artificial palindromic sequences, which limit the usefulness of long sequencing reads. Here, we present Pacasus, a tool for correcting such errors. Two datasets show that it markedly improves read mapping and de novo assembly, yielding results similar to those that would be obtained with non-amplified DNA. With Pacasus, long-read technologies become available for sequencing targets with very small amounts of DNA, such as single cells or even single chromosomes.
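To make the idea of an artificial palindrome concrete: a chimeric read of this kind contains a sequence followed by its own reverse complement. The toy sketch below finds such a junction by exact matching and keeps one arm. It is an illustration under simplifying assumptions (error-free reads, hypothetical function names and a hypothetical `min_arm` threshold) and is not the method Pacasus implements, which has to tolerate the error rates of real long reads.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def palindrome_arm(read: str, junction: int) -> int:
    """Count bases that pair as reverse complements on both sides of junction."""
    arm = 0
    left, right = junction - 1, junction
    while left >= 0 and right < len(read) and read[left].translate(COMPLEMENT) == read[right]:
        arm += 1
        left -= 1
        right += 1
    return arm


def split_palindrome(read: str, min_arm: int = 8) -> str:
    """Return the longer arm if the read folds back on itself, else the read unchanged."""
    best_junction, best_arm = 0, 0
    for junction in range(1, len(read)):
        arm = palindrome_arm(read, junction)
        if arm > best_arm:
            best_junction, best_arm = junction, arm
    if best_arm < min_arm:
        return read  # no convincing palindrome found
    left_part, right_part = read[:best_junction], read[best_junction:]
    return left_part if len(left_part) >= len(right_part) else right_part


original = "ACGTTGCAAGGCTA"
chimera = original + reverse_complement(original)
corrected = split_palindrome(chimera)
```

In this toy case the chimera is split at its midpoint and the original 14-base sequence is recovered, while a read without a long palindromic arm is returned untouched.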

September 22, 2019

Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies.

Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs. A new pipeline, Purge Haplotigs, was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs. Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing-based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.

September 22, 2019

Unexpected patterns of segregation distortion at a selfish supergene in the fire ant Solenopsis invicta.

The Sb supergene in the fire ant Solenopsis invicta determines the form of colony social organization, with colonies whose inhabitants bear the element containing multiple reproductive queens and colonies lacking it containing only a single queen. Several features of this supergene (including suppressed recombination, presence of deleterious mutations, association with a large centromere, and “green-beard” behavior) suggest that it may be a selfish genetic element that engages in transmission ratio distortion (TRD), defined as significant departures in progeny allele frequencies from Mendelian inheritance ratios. We tested this possibility by surveying segregation ratios in embryo progenies of 101 queens of the “polygyne” social form (3512 embryos) using three supergene-linked markers and twelve markers outside the supergene. Significant departures from Mendelian ratios were observed at the supergene loci in 3-5 times more progenies than expected in the absence of TRD and than found, on average, among non-supergene loci. Also, supergene loci displayed the greatest mean deviations from Mendelian ratios among all study loci, although these typically were modest. A surprising feature of the observed inter-progeny variation in TRD was that significant deviations involved not only excesses of supergene alleles but also similarly frequent excesses of the alternate alleles on the homologous chromosome. As expected given the common occurrence of such “drive reversal” in this system, alleles associated with the supergene gain no consistent transmission advantage over their alternate alleles at the population level.
Finally, we observed low levels of recombination and incomplete gametic disequilibrium across the supergene, including between adjacent markers within a single inversion. Our data confirm the prediction that the Sb supergene is a selfish genetic element capable of biasing its own transmission during reproduction, yet counterselection for suppressor loci evidently has produced an evolutionary stalemate in TRD between the variant homologous haplotypes on the “social chromosome”. Evidence implicates prezygotic segregation distortion as responsible for the TRD we document, with “true” meiotic drive the most likely mechanism. Low levels of recombination and incomplete gametic disequilibrium across the supergene suggest that selection does not preserve a single uniform supergene haplotype responsible for inducing polygyny.
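A departure from Mendelian ratios in a single progeny can be illustrated with a chi-square goodness-of-fit test against the expected 1:1 segregation of two alleles from a heterozygous queen. The abstract does not state which statistical test the authors used, so the sketch below is an assumption chosen for illustration, and the embryo counts are hypothetical.

```python
import math


def mendelian_chi_square(count_allele_a: int, count_allele_b: int) -> tuple[float, float]:
    """Chi-square test (1 degree of freedom) against a 1:1 segregation ratio.

    Returns the test statistic and its p-value.
    """
    total = count_allele_a + count_allele_b
    if total == 0:
        raise ValueError("at least one offspring must be genotyped")
    expected = total / 2
    statistic = ((count_allele_a - expected) ** 2 + (count_allele_b - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical progeny: 30 embryos carry the supergene allele, 10 the alternate allele.
statistic, p_value = mendelian_chi_square(30, 10)
```

Note that the test is symmetric: an excess of the alternate allele (10 versus 30) gives the same statistic, which is why "drive reversal" progenies register as significant deviations just as supergene-favouring progenies do.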

September 22, 2019

The genome of the tegu lizard Salvator merianae: combining Illumina, PacBio, and optical mapping data to generate a highly contiguous assembly.

Reptiles are a species-rich group with great phenotypic and life history diversity but are highly underrepresented among the vertebrate species with sequenced genomes. Here, we report a high-quality genome assembly of the tegu lizard, Salvator merianae, the first lacertoid with a sequenced genome. We combined 74X Illumina short-read, 29.8X Pacific Biosciences long-read, and optical mapping data to generate a high-quality assembly with a scaffold N50 value of 55.4 Mb. The contig N50 value of this assembly is 521 Kb, making it the most contiguous reptile assembly so far. We show that the tegu assembly has the highest completeness of coding genes and conserved non-exonic elements (CNEs) compared to other reptiles. Furthermore, the tegu assembly has the highest number of evolutionarily conserved CNE pairs, corroborating a high assembly contiguity in intergenic regions. As in other reptiles, long interspersed nuclear elements comprise the most abundant transposon class. We used transcriptomic data, homology- and de novo gene predictions to annotate 22,413 coding genes, of which 16,995 (76%) likely have human orthologs as inferred by CESAR-derived gene mappings. Finally, we generated a multiple genome alignment comprising 10 squamates and 7 other amniote species and identified conserved regions that are under evolutionary constraint. CNEs cover 38 Mb (1.8%) of the tegu genome, with 3.3 Mb in these elements being squamate specific. In contrast to placental mammal-specific CNEs, very few of these squamate-specific CNEs (<20 Kb) overlap transposons, highlighting a difference in how lineage-specific CNEs originated in these two clades. The tegu lizard genome together with the multiple genome alignment and comprehensive conserved element datasets provide a valuable resource for comparative genomic studies of reptiles and other amniotes.
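The scaffold and contig N50 values quoted above follow the usual definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation, with a made-up list of lengths rather than the tegu assembly itself:

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Illustrative lengths only (arbitrary units).
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Here the total is 54, and the three longest sequences (10 + 9 + 8 = 27) already cover half of it, so the N50 is 8.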

September 22, 2019

Genomic characterization of a B chromosome in Lake Malawi cichlid fishes.

B chromosomes (Bs) were discovered a century ago, and since then, most studies have focused on describing their distribution and abundance using traditional cytogenetics. Only recently have attempts been made to understand their structure and evolution at the level of DNA sequence. Many questions regarding the origin, structure, function, and evolution of B chromosomes remain unanswered. Here, we identify B chromosome sequences from several species of cichlid fish from Lake Malawi by examining the ratios of DNA sequence coverage in individuals with or without B chromosomes. We examined the efficiency of this method, and compared results using both Illumina and PacBio sequence data. The B chromosome sequences detected in 13 individuals from 7 species were compared to assess the rates of sequence replacement. B-specific sequences common to at least 12 of the 13 datasets were identified as the “Core” B chromosome. The location of B sequence homologs throughout the genome provides further support for theories of B chromosome evolution. Finally, we identified genes and gene fragments located on the B chromosome, some of which may regulate the segregation and maintenance of the B chromosome.
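The coverage-ratio idea can be sketched as follows: normalise per-window read depth within each individual, then flag windows whose depth is clearly higher in the individual carrying a B chromosome. Everything specific here (the window names, the median normalisation, the `min_ratio` threshold and the pseudocount) is an assumption for illustration and not the authors' pipeline.

```python
from statistics import median


def b_enriched_windows(
    depth_with_b: dict[str, float],
    depth_without_b: dict[str, float],
    min_ratio: float = 2.0,
    pseudocount: float = 1.0,
) -> list[str]:
    """Return windows whose normalised depth is at least min_ratio times higher with B."""
    median_with = median(depth_with_b.values())
    median_without = median(depth_without_b.values())
    if median_with <= 0 or median_without <= 0:
        raise ValueError("median depth must be positive to normalise")
    flagged = []
    for window, depth in depth_with_b.items():
        # The pseudocount keeps the ratio finite for windows with zero depth.
        norm_with = (depth + pseudocount) / median_with
        norm_without = (depth_without_b.get(window, 0.0) + pseudocount) / median_without
        if norm_with / norm_without >= min_ratio:
            flagged.append(window)
    return flagged


# Hypothetical read depths for three genomic windows.
with_b = {"win1": 100, "win2": 100, "win3": 400}
without_b = {"win1": 100, "win2": 100, "win3": 100}
candidates = b_enriched_windows(with_b, without_b)
```

In this toy input only `win3` is flagged, since its depth is about four times higher in the B-carrying individual while the other windows are unchanged.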

September 22, 2019

N6-methyladenine DNA methylation in Japonica and Indica rice genomes and its association with gene expression, plant development, and stress responses.

N6-Methyladenine (6mA) DNA methylation has recently been implicated as a potential new epigenetic marker in eukaryotes, including the dicot model Arabidopsis thaliana. However, the conservation and divergence of 6mA distribution patterns and functions in plants remain elusive. Here we report high-quality 6mA methylomes at single-nucleotide resolution in rice based on substantially improved genome sequences of two rice cultivars, Nipponbare (Nip; Japonica) and 93-11 (Indica). Analysis of 6mA genomic distribution and its association with transcription suggests that 6mA distribution and function are rather conserved between rice and Arabidopsis. We found that 6mA levels are positively correlated with the expression of key stress-related genes, which may be responsible for the difference in stress tolerance between Nip and 93-11. Moreover, we showed that mutations in DDM1 cause defects in plant growth and decreased 6mA level. Our results reveal that 6mA is a conserved DNA modification that is positively associated with gene expression and contributes to key agronomic traits in plants. Copyright © 2018 The Author. Published by Elsevier Inc. All rights reserved.

September 22, 2019

Detection and visualization of complex structural variants from long reads.

With applications in cancer, drug metabolism, and disease etiology, understanding structural variation in the human genome is critical in advancing the thrusts of individualized medicine. However, structural variants (SVs) remain challenging to detect with high sensitivity using short read sequencing technologies. This problem is exacerbated when considering complex SVs comprised of multiple overlapping or nested rearrangements. Longer reads, such as those from Pacific Biosciences platforms, often span multiple breakpoints of such events, and thus provide a way to unravel small-scale complexities in SVs with higher confidence. We present CORGi (COmplex Rearrangement detection with Graph-search), a method for the detection and visualization of complex local genomic rearrangements. This method leverages the ability of long reads to span multiple breakpoints to untangle SVs that appear very complicated with respect to a reference genome. We validated our approach against both simulated long reads, and real data from two long read sequencing technologies. We demonstrate the ability of our method to identify breakpoints inserted in synthetic data with high accuracy, and the ability to detect and plot SVs from NA12878 germline, achieving 88.4% concordance between the two sets of sequence data. The patterns of complexity we find in many NA12878 SVs match known mechanisms associated with DNA replication and structural variant formation, and highlight the ability of our method to automatically label complex SVs with an intuitive combination of adjacent or overlapping reference transformations. CORGi is a method for interrogating genomic regions suspected to contain local rearrangements using long reads. Using pairwise alignments and graph search, CORGi produces labels and visualizations for local SVs of arbitrary complexity.

September 21, 2019

Assessing genome assembly quality using the LTR Assembly Index (LAI).

Assembling a plant genome is challenging due to the abundance of repetitive sequences, yet no standard is available to evaluate the assembly of repeat space. LTR retrotransposons (LTR-RTs) are the predominant interspersed repeat that is poorly assembled in draft genomes. Here, we propose a reference-free genome metric called LTR Assembly Index (LAI) that evaluates assembly continuity using LTR-RTs. After correcting for LTR-RT amplification dynamics, we show that LAI is independent of genome size, genomic LTR-RT content, and gene space evaluation metrics (i.e., BUSCO and CEGMA). By comparing genomic sequences produced by various sequencing techniques, we reveal the significant gain of assembly continuity by using long-read-based techniques over short-read-based methods. Moreover, LAI can facilitate iterative assembly improvement with assembler selection and identify low-quality genomic regions. To apply LAI, intact LTR-RTs and total LTR-RTs should contribute at least 0.1% and 5% to the genome size, respectively. The LAI program is freely available on GitHub: https://github.com/oushujun/LTR_retriever.
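The applicability thresholds stated above (intact LTR-RTs at least 0.1% and total LTR-RTs at least 5% of genome size) are easy to check before computing the index. The sketch below also includes the raw, uncorrected ratio of intact to total LTR-RT sequence as a percentage; that formula is not spelled out in the abstract and is included here as an assumption, without the correction for amplification dynamics that the abstract mentions. Function names and example numbers are hypothetical.

```python
def lai_applicable(genome_bp: int, intact_ltr_bp: int, total_ltr_bp: int) -> bool:
    """Check the thresholds from the abstract: intact >= 0.1% and total >= 5% of the genome."""
    intact_ok = intact_ltr_bp * 1000 >= genome_bp  # intact / genome >= 0.1%
    total_ok = total_ltr_bp * 20 >= genome_bp      # total / genome >= 5%
    return intact_ok and total_ok


def raw_lai(intact_ltr_bp: int, total_ltr_bp: int) -> float:
    """Uncorrected index (assumed form): intact LTR-RT length as a percentage of total."""
    if total_ltr_bp <= 0:
        raise ValueError("total LTR-RT length must be positive")
    return 100 * intact_ltr_bp / total_ltr_bp


# Hypothetical 1 Gb genome with 2 Mb of intact and 100 Mb of total LTR-RT sequence.
applicable = lai_applicable(1_000_000_000, 2_000_000, 100_000_000)
index = raw_lai(2_000_000, 100_000_000)
```

Integer comparisons are used for the thresholds so that values sitting exactly on the 0.1% or 5% boundary are not affected by floating-point rounding.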

September 21, 2019

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
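The core idea behind locality-sensitive hashing for read overlap can be shown with a small MinHash sketch: each read is reduced to the minimum hash of its k-mers under several hash functions, and the fraction of matching minima estimates the Jaccard similarity of the two k-mer sets. This is a generic illustration, not MHAP's implementation; the hash construction (seeded SHA-1), the default `k`, the sketch size, and the function names are all assumptions.

```python
import hashlib


def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of k-mers in a sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def seeded_hash(kmer: str, seed: int) -> int:
    """Deterministic 64-bit hash of a k-mer; each seed acts as a different hash function."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(sequence: str, k: int = 4, num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value of the k-mer set under each hash function."""
    kmer_set = kmers(sequence, k)
    return [min(seeded_hash(kmer, seed) for kmer in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of sketch positions that agree, an estimate of k-mer set similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


read = "ACGTACGTTAGC"
same = estimate_jaccard(minhash_sketch(read), minhash_sketch(read))
different = estimate_jaccard(minhash_sketch("AAAAAAAAAA"), minhash_sketch("CCCCCCCCCC"))
```

Comparing fixed-size sketches instead of full reads is what lets this style of overlap detection scale: the cost per read pair depends on the sketch size rather than on read length.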
