Genome assembly Archives - Page 110 of 196

July 7, 2019

Complete sequences of organelle genomes from the medicinal plant Rhazya stricta (Apocynaceae) and contrasting patterns of mitochondrial genome evolution across asterids.

Rhazya stricta is native to arid regions in South Asia and the Middle East and is used extensively in folk medicine to treat a wide range of diseases. In addition to generating genomic resources for this medicinally important plant, analyses of the complete plastid and mitochondrial genomes and a nuclear transcriptome from Rhazya provide insights into inter-compartmental transfers between genomes and the patterns of evolution among eight asterid mitochondrial genomes. The 154,841 bp plastid genome is highly conserved, with gene content and order identical to the ancestral organization of angiosperms. The 548,608 bp mitochondrial genome exhibits a number of phenomena, including the presence of recombinogenic repeats that generate a multipartite organization, transferred DNA from the plastid and nuclear genomes, and bidirectional DNA transfers between the mitochondrion and the nucleus. The mitochondrial genes sdh3 and rps14 have been transferred to the nucleus and have acquired targeting presequences. In the case of rps14, two copies are present in the nucleus; only one has a mitochondrial targeting presequence and may be functional. Phylogenetic analyses of both nuclear and mitochondrial copies of rps14 across angiosperms suggest that Rhazya has experienced a single transfer of this gene to the nucleus, followed by a duplication event. Furthermore, the phylogenetic distribution of gene losses and the high level of sequence divergence in targeting presequences suggest multiple, independent transfers of both sdh3 and rps14 across asterids.
Comparative analyses of mitochondrial genomes of eight sequenced asterids indicate a complicated evolutionary history in this large angiosperm clade, with considerable diversity in genome organization and size, repeat, gene and intron content, and amount of foreign DNA from the plastid and nuclear genomes. Organelle genomes of Rhazya stricta provide valuable information for improving the understanding of mitochondrial genome evolution among angiosperms. The genomic data have enabled a rigorous examination of the gene transfer events. Rhazya is unique among the eight sequenced asterids in the types of events that have shaped the evolution of its mitochondrial genome. Furthermore, the organelle genomes of R. stricta provide valuable genomic resources for utilizing this important medicinal plant in biotechnology applications.

July 7, 2019

Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences.

To assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. All assembly tools except CLC Genomics Workbench are freely available under the GNU General Public License. Contact: brownsd@ornl.gov. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
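The N50 statistic mentioned above can be illustrated with a short sketch. This is a minimal example of the usual definition (the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length); the function name `n50` and the example lengths are assumptions for illustration and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    # Sort contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length
    return 0  # no contigs supplied


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([100, 80, 60, 40, 20]))
```

A higher N50 generally indicates a more contiguous assembly, although on its own it says nothing about correctness, which is presumably why the study also validated rDNA operon predictions experimentally.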

July 7, 2019

Enhancing the detection of barcoded reads in high throughput DNA sequencing data by controlling the false discovery rate.

DNA barcodes are short unique sequences used to label DNA or RNA-derived samples in multiplexed deep sequencing experiments. During the demultiplexing step, barcodes must be detected and their position identified. In some cases (e.g., with PacBio SMRT), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives. For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads, and we suggest possible improvements. In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. This problem became more acute when the length of the barcode sequence decreased and the number of barcodes in the set increased. The method presented in this paper controls the tail area-based false discovery rate to distinguish between barcoded and unbarcoded reads. This method helps to establish the highest acceptable minimal distance between reads and barcode sequences. In a proof of concept experiment we correctly detected barcodes in 83% of the reads with a precision of 89%. Sensitivity improved to 99% at 99% precision when the adjacent primer sequence was incorporated in the analysis. The analysis was further improved using a paired end strategy.
Following an analysis of the data for sequence variants induced in the Atp1a1 gene of C57BL/6 murine melanocytes by ultraviolet light and conferring resistance to ouabain, we found no evidence of cross-contamination of DNA material between samples. Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. It is based on false discovery rate statistics, which allow a proper trade-off between sensitivity and precision to be chosen.
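The idea of accepting a barcode only when its distance to the read stays below a chosen cutoff can be sketched as follows. This is a minimal illustration, not the method from the paper: the FDR procedure that would select the cutoff is not reproduced, and `max_distance`, the function names and the example barcodes are all hypothetical. Levenshtein (edit) distance is assumed as the distance measure.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def best_barcode(read_prefix, barcodes, max_distance):
    """Return (name, distance) for the closest barcode.

    The name is None when even the closest barcode is further away than
    max_distance, i.e. the read is treated as unbarcoded.
    """
    name, dist = min(
        ((n, edit_distance(read_prefix, seq)) for n, seq in barcodes.items()),
        key=lambda item: item[1],
    )
    return (name, dist) if dist <= max_distance else (None, dist)


if __name__ == "__main__":
    # Hypothetical barcode set, for illustration only.
    barcodes = {"bc1": "ACGTACGT", "bc2": "TTGGCCAA"}
    print(best_barcode("ACGTACCT", barcodes, max_distance=1))
```

Raising `max_distance` recovers more barcoded reads but admits more coincidental matches, which is the sensitivity versus precision trade-off the abstract describes.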

July 7, 2019

Organellar genomes of the four-toothed moss, Tetraphis pellucida.

Mosses are the largest of the three extant clades of gametophyte-dominant land plants and remain poorly studied using comparative genomic methods. Major monophyletic moss lineages are characterised by different types of a spore dehiscence apparatus called the peristome, and the most important unsolved problem in higher-level moss systematics is the branching order of these peristomate clades. Organellar genome sequencing offers the potential to resolve this issue through the provision of both genomic structural characters and a greatly increased quantity of nucleotide substitution characters, as well as to elucidate organellar evolution in mosses. We publish and describe the chloroplast and mitochondrial genomes of Tetraphis pellucida, representative of the most phylogenetically intractable and morphologically isolated peristomate lineage. Assembly of reads from Illumina SBS and Pacific Biosciences RS sequencing reveals that the Tetraphis chloroplast genome comprises 127,489 bp and the mitochondrial genome 107,730 bp. Although genomic structures are similar to those of the small number of other known moss organellar genomes, the chloroplast lacks the petN gene (in common with Tortula ruralis) and the mitochondrion has only a non-functional pseudogenised remnant of nad7 (uniquely amongst known moss chondromes). Structural genomic features exist with the potential to be informative for phylogenetic relationships amongst the peristomate moss lineages, and thus organellar genome sequences are urgently required for exemplars from other clades. The unique genomic and morphological features of Tetraphis confirm its importance for resolving one of the major questions in land plant phylogeny and for understanding the evolution of the peristome, a likely key innovation underlying the diversity of mosses.
The functional loss of nad7 from the chondrome is now shown to have occurred independently in all three bryophyte clades as well as in the early-diverging tracheophyte Huperzia squarrosa.

July 7, 2019

Complete genome sequence of highly adherent Pseudomonas aeruginosa small-colony variant SCV20265.

The evolution of small-colony variants within Pseudomonas aeruginosa populations chronically infecting the cystic fibrosis lung is one example of the emergence of adapted subpopulations. Here, we present the complete genome sequence of the autoaggregative and hyperpiliated P. aeruginosa small-colony variant SCV20265, which was isolated from a cystic fibrosis (CF) patient.

July 7, 2019

Genome sequence of Candidatus Nitrososphaera evergladensis from group I.1b enriched from Everglades soil reveals novel genomic features of the ammonia-oxidizing archaea.

The activity of ammonia-oxidizing archaea (AOA) leads to the loss of nitrogen from soil, pollution of water sources and elevated emissions of greenhouse gas. To date, eight AOA genomes are available in the public databases; seven are from the group I.1a of the Thaumarchaeota and only one is from the group I.1b, isolated from hot springs. Many soils are dominated by AOA from the group I.1b, but the genomes of soil representatives of this group have not been sequenced and functionally characterized. The lack of knowledge of metabolic pathways of soil AOA presents a critical gap in understanding their role in biogeochemical cycles. Here, we describe the first complete genome of the soil archaeon Candidatus Nitrososphaera evergladensis, which has been reconstructed from metagenomic sequencing of a highly enriched culture obtained from an agricultural soil. The AOA enrichment was sequenced with the high throughput next generation sequencing platforms from Pacific Biosciences and Ion Torrent. The de novo assembly of sequences resulted in one 2.95 Mb contig. Annotation of the reconstructed genome revealed many similarities of the basic metabolism with the rest of the sequenced AOA. Ca. N. evergladensis belongs to the group I.1b and shares only 40% of whole-genome homology with the closest sequenced relative Ca. N. gargensis. Detailed analysis of the genome revealed coding sequences that were completely absent from the group I.1a. These unique sequences code for proteins involved in control of DNA integrity, transporters, two-component systems and a versatile CRISPR defense system. Notably, genomes from the group I.1b have more gene duplications compared to the genomes from the group I.1a. We suggest that the presence of these unique genes and gene duplications may be associated with the environmental versatility of this group.

July 7, 2019

Complete genome sequences of eight Helicobacter pylori strains with different virulence factor genotypes and methylation profiles, isolated from patients with diverse gastrointestinal diseases on Okinawa Island, Japan, determined using PacBio Single-Molecule Real-Time Technology.

We report the complete genome sequences of eight Helicobacter pylori strains isolated from patients with gastrointestinal diseases in Okinawa, Japan. Whole-genome sequencing and DNA methylation detection were performed using the PacBio platform. De novo assembly determined a single, complete contig for each strain. Furthermore, methylation analysis identified virulence factor genotype-dependent motifs.

July 7, 2019

proovread: large-scale high-accuracy PacBio correction through iterative short read consensus.

Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, the resulting reads are short, making their analysis challenging. Recently, a new technology, single molecule real-time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. However, its broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in the case of large sequencing projects. Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high-performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread-corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing. proovread is available at the following URL: http://proovread.bioapps.biozentrum.uni-wuerzburg.de. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
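The general principle of correcting a long read by short read consensus can be sketched with a per-position majority vote. This is a deliberately simplified illustration and not proovread's implementation: it assumes the short reads have already been placed on the long read without gaps (no indels), performs a single pass rather than iterating, and the function name `consensus_correct` and the example reads are hypothetical.

```python
from collections import Counter


def consensus_correct(long_read, placements):
    """Correct long_read by majority vote over aligned short reads.

    placements is a list of (offset, short_read) pairs, where offset is
    the position in long_read at which the short read starts. The
    alignments are assumed to be ungapped, which real data would not be.
    """
    # One vote counter per position of the long read.
    columns = [Counter() for _ in long_read]
    for offset, short_read in placements:
        for i, base in enumerate(short_read):
            pos = offset + i
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    corrected = []
    for original, votes in zip(long_read, columns):
        # Keep the original base where no short read provides coverage.
        corrected.append(votes.most_common(1)[0][0] if votes else original)
    return "".join(corrected)


if __name__ == "__main__":
    # Hypothetical long read with one erroneous base at index 4.
    noisy = "ACGTTCGA"
    shorts = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGA")]
    print(consensus_correct(noisy, shorts))
```

In an iterative scheme, the corrected read would be used as the target for another round of short read mapping, so that regions too erroneous to attract alignments in the first pass can be recovered later.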

July 7, 2019

Insights into the preservation of the homomorphic sex-determining chromosome of Aedes aegypti from the discovery of a male-biased gene tightly linked to the M-locus.

The preservation of a homomorphic sex-determining chromosome in some organisms without transformation into a heteromorphic sex chromosome is a long-standing enigma in evolutionary biology. A dominant sex-determining locus (or M-locus) in an undifferentiated homomorphic chromosome confers the male phenotype in the yellow fever mosquito Aedes aegypti. Genetic evidence suggests that the M-locus is in a nonrecombining region. However, the molecular nature of the M-locus has not been characterized. Using a recently developed approach based on Illumina sequencing of male and female genomic DNA, we identified a novel gene, myo-sex, that is present almost exclusively in the male genome but can sporadically be found in the female genome due to recombination. For simplicity, we define sequences that are primarily found in the male genome as male-biased. Fluorescence in situ hybridization (FISH) on A. aegypti chromosomes demonstrated that the myo-sex probe localized to region 1q21, the established location of the M-locus. Myo-sex is a duplicated myosin heavy chain gene that is highly expressed in the pupa and adult male. Myo-sex shares 83% nucleotide identity and 97% amino acid identity with its closest autosomal paralog, consistent with ancient duplication followed by strong purifying selection. Compared with males, myo-sex is expressed at very low levels in the females that acquired it, indicating that myo-sex may be sexually antagonistic. This study establishes a framework to discover male-biased sequences within a homomorphic sex-determining chromosome and offers new insights into the evolutionary forces that have impeded the expansion of the nonrecombining M-locus in A. aegypti.

July 7, 2019

Characterization of biological pathways associated with a 1.37 Mbp genomic region protective of hypertension in Dahl S rats.

The goal of the present study was to narrow a region of chromosome 13 to only several genes and then apply unbiased statistical approaches to identify molecular networks and biological pathways relevant to blood-pressure salt sensitivity in Dahl salt-sensitive (SS) rats. The analysis of 13 overlapping subcongenic strains identified a 1.37 Mbp region on chromosome 13 that influenced the mean arterial blood pressure by at least 25 mmHg in SS rats fed a high-salt diet. DNA sequencing and analysis filled genomic gaps and provided identification of five genes in this region, Rfwd2, Fam5b, Astn1, Pappa2, and Tnr. A cross-platform normalization of transcriptome data sets obtained from our previously published Affymetrix GeneChip dataset and newly acquired RNA-seq data from renal outer medullary tissue provided 90 observations for each gene. Two Bayesian methods were used to analyze the data: 1) a linear model analysis to assess 243 biological pathways for their likelihood to discriminate blood pressure levels across experimental groups and 2) a Bayesian graphical modeling of pathways to discover genes with potential relationships to the candidate genes in this region. As none of these five genes are known to be involved in hypertension, this unbiased approach has provided useful clues to be experimentally explored. Of these five genes, Rfwd2, the gene most strongly expressed in the renal outer medulla, was notably associated with pathways that can affect blood pressure via renal transcellular Na(+) and K(+) electrochemical gradients and tubular Na(+) transport, mitochondrial TCA cycle and cell energetics, and circadian rhythms. Copyright © 2014 the American Physiological Society.

July 7, 2019

LUMPY: a probabilistic framework for structural variant discovery.

Comprehensive discovery of structural variation (SV) from whole genome sequencing data requires multiple detection signals including read-pair, split-read, read-depth and prior knowledge. Owing to technical challenges, extant SV discovery algorithms either use one signal in isolation, or at best use two sequentially. We present LUMPY, a novel SV discovery framework that naturally integrates multiple SV signals jointly across multiple samples. We show that LUMPY yields improved sensitivity, especially when SV signal is reduced owing to either low coverage data or low intra-sample variant allele frequency. We also report a set of 4,564 validated breakpoints from the NA12878 human genome. LUMPY is available at https://github.com/arq5x/lumpy-sv.

July 7, 2019

Association mapping, patterns of linkage disequilibrium and selection in the vicinity of the PHYTOCHROME C gene in pearl millet.

Linkage analysis confirmed the association in the region of PHYC in pearl millet. The comparison of genes found in this region suggests that PHYC is the best candidate. Major efforts are currently underway to dissect the phenotype-genotype relationship in plants and animals using existing populations. This method exploits historical recombinations accumulated in these populations. However, linkage disequilibrium sometimes extends over a relatively long distance, particularly in genomic regions containing polymorphisms that have been targets for selection. In this case, many genes in the region could be statistically associated with the trait shaped by the selected polymorphism. Statistical analyses could help in identifying the best candidate genes into such a region where an association is found. In a previous study, we proposed that a fragment of the PHYTOCHROME C gene (PHYC) is associated with flowering time and morphological variations in pearl millet. In the present study, we first performed linkage analyses using three pearl millet F2 families to confirm the presence of a QTL in the vicinity of PHYC. We then analyzed a wider genomic region of ~100 kb around PHYC to pinpoint the gene that best explains the association with the trait in this region. A panel of 90 pearl millet inbred lines was used to assess the association. We used a Markov chain Monte Carlo approach to compare 75 markers distributed along this 100-kb region. We found the best candidate markers on the PHYC gene. Signatures of selection in this region were assessed in an independent data set and pointed to the same gene. These results foster confidence in the likely role of PHYC in phenotypic variation and encourage the development of functional studies.

July 7, 2019

Dissecting a hidden gene duplication: the Arabidopsis thaliana SEC10 locus.

Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.

July 7, 2019

Genome sequence and methylome of soil bacterium Gemmatirosa kalamazoonensis KBS708(T), a member of the rarely cultivated Gemmatimonadetes phylum.

Bacteria belonging to the phylum Gemmatimonadetes are found in a wide variety of environments and are particularly abundant in soils. Here, we present the complete genome sequence and methylation pattern of the newly described Gemmatirosa kalamazoonensis type strain.

July 7, 2019

FGAP: an automated gap closing tool.

The rapid fall in DNA sequencing prices has allowed rapid accumulation of genome data. However, the process of obtaining complete genome sequences is still very time consuming and labor demanding. In addition, data produced from various sequencing technologies or alternative assemblies remain underexplored as a means to improve the assembly of incomplete genome sequences. We have developed FGAP, a tool for closing gaps of draft genome sequences that takes advantage of different datasets. FGAP uses BLAST to align multiple contigs against a draft genome assembly, aiming to find sequences that overlap gaps. The algorithm selects the best sequence to fill and eliminate the gap. FGAP reduced the number of gaps by 78% in an E. coli draft genome assembly using two different sequencing technologies, Illumina and 454. Using PacBio long reads, 98% of gaps were solved. In human chromosome 14 assemblies, FGAP reduced the number of gaps by 35%. All the inserted sequences were validated with a reference genome using QUAST. The source code and a web tool are available at http://www.bioinfo.ufpr.br/fgap/.
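The gap closing idea can be sketched in a few lines. This is a toy illustration rather than FGAP's algorithm: exact substring matching of the flanking sequences stands in for the BLAST alignments that FGAP uses, the first matching contig is taken instead of selecting the best candidate, and the function name `close_gaps`, the `flank` parameter and the example sequences are hypothetical.

```python
import re


def close_gaps(draft, contigs, flank=5):
    """Fill runs of N in draft using contigs that span the gap.

    A gap is filled when some contig contains the `flank` bases to the
    left of the gap followed, further along, by the `flank` bases to its
    right. Gaps without full flanks or without a spanning contig are
    left unchanged.
    """
    pieces = []
    last = 0
    for gap in re.finditer(r"N+", draft):
        left = draft[max(0, gap.start() - flank):gap.start()]
        right = draft[gap.end():gap.end() + flank]
        fill = None
        # Only attempt gaps that have complete flanks on both sides.
        if len(left) == flank and len(right) == flank:
            for contig in contigs:
                i = contig.find(left)
                if i == -1:
                    continue
                j = contig.find(right, i + flank)
                if j != -1:
                    # Sequence between the two flanks replaces the gap.
                    fill = contig[i + flank:j]
                    break
        pieces.append(draft[last:gap.start()])
        pieces.append(gap.group() if fill is None else fill)
        last = gap.end()
    pieces.append(draft[last:])
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical draft with one gap and one spanning contig.
    draft = "ACGTACGGTANNNNNTTGACCAGT"
    print(close_gaps(draft, ["CGGTACCCGGTTGACTT"]))
```

Note that the filled sequence need not have the same length as the run of N characters, since gap sizes in draft assemblies are often only estimates.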
