September 22, 2019

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Due to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often because short contigs cannot be anchored to chromosomes or are mispositioned on them. Here we report a novel method, Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats with high efficiency using single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved on the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in the maize B73 assembly and from 8.3 Mb to 54.4 Mb in the human HX1 assembly with HERA. We provided a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high-quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.


September 22, 2019

Comparative genomics provides insights into the marine adaptation in sponge-derived Kocuria flava S43.

Sponge-derived actinomycetes represent a significant component of marine actinomycetes. Members of the genus Kocuria are distributed in various habitats such as soil, rhizosphere, clinical specimens, marine sediments, and sponges; however, to date, little is known about the mechanism of their environmental adaptation. Kocuria flava S43 was isolated from a coastal sponge. Phylogenetic analysis revealed that it was closely related to the terrestrial airborne K. flava HO-9041. In this study, to gain insights into the marine adaptation in K. flava S43, we sequenced the draft genome of K. flava S43 by third generation sequencing (TGS) and compared it with those of K. flava HO-9041 and some other Kocuria relatives. Comparative genomics and phylogenetic analyses revealed that K. flava S43 might adapt to the marine environment mainly by increasing the number of genes linked to potassium homeostasis, resistance to heavy metals and phosphate metabolism, and by acquiring genes associated with electron transport and genes encoding ATP-binding cassette (ABC) transporter, aquaporin, and thiol/disulfide interchange protein. Notably, gene acquisition was probably a primary mechanism of environmental adaptation in K. flava S43. Furthermore, this study also indicated that the Kocuria isolates from various marine and hyperosmotic environments possessed a common genetic basis for environmental adaptation.


September 22, 2019

Genome Assembly.

Genome assembly uses sequence similarity to go from sequencing reads to longer contiguous sequences (contigs). Scaffolds are contigs linked together by gaps where the order and orientation of the contigs is known but the exact sequence connecting two contigs is unknown, represented by Ns which estimate the gap length. Here we describe recommendations for genome assembly for different sequencing technologies, describe organelle assembly, and review how to perform assembly quality control.
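The relationship between scaffolds, contigs and gap Ns described above can be illustrated with a minimal Python sketch. It assumes that every run of N marks one gap, and it uses a toy sequence invented for illustration; the function names `split_scaffold` and `n50` are hypothetical and not taken from any particular assembly tool. The N50 computed here is one contiguity statistic commonly reported during assembly quality control.

```python
import re


def split_scaffold(scaffold: str) -> list[str]:
    """Split a scaffold into its contigs at runs of N (gap placeholders)."""
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]


def n50(lengths: list[int]) -> int:
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy scaffold (assumed data): three contigs joined by two gaps of Ns.
scaffold = "ACGTACGTAC" + "N" * 5 + "ACGT" + "N" * 3 + "ACGTAC"
contig_lengths = [len(c) for c in split_scaffold(scaffold)]
print(contig_lengths)       # [10, 4, 6]
print(n50(contig_lengths))  # 10
```

Because the Ns only estimate gap length, they are excluded from contig lengths here; a scaffold N50 would instead be computed on whole scaffold lengths.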


September 22, 2019

Improved de novo genome assembly and analysis of the Chinese cucurbit Siraitia grosvenorii, also known as monk fruit or luo-han-guo.

Luo-han-guo (Siraitia grosvenorii), also called monk fruit, is a member of the Cucurbitaceae family. Monk fruit has become an important area for research because of the pharmacological and economic potential of its noncaloric, extremely sweet components (mogrosides). It is also commonly used in traditional Chinese medicine for the treatment of lung congestion, sore throat, and constipation. Recently, a single reference genome became available for monk fruit, assembled from 36.9x genome coverage reads via Illumina sequencing platforms. This genome assembly has a relatively short (34.2 kb) contig N50 length and lacks integrated annotations. These drawbacks make it difficult to use as a reference in assembling transcriptomes and discovering novel functional genes. Here, we offer a new high-quality draft of the S. grosvenorii genome assembled using 31 Gb (~73.8x) long single molecule real time sequencing reads and polished with ~50 Gb Illumina paired-end reads. The final genome assembly is approximately 469.5 Mb, with a contig N50 length of 432,384 bp, representing a 12.6-fold improvement. We further annotated 237.3 Mb of repetitive sequence and 30,565 consensus protein coding genes with combined evidence. Phylogenetic analysis showed that S. grosvenorii diverged from members of the Cucurbitaceae family approximately 40.9 million years ago. With comprehensive transcriptomic analysis and differential expression testing, we identified 4,606 up-regulated genes in the early fruit compared to the leaf, a number of which were linked to metabolic pathways regulating fruit development and ripening. The availability of this new monk fruit genome assembly, as well as the annotations, will facilitate the discovery of new functional genes and the genetic improvement of monk fruit.


September 22, 2019

Transcriptional regulation of cysteine and methionine metabolism in Lactobacillus paracasei FAM18149.

Lactobacillus paracasei is common in the non-starter lactic acid bacteria (LAB) community of raw milk cheeses. This species can significantly contribute to flavor formation through amino acid metabolism. In this study, the DNA and RNA of L. paracasei FAM18149 were sequenced using next-generation sequencing technologies to reconstruct the metabolism of the sulfur-containing amino acids cysteine and methionine. Twenty-three genes were found to be involved in cysteine biosynthesis, the conversion of cysteine to methionine and vice versa, the S-adenosylmethionine recycling pathway, and the transport of sulfur-containing amino acids. Additionally, six methionine-specific T-boxes and one cysteine-specific T-box were found. Five of these were located upstream of genes encoding transporter functions. RNA-seq analysis and reverse-transcription quantitative polymerase chain reaction assays showed that expression of genes located downstream of these T-boxes was affected by the absence of either cysteine or methionine. Remarkably, the cysK2-ctl1-cysE2 operon, which is associated with the methionine-to-cysteine conversion and is upregulated in the absence of cysteine, showed high read coverage in the 5′-untranslated region and an antisense-RNA in the 3′-untranslated region. This indicates that this operon is regulated by the combination of cis- and antisense-mediated regulation mechanisms. The results of this study may help in the selection of L. paracasei strains to control sulfuric flavor formation in cheese.


September 22, 2019

HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads.

Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes owing to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverage. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Our method leverages the long-range information of sequencing reads, which makes it possible to obtain assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. HapCHAT is available at http://hapchat.algolab.eu under the GNU Public License (GPL).
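The idea of counting error corrections can be made concrete with a small Python sketch. It is not HapCHAT's algorithm: it only scores a given split of reads into two haplotypes by the number of allele calls that would have to be flipped to make each group internally consistent (often called the minimum error correction cost), and then finds the cheapest split by brute force. The read data, the dictionary encoding (variant site to allele 0 or 1) and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

Read = dict[int, int]  # variant site index -> observed allele (0 or 1)


def mec_cost(reads: list[Read], assignment: tuple[int, ...]) -> int:
    """Count the allele flips needed so that all reads assigned to the
    same haplotype agree at every variant site they cover."""
    cost = 0
    for haplotype in (0, 1):
        per_site: dict[int, Counter] = {}
        for read, label in zip(reads, assignment):
            if label != haplotype:
                continue
            for site, allele in read.items():
                per_site.setdefault(site, Counter())[allele] += 1
        # At each site the minority allele calls are treated as errors.
        cost += sum(min(c[0], c[1]) for c in per_site.values())
    return cost


def best_bipartition(reads: list[Read]) -> tuple[int, tuple[int, ...]]:
    """Exhaustive search over all splits; exponential, toy inputs only."""
    candidates = product((0, 1), repeat=len(reads))
    return min((mec_cost(reads, a), a) for a in candidates)


# Toy reads (assumed data); the last read carries one suspicious call.
reads = [
    {0: 0, 1: 0, 2: 1},
    {0: 0, 1: 0, 2: 1, 3: 0},
    {1: 1, 2: 0, 3: 1},
    {0: 1, 1: 1, 2: 0},
    {0: 0, 1: 1, 2: 1},
]
print(mec_cost(reads, (0, 0, 1, 1, 0)))  # 1
print(best_bipartition(reads)[0])        # 1
```

The exhaustive search doubles in cost with every added read, which is why practical methods rely on dynamic programming over variant sites and why high coverage is a computational challenge.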


September 22, 2019

Characterization and high-quality draft genome sequence of Herbivorax saccincola A7, an anaerobic, alkaliphilic, thermophilic, cellulolytic, and xylanolytic bacterium.

An anaerobic, cellulolytic-xylanolytic bacterium, designated strain A7, was isolated from a cellulose-degrading bacterial community inhabiting bovine manure compost on Ishigaki Island, Japan, by enrichment culture using unpretreated corn stover as the sole carbon source. The strain was Gram-positive, non-endospore forming, non-motile, and formed orange colonies on solid medium. Strain A7 was identified as Herbivorax saccincola by DNA-DNA hybridization, and phylogenetic analysis based on 16S rRNA gene sequences showed that it was closely related to H. saccincola GGR1 (= DSM 101079T). H. saccincola A7 (= JCM 31827=DSM 104321) had quite similar phenotypic characteristics to those of strain GGR1. However, the optimum growth of A7 was at alkaline pH (9.0) and 55°C, compared to pH 7.0 at 60°C for GGR1, and the fatty acid profile of A7 contained 1.7-times more C17:0 iso than GGR1. The draft genome sequence revealed that H. saccincola A7 possessed a cellulosome-like extracellular macromolecular complex, which has also been found for Clostridium thermocellum and C. clariflavum. H. saccincola A7 contained more glycoside hydrolases (GHs) belonging to GH families-11 and -2, and more diversity of xylanolytic enzymes, than C. thermocellum and C. clariflavum. H. saccincola A7 could grow on xylan because it encoded essential genes for xylose metabolism, such as a xylose transporter, xylose isomerase, xylulokinase, and ribulose-phosphate 3-epimerase, which are absent from C. thermocellum. These results indicated that H. saccincola A7 has great potential as a microorganism that can effectively degrade lignocellulosic biomass. Copyright © 2018 Elsevier GmbH. All rights reserved.


September 22, 2019

RAD sequencing and a hybrid Antarctic fur seal genome assembly reveal rapidly decaying linkage disequilibrium, global population structure and evidence for inbreeding.

Recent advances in high throughput sequencing have transformed the study of wild organisms by facilitating the generation of high quality genome assemblies and dense genetic marker datasets. These resources have the potential to significantly advance our understanding of diverse phenomena at the level of species, populations and individuals, ranging from patterns of synteny through rates of linkage disequilibrium (LD) decay and population structure to individual inbreeding. Consequently, we used PacBio sequencing to refine an existing Antarctic fur seal (Arctocephalus gazella) genome assembly and genotyped 83 individuals from six populations using restriction site associated DNA (RAD) sequencing. The resulting hybrid genome comprised 6,169 scaffolds with an N50 of 6.21 Mb and provided clear evidence for the conservation of large chromosomal segments between the fur seal and dog (Canis lupus familiaris). Focusing on the most extensively sampled population of South Georgia, we found that LD decayed rapidly, reaching the background level by around 400 kb, consistent with other vertebrates but at odds with the notion that fur seals experienced a strong historical bottleneck. We also found evidence for population structuring, with four main Antarctic island groups being resolved. Finally, appreciable variance in individual inbreeding could be detected, reflecting the strong polygyny and site fidelity of the species. Overall, our study contributes important resources for future genomic studies of fur seals and other pinnipeds while also providing a clear example of how high throughput sequencing can generate diverse biological insights at multiple levels of organization. Copyright © 2018 Humble et al.


September 22, 2019

Three substrains of the cyanobacterium Anabaena sp. PCC 7120 display divergence in genomic sequences and hetC function.

Anabaena sp. strain PCC 7120 is a model strain for molecular studies of cell differentiation and patterning in heterocyst-forming cyanobacteria. Subtle differences in heterocyst development have been noticed in different laboratories working on the same organism. In this study, 360 mutations, including single nucleotide polymorphisms (SNPs), small insertion/deletions (indels; 1 to 3 bp), fragment deletions, and transpositions, were identified in the genomes of three substrains. Heterogeneous/heterozygous bases were also identified due to the polyploid nature of the genome and the multicellular morphology but could be completely segregated when plated after filament fragmentation by sonication. hetC is a gene upregulated in developing cells during heterocyst formation in Anabaena sp. strain PCC 7120 and found in approximately half of other heterocyst-forming cyanobacteria. Inactivation of hetC in 3 substrains of Anabaena sp. PCC 7120 led to different phenotypes: the formation of heterocysts, differentiating cells that keep dividing, or the presence of both heterocysts and dividing differentiating cells. The expression of PhetZ-gfp in these hetC mutants also showed different patterns of green fluorescent protein (GFP) fluorescence. Thus, the function of hetC is influenced by the genomic background and epistasis and constitutes an example of evolution under way. IMPORTANCE Our knowledge about the molecular genetics of heterocyst formation, an important cell differentiation process for global N2 fixation, is mostly based on studies with Anabaena sp. strain PCC 7120. Here, we show that rapid microevolution is under way in this strain, leading to phenotypic variations for certain genes related to heterocyst development, such as hetC. This study provides an example of ongoing microevolution, marked by multiple heterogeneous/heterozygous single nucleotide polymorphisms (SNPs), in a multicellular multicopy-genome microorganism.
Copyright © 2018 American Society for Microbiology.


September 22, 2019

Investigating the central metabolism of Clostridium thermosuccinogenes.

Clostridium thermosuccinogenes is a thermophilic anaerobic bacterium able to convert various carbohydrates to succinate and acetate as main fermentation products. Genomes of the four publicly available strains have been sequenced, and the genome of the type strain has been closed. The annotated genomes were used to reconstruct the central metabolism, and enzyme assays were used to validate annotations and to determine cofactor specificity. The genes were identified for the pathways to all fermentation products, as well as for the Embden-Meyerhof-Parnas pathway and the pentose phosphate pathway. Notably, a candidate transaldolase was lacking, and transcriptomics during growth on glucose versus that on xylose did not provide any leads to potential transaldolase genes or alternative pathways connecting the C5 with the C3/C6 metabolism. Enzyme assays showed xylulokinase to prefer GTP over ATP, which could be of importance for engineering xylose utilization in related thermophilic species of industrial relevance. Furthermore, the gene responsible for malate dehydrogenase was identified via heterologous expression in Escherichia coli and subsequent assays with the cell extract, which has proven to be a simple and powerful method for the basal characterization of thermophilic enzymes. IMPORTANCE Running industrial fermentation processes at elevated temperatures has several advantages, including reduced cooling requirements, increased reaction rates and solubilities, and a possibility to perform simultaneous saccharification and fermentation of a pretreated biomass. Most studies with thermophiles so far have focused on bioethanol production. Clostridium thermosuccinogenes seems an attractive production organism for organic acids, succinic acid in particular, from lignocellulosic biomass-derived sugars. This study provides valuable insights into its central metabolism and GTP and PPi cofactor utilization. Copyright © 2018 American Society for Microbiology.


September 22, 2019

Tumor-specific mitochondrial DNA variants are rarely detected in cell-free DNA.

The use of blood-circulating cell-free DNA (cfDNA) as a “liquid biopsy” in oncology is being explored for its potential as a cancer biomarker. Mitochondria contain their own circular genomic entity (mitochondrial DNA, mtDNA), in up to thousands of copies per cell. The mutation rate of mtDNA is several orders of magnitude higher than that of the nuclear DNA. Tumor-specific variants have been identified in tumors along the entire mtDNA, and their number varies among and within tumors. The high mtDNA copy number per cell and the high mtDNA mutation rate make it worthwhile to explore the potential of tumor-specific cf-mtDNA variants as a cancer marker in the blood of cancer patients. We used single-molecule real-time (SMRT) sequencing to profile the entire mtDNA of 19 tissue specimens (primary tumor and/or metastatic sites, and tumor-adjacent normal tissue) and 9 cfDNA samples, originating from 8 cancer patients (5 breast, 3 colon). For each patient, tumor-specific mtDNA variants were detected and traced in cfDNA by SMRT sequencing and/or digital PCR to explore their feasibility as a cancer biomarker. As a reference, we measured other blood-circulating biomarkers for these patients, including driver mutations in nuclear-encoded cfDNA and cancer-antigen levels or circulating tumor cells. Four of the 24 (17%) tumor-specific mtDNA variants were detected in cfDNA, but at much lower allele frequencies than mutations in nuclear-encoded driver genes in the same samples. Also, extensive heterogeneity was observed among the heteroplasmic mtDNA variants present in an individual. We conclude that there is limited value in tracing tumor-specific mtDNA variants in blood-circulating cfDNA with the current methods available. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.


September 22, 2019

Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing.

Despite the key role of the human ribosome in protein biosynthesis, little is known about the extent of sequence variation in ribosomal DNA (rDNA) or its pre-rRNA and rRNA products. We recovered ribosomal DNA segments from a single human chromosome 21 using transformation-associated recombination (TAR) cloning in yeast. Accurate long-read sequencing of 13 isolates covering ~0.82 Mb of the chromosome 21 rDNA complement revealed substantial variation among tandem repeat rDNA copies, several palindromic structures and potential errors in the previous reference sequence. These clones revealed 101 variant positions in the 45S transcription unit and 235 in the intergenic spacer sequence. Approximately 60% of the 45S variants were confirmed in independent whole-genome or RNA-seq data, with 47 of these further observed in mature 18S/28S rRNA sequences. TAR cloning and long-read sequencing enabled the accurate reconstruction of multiple rDNA units and a new, high-quality 44 838 bp rDNA reference sequence, which we have annotated with variants detected from chromosome 21 of a single individual. The large number of variants observed reveals heterogeneity in human rDNA, opening up the possibility of corresponding variations in ribosome dynamics.


September 22, 2019

High-quality assembly of the reference genome for scarlet sage, Salvia splendens, an economically important ornamental plant.

Salvia splendens Ker-Gawler, scarlet or tropical sage, is a tender herbaceous perennial widely introduced and seen in public gardens all over the world. With few molecular resources, breeding is still restricted to traditional phenotypic selection, and the genetic mechanisms underlying phenotypic variation remain unknown. Hence, a high-quality reference genome will be very valuable for marker-assisted breeding, genome editing, and molecular genetics. We generated 66 Gb and 37 Gb of raw DNA sequences, respectively, from whole-genome sequencing of a largely homozygous scarlet sage inbred line using Pacific Biosciences (PacBio) single-molecule real-time and Illumina HiSeq sequencing platforms. The PacBio de novo assembly yielded a final genome with a scaffold N50 size of 3.12 Mb and a total length of 808 Mb. The repetitive sequences identified accounted for 57.52% of the genome sequence, and 54,008 protein-coding genes were predicted collectively with ab initio and homology-based gene prediction from the masked genome. The divergence time between S. splendens and Salvia miltiorrhiza was estimated at 28.21 million years ago (Mya). Moreover, 3,797 species-specific genes and 1,187 expanded gene families were identified for the scarlet sage genome. We provide the first genome sequence and gene annotation for the scarlet sage. The availability of these resources will be of great importance for further breeding strategies, genome editing, and comparative genomics among related species.


September 22, 2019

A graph-based approach to diploid genome assembly.

Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation: https://github.com/whatshap/whatshap. Supplementary data are available at Bioinformatics online.


September 22, 2019

Whole genome and transcriptome maps of the entirely black native Korean chicken breed Yeonsan Ogye.

Yeonsan Ogye (YO), an indigenous Korean chicken breed (Gallus gallus domesticus), has entirely black external features and internal organs. In this study, the draft genome of YO was assembled using a hybrid de novo assembly method that takes advantage of high-depth Illumina short reads (376.6X) and low-depth Pacific Biosciences (PacBio) long reads (9.7X). The contig and scaffold NG50s of the hybrid de novo assembly were 362.3 Kbp and 16.8 Mbp, respectively. The completeness (97.6%) of the draft genome (Ogye_1.1) was evaluated with single-copy orthologous genes using Benchmarking Universal Single-Copy Orthologs and found to be comparable to the current chicken reference genome (galGal5; 97.4%; contigs were assembled with high-depth PacBio long reads (50X) and scaffolded with short reads) and superior to other avian genomes (92%-93%; assembled with short read-only or hybrid methods). Compared to galGal4 and galGal5, the draft genome included 551 structural variations including the fibromelanosis (FM) locus duplication, related to hyperpigmentation. To comprehensively reconstruct transcriptome maps, RNA sequencing and reduced representation bisulfite sequencing data were analyzed from 20 tissues, including 4 black tissues (skin, shank, comb, and fascia). The maps included 15,766 protein-coding and 6,900 long noncoding RNA genes, many of which were tissue-specifically expressed and displayed tissue-specific DNA methylation patterns in the promoter regions. We expect that the resulting genome sequence and transcriptome maps will be valuable resources for studying domestic chicken breeds, including black-skinned chickens, as well as for understanding genomic differences between breeds and the evolution of hyperpigmented chickens and functional elements related to hyperpigmentation.
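For readers unfamiliar with the NG50 statistic reported above, the following Python sketch shows how it is conventionally defined: the threshold is half of an estimated genome size rather than half of the assembled bases. The lengths and genome sizes are toy values chosen for illustration, not figures from the study, and the function name `ng50` is hypothetical.

```python
from typing import Optional


def ng50(lengths: list[int], genome_size: int) -> Optional[int]:
    """Return the length L such that sequences of length >= L cover at
    least half of the estimated genome size, or None if the assembly is
    too small to reach that threshold."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return None


# Toy assembly (assumed data): three sequences totalling 100 bases.
toy_lengths = [50, 30, 20]
print(ng50(toy_lengths, 100))  # 50
print(ng50(toy_lengths, 150))  # 30
print(ng50(toy_lengths, 300))  # None
```

Using the estimated genome size as the denominator makes the value comparable between assemblies of the same species that differ in total assembled length.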

