April 21, 2020

Construction of JRG (Japanese reference genome) with single-molecule real-time sequencing

In recent genome analyses, population-specific reference panels have proven important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bp to ~10,000 bp in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.


April 21, 2020

A comparative evaluation of hybrid error correction methods for error-prone long reads.

Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies. However, their notoriously high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods. Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals.


April 21, 2020

Horizontal transfer of a retrotransposon between parasitic nematodes and the common shrew.

As the genomes of more metazoan species are sequenced, reports of horizontal transposon transfers (HTT) have increased. Our understanding of the mechanisms of such events is at an early stage. The close physical relationship between a parasite and its host could facilitate horizontal transfer. To date, two studies have identified horizontal transfer of RTEs, a class of retrotransposable elements, involving parasites: ticks might act as a vector for BovB between ruminants and squamates, and AviRTE was transferred between birds and parasitic nematodes. We searched for RTEs shared between nematode and mammalian genomes. Given their physical proximity, it was necessary to detect and remove sequence contamination from the genome datasets, which would otherwise distort the signal of horizontal transfer. We developed an approach that is based on reads instead of genomic sequences to reliably detect contamination. From comparison of 43 RTEs across 197 genomes, we identified a single putative case of horizontal transfer: we detected RTE1_Sar from Sorex araneus, the common shrew, in parasitic nematodes. From the taxonomic distribution and evolutionary analysis, we show that RTE1_Sar was horizontally transferred. We identified a new horizontal RTE transfer in host-parasite interactions, which suggests that it is not uncommon. Further, we present and provide the workflow for a read-based method to distinguish between contamination and horizontal transfer.


April 21, 2020

Retrotranspositional landscape of Asian rice revealed by 3000 genomes.

The recent release of genomic sequences for 3000 rice varieties provides access to the genetic diversity at species level for this crop. We take advantage of this resource to unravel some features of the retrotranspositional landscape of rice. We develop the software TRACKPOSON specifically for the detection of transposable element insertion polymorphisms (TIPs) from large datasets. We apply this tool to 32 families of retrotransposons and identify more than 50,000 TIPs in the 3000 rice genomes. Most polymorphisms are found at very low frequency, suggesting that they may have occurred recently in agro. A genome-wide association study shows that these activations in rice may be triggered by external stimuli, rather than by the alteration of genetic factors involved in transposable element silencing pathways. Finally, the TIPs dataset is used to trace the origin of rice domestication. Our results suggest that rice originated from three distinct domestication events.


April 21, 2020

Complete Sequences of Multiple-Drug Resistant IncHI2 ST3 Plasmids in Escherichia coli of Porcine Origin in Australia

IncHI2 ST3 plasmids are known carriers of multiple antimicrobial resistance genes. Complete plasmid sequences from multiple-drug resistant Escherichia coli circulating in Australian swine are, however, limited. Here we sequenced two related IncHI2 ST3 plasmids, pSDE-SvHI2 and pSDC-F2_12BHI2, from phylogenetically unrelated multiple-drug resistant Escherichia coli strains SvETEC (CC23:O157:H19) and F2_12B (ST93:O7:H4) from geographically disparate pig production operations in New South Wales, Australia. Unicycler was used to co-assemble short read (Illumina) and long read (PacBio SMRT) nucleotide sequence data. The plasmids encoded three drug-resistance loci, two of which carried class 1 integrons. One integron, hosting dfrA12-orfF-aadA2, was within a hybrid Tn1721/21, with the second residing within a copper/silver resistance transposon, comprising part of an atypical sul3-associated structure. The third resistance locus was flanked by IS15DI and encoded neomycin resistance (neoR). An oqx-encoding transposon (quinolone resistance), similar in structure to Tn6010, was identified only in pSDC-F2_12BHI2. Both plasmids showed high sequence identity to plasmid pSTM6-275, recently described in Salmonella enterica serotype 1,4,[5],12:i:- that has risen to prominence and become endemic in Australia. IncHI2 ST3 plasmids circulating in commensal and pathogenic E. coli from Australian swine belong to a lineage of plasmids often in association with sul3 and host multiple complex antibiotic and metal resistance structures, formed in part by IS26.


April 21, 2020

CAMISIM: simulating metagenomes and microbial communities.

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.


April 21, 2020

Identification of genes associated with ricinoleic acid accumulation in Hiptage benghalensis via transcriptome analysis.

Ricinoleic acid is a high-value hydroxy fatty acid with broad industrial applications. Hiptage benghalensis seed oil contains a high amount of ricinoleic acid (~80%) and represents an emerging source of this unusual fatty acid. However, the mechanism of ricinoleic acid accumulation in H. benghalensis is yet to be explored at the molecular level, which hampers the exploration of its potential in ricinoleic acid production. To explore the molecular mechanism of ricinoleic acid biosynthesis and regulation, H. benghalensis seeds were harvested at five developing stages (13, 16, 19, 22, and 25 days after pollination) for lipid analysis. The results revealed that the rapid accumulation of ricinoleic acid occurred at the early-mid-seed development stages (16-22 days after pollination). Subsequently, the gene transcription profiles of the developing seeds were characterized via a comprehensive transcriptome analysis with second-generation sequencing and single-molecule real-time sequencing. Differential expression patterns were identified in 12,555 transcripts, including 71 enzymes in lipid metabolic pathways, 246 putative transcription factors (TFs) and 124 long noncoding RNAs (lncRNAs). Twelve genes involved in diverse lipid metabolism pathways, including fatty acid biosynthesis and modification (hydroxylation), lipid traffic, triacylglycerol assembly, acyl editing and oil-body formation, displayed high expression levels and consistent expression patterns with ricinoleic acid accumulation in the developing seeds, suggesting their primary roles in ricinoleic acid production. Subsequent co-expression network analysis identified 57 TFs and 35 lncRNAs, which are putatively involved in the regulation of ricinoleic acid biosynthesis. The transcriptome data were further validated by analyzing the expression profiles of key enzyme-encoding genes, TFs and lncRNAs with quantitative real-time PCR. Finally, a network of genes associated with ricinoleic acid accumulation in H. benghalensis was established. This study was the first step toward the understanding of the molecular mechanisms of ricinoleic acid biosynthesis and oil accumulation in H. benghalensis seeds and identified a pool of novel genes regulating ricinoleic acid accumulation. The results set a foundation for developing H. benghalensis into a novel ricinoleic acid feedstock at the transcriptomic level and provided valuable candidate genes for improving ricinoleic acid production in other plants.


April 21, 2020

Complete genome sequence of bile-isolated Enterococcus avium strain 352

Background: Enterococcus avium is a Gram-positive pathogenic bacterium belonging to the family Enterococcaceae. E. avium can cause bacteremia, peritonitis, and intracranial suppurative infection. However, the mechanism of its pathogenesis and its adaptation to a special niche is still unclear. Results: In this study, the E. avium strain 352 was isolated from human bile and whole genome sequencing was performed. The genome of E. avium strain 352 consists of a circular 4,794,392 bp chromosome as well as an 87,705 bp plasmid. The GC content of the chromosome is 38.98%. There are 4905 and 99 protein coding sequences in the chromosome and the plasmid, respectively. The genome of the E. avium strain 352 contains a number of genes reported to be associated with bile adaptation, including bsh, sbcC, mutS, nifI, galU, and hupB. There are also several virulence-associated genes including esp, fss1, fss3, ecbA, bsh, lap, clpC, clpE, and clpP. Conclusions: This study demonstrates the presence of various virulence factors of the E. avium strain 352, which has the potential to cause infections. Moreover, the genes involved in bile adaptation might contribute to its ability to live in bile. Further comparative genomic studies would help to elucidate the evolution of pathogenesis of E. avium.


April 21, 2020

Whole-genome sequencing of Klebsiella pneumoniae isolates to track strain progression in a single patient with recurrent urinary tract infection.

Klebsiella pneumoniae is an important uropathogen that increasingly harbors broad-spectrum antibiotic resistance determinants. Evidence suggests that some same-strain recurrences in women with frequent urinary tract infections (UTIs) may emanate from a persistent intravesicular reservoir. Our objective was to analyze K. pneumoniae isolates collected over weeks from multiple body sites of a single patient with recurrent UTI in order to track ordered strain progression across body sites, as has been employed across patients in outbreak settings. Whole-genome sequencing of 26 K. pneumoniae isolates was performed utilizing the Illumina platform. PacBio sequencing was used to create a refined reference genome of the original urinary isolate (TOP52). Sequence variation was evaluated by comparing the 26 isolate sequences to the reference genome sequence. Whole-genome sequencing of the K. pneumoniae isolates from six different body sites of this patient with recurrent UTI demonstrated 100% chromosomal sequence identity of the isolates, with only a small P2 plasmid deletion in a minority of isolates. No single nucleotide variants were detected. The complete absence of single-nucleotide variants from 26 K. pneumoniae isolates from multiple body sites collected over weeks from a patient with recurrent UTI suggests that, unlike in an outbreak situation with strains collected from numerous patients, other methods are necessary to discern strain progression within a single host over a relatively short time frame.


April 21, 2020

Genome sequencing and comparison of five Tilletia species to identify candidate genes for the detection of regulated species infecting wheat

Tilletia species cause diseases on grass hosts with some causing bunt diseases on wheat (Triticum). Two of the four species infecting wheat have restricted distributions globally and are subject to quarantine regulations to prevent their spread to new areas. Tilletia indica causes Karnal bunt and is regulated by many countries while the non-regulated T. walkeri is morphologically similar and very closely related phylogenetically, but infects ryegrass (Lolium) and not wheat. Tilletia controversa causes dwarf bunt of wheat (DB) and is also regulated by some countries, while the closely related but non-regulated species, T. caries and T. laevis, both cause common bunt of wheat (CB). Historically, diagnostic methods have relied on cryptic morphology to differentiate these species in subsamples from grain shipments. Of the DNA-based methods published so far, most have focused on sequence variation among tested strains at a single gene locus. To facilitate the development of additional molecular assays for diagnostics, we generated whole genome data for multiple strains of the two regulated wheat pathogens and their closest relatives. Depending on the species, the genomes were assembled into 907 to 4633 scaffolds ranging from 24 Mb to 30 Mb with 7842 to 9952 gene models predicted. Phylogenomic analyses confirmed the placement of Tilletia in the Exobasidiomycetes and showed that T. indica and T. walkeri were in one clade whereas T. controversa, T. caries and T. laevis grouped in a separate clade. Single copy and species-specific genes were identified by orthologous group analysis. Unique species-specific genes were identified and evaluated as suitable markers to differentiate the quarantine and non-quarantine species. After further analyses and manual inspection, primers and probes for the optimum candidate genes were designed and tested in silico, for validation in future wet-lab studies.


April 21, 2020

Transcriptome, proteome and draft genome of Euglena gracilis.

Photosynthetic euglenids are major contributors to freshwater ecosystems. Euglena gracilis in particular has noted metabolic flexibility, reflected by an ability to thrive in a range of harsh environments. E. gracilis has been a popular model organism and of considerable biotechnological interest, but the absence of a gene catalogue has hampered both basic research and translational efforts. We report a detailed transcriptome and partial genome for E. gracilis Z1. The nuclear genome is estimated to be around 500 Mb in size, the transcriptome encodes over 36,000 proteins, and the genome possesses less than 1% coding sequence. Annotation of coding sequences indicates a highly sophisticated endomembrane system, RNA processing mechanisms and nuclear genome contributions from several photosynthetic lineages. Multiple gene families, including likely signal transduction components, have been massively expanded. Alterations in protein abundance are controlled post-transcriptionally between light and dark conditions, surprisingly similar to trypanosomatids. Our data provide evidence that a range of photosynthetic eukaryotes contributed to the Euglena nuclear genome, evidence in support of the ‘shopping bag’ hypothesis for plastid acquisition. We also suggest that euglenids possess unique regulatory mechanisms for achieving extreme adaptability, through mechanisms of paralog expansion and gene acquisition.


April 21, 2020

Genome sequence and transcriptomic profiles of a marine bacterium, Pseudoalteromonas agarivorans Hao 2018.

Members of the marine genus Pseudoalteromonas have attracted great interest because of their ability to produce a large number of biologically active substances. Here, we report the complete genome sequence of Pseudoalteromonas agarivorans Hao 2018, a strain isolated from an abalone breeding environment, using second-generation Illumina and third-generation PacBio sequencing technologies. Illumina sequencing offers high quality and short reads, while PacBio technology generates long reads. The scaffolds of the two platforms were assembled to yield a complete genome sequence that included two circular chromosomes and one circular plasmid. Transcriptomic data for Pseudoalteromonas were not available. We therefore collected comprehensive RNA-seq data using Illumina sequencing technology from a fermentation culture of P. agarivorans Hao 2018. Researchers studying the evolution, environmental adaptations and biotechnological applications of Pseudoalteromonas may benefit from our genomic and transcriptomic data to analyze the function and expression of genes of interest.


April 21, 2020

The golden death bacillus Chryseobacterium nematophagum is a novel matrix digesting pathogen of nematodes.

Nematodes represent important pathogens of humans and farmed animals and cause significant health and economic impacts. The control of nematodes is primarily carried out by applying a limited number of anthelmintic compounds, for which there is now widespread resistance being reported. There is a current unmet need to develop novel control measures including the identification and characterisation of natural pathogens of nematodes. Nematode killing bacilli were isolated from a rotten fruit in association with wild free-living nematodes. These bacteria belong to the Chryseobacterium genus (golden bacteria) and represent a new species named Chryseobacterium nematophagum. These bacilli are oxidase-positive, flexirubin-pigmented, gram-negative rods that exhibit gelatinase activity. Caenorhabditis elegans are attracted to and eat these bacteria. Within 3 h of ingestion, however, the bacilli have degraded the anterior pharyngeal chitinous lining and entered the body cavity, ultimately killing the host. Within 24 h, the internal contents of the worms are digested followed by the final digestion of the remaining cuticle over a 2-3-day period. These bacteria will also infect and kill bacterivorous free-living (L1-L3) stages of all tested parasitic nematodes including the important veterinary Trichostrongylids such as Haemonchus contortus and Ostertagia ostertagi. The bacteria exhibit potent collagen-digesting properties, and genome sequencing has identified novel metalloprotease, collagenase and chitinase enzymes representing potential virulence factors. Chryseobacterium nematophagum is a newly discovered pathogen of nematodes that rapidly kills environmental stages of a wide range of key nematode parasites. These bacilli exhibit a unique invasion process, entering the body via the anterior pharynx through the specific degradation of extracellular matrices. This bacterial pathogen represents a prospective biological control agent for important nematode parasites.


April 21, 2020

The complete genome and methylome of Helicobacter pylori hpNEAfrica strain HP14039

Background: Helicobacter pylori is a Gram-negative bacterium which mainly causes peptic ulcer disease in humans, but is also the predominant cause of stomach cancer. It has been coevolving with humans for 120,000 years and, according to multi-locus sequence typing (MLST), H. pylori can be classified into seven major population types, namely, hpAfrica1, hpAfrica2, hpNEAfrica, hpEastAsia, hpAsia2, hpEurope and hpSahul. Helicobacter pylori harbours a large number of restriction-modification (R-M) systems. The methyltransferase (MTase) unit plays a significant role in gene regulation and also possibly modulates pathogenicity. The diversity in MTases can act as a geomarker to correlate strains with their phylogeographic origins. This paper describes the complete genome sequence and methylome of the gastric pathogen H. pylori belonging to the population hpNEAfrica. Results: In this paper, we present the complete genome sequence and the methylome profile of H. pylori hpNEAfrica strain HP14039, isolated from a patient who was born in Somalia and was likely infected locally during early childhood prior to migration. The genome of HP14039 consists of 1,678,260 bp with 1574 coding genes and 38.7% GC content. The sequence analysis showed that this strain lacks the cag pathogenicity island. The vacA gene is of S2M2 type. We have also identified 15 methylation motifs, including WCANHNNNNTG and CTANNNNNNNTAYG that were not previously described. Conclusions: We have described the complete genome of H. pylori strain HP14039. The information regarding phylogeography, methylome and associated metadata would help the scientific community to study more about the hpNEAfrica population type.


April 21, 2020

External memory BWT and LCP computation for sequence collections with applications

Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory while making the best possible use of the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
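
To make the two data structures named in this abstract concrete, the sketch below computes the BWT and LCP array of a small sequence collection by naively sorting all suffixes in memory. It is not the external-memory merge algorithm the abstract describes; the function name `bwt_lcp` and the convention of printing every end-of-sequence terminator as `$` (ordered by sequence index) are assumptions made for illustration only.

```python
def bwt_lcp(seqs):
    """Naive in-memory BWT and LCP array for a collection of sequences.

    Illustrative only: sorts every suffix explicitly, which is far too
    slow and memory-hungry for real data sets.
    """
    entries = []
    for i, s in enumerate(seqs):
        for j in range(len(s) + 1):
            # Encode the suffix so that terminators sort before all
            # ordinary characters and are ordered by sequence index
            # (assumed convention).
            key = [(1, ord(c)) for c in s[j:]] + [(0, i)]
            # The BWT symbol is the character preceding the suffix;
            # a whole sequence is preceded by its terminator, shown as '$'.
            prev = s[j - 1] if j > 0 else "$"
            entries.append((key, prev))

    entries.sort(key=lambda e: e[0])
    bwt = "".join(prev for _, prev in entries)

    # LCP[k] is the length of the longest common prefix of the k-th and
    # (k-1)-th suffixes in sorted order; LCP[0] is 0 by convention.
    lcp = [0]
    for (a, _), (b, _) in zip(entries, entries[1:]):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp.append(k)
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = bwt_lcp(["AB", "AB"])
    print(bwt, lcp, "maxlcp =", max(lcp))
```

Here `max(lcp)` corresponds to the quantity maxlcp in the stated I/O bound, which is why collections with long shared substrings are the costly case for that bound.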

