Foodborne infections caused by lung flukes of the genus Paragonimus are a significant and widespread public health problem in tropical areas. Approximately 50 Paragonimus species have been reported to infect animals and humans, but Paragonimus westermani is responsible for the bulk of human disease. Despite their medical and economic importance, no genome sequence for any Paragonimus species is available. We sequenced and assembled the genome of P. westermani, which is among the largest of the known pathogen genomes with an estimated size of 1.1 Gb. A 922.8 Mb genome assembly was generated from Illumina and Pacific Biosciences (PacBio) sequence data, covering 84% of the estimated genome size. The genome has a high proportion (45%) of repeat-derived DNA, particularly of the long interspersed element and long terminal repeat subtypes, and the expansion of these elements may explain some of the large size. We predicted 12,852 protein coding genes, showing a high level of conservation with related trematode species. The majority of proteins (80%) had homologs in the human liver fluke Opisthorchis viverrini, with an average sequence identity of 64.1%. Assembly of the P. westermani mitochondrial genome from long PacBio reads resulted in a single high-quality circularized 20.6 kb contig. The contig harbored a 6.9 kb region of non-coding repetitive DNA comprised of three distinct repeat units. Our results suggest that the region is highly polymorphic in P. westermani, possibly even within single worm isolates. The generated assembly represents the first Paragonimus genome sequence and will facilitate future molecular studies of this important, but neglected, parasite group.
Group A Streptococcus (GAS) is a major cause of global infection-related morbidity and mortality. A modern controlled human infection model (CHIM) of GAS pharyngitis can accelerate vaccine development and pathogenesis research. A robust rationale for strain selection is central to meeting ethical, scientific, and regulatory requirements. Multifaceted characterization studies were done to compare a preferred candidate emm75 (M75) GAS strain to three other strains: an alternative candidate emm12 (M12) strain, an M1 strain used in 1970s pharyngitis CHIM studies (SS-496), and a representative (5448) of the globally disseminated M1T1 clone. A range of approaches were used to explore strain growth, adherence, invasion, delivery characteristics, short- and long-term viability, phylogeny, virulence factors, vaccine antigens, resistance to killing by human neutrophils, and lethality in a murine invasive model. The strains grew reliably in a medium without animal-derived components, were consistently transferred using a swab method simulating the CHIM protocol, remained viable at -80°C, and carried genes for most candidate vaccine antigens. Considering GAS molecular epidemiology, virulence factors, in vitro assays, and results from the murine model, the contemporary strains show a spectrum of virulence, with M75 appearing the least virulent and 5448 the most. The virulence profile of SS-496, used safely in 1970s CHIM studies, was similar to that of 5448 in the animal model and virulence gene carriage. The results of this multifaceted characterization confirm the M75 strain as an appropriate choice for initial deployment in the CHIM, with the aim of safely and successfully causing pharyngitis in healthy adult volunteers. IMPORTANCE GAS (Streptococcus pyogenes) is a leading global cause of infection-related morbidity and mortality. A modern CHIM of GAS pharyngitis could help to accelerate vaccine development and drive pathogenesis research. 
Challenge strain selection is critical to the safety and success of any CHIM and especially so for an organism such as GAS, with its wide strain diversity and potential to cause severe life-threatening acute infections (e.g., toxic shock syndrome and necrotizing fasciitis) and postinfectious complications (e.g., acute rheumatic fever, rheumatic heart disease, and acute poststreptococcal glomerulonephritis). In this paper, we outline the rationale for selecting an emm75 strain for initial use in a GAS pharyngitis CHIM in healthy adult volunteers, drawing on the findings of a broad characterization effort spanning molecular epidemiology, in vitro assays, whole-genome sequencing, and animal model studies. Copyright © 2019 Osowicki et al.
Sucrose accumulation and decreased photosynthesis are early symptoms of yellow canopy syndrome (YCS) in sugarcane (Saccharum spp.), and precede the visual yellowing of the leaves. To investigate broad-scale gene expression changes during YCS-onset, transcriptome analyses coupled to metabolome analyses were performed. Across leaf tissues, the greatest number of differentially expressed genes related to the chloroplast, and the metabolic processes relating to nitrogen and carbohydrates. Five genes represented 90% of the TPM (Transcripts Per Million) associated with the downregulation of transcription during YCS-onset, which included PSII D1 (PsbA). This differential expression was consistent with a feedback regulatory effect upon photosynthesis. Broad-scale gene expression analyses did not reveal a cause for leaf sugar accumulation during YCS-onset. Interestingly, the midrib showed the greatest accumulation of sugars, followed by symptomatic lamina. To investigate if phloem loading/reloading may be compromised on a gene expression level – to lead to leaf sucrose accumulation – sucrose transport-related proteins of SWEETs, Sucrose Transporters (SUTs), H+-ATPases and H+-pyrophosphatases (H+-PPases) were characterised from a sugarcane transcriptome and expression analysed. Two clusters of Type I H+-PPases, with one upregulated and the other downregulated, were evident. Although less pronounced, a similar pattern of change was observed for the H+-ATPases. The disaccharide transporting SWEETs were downregulated after visual symptoms were present, and a monosaccharide transporting SWEET upregulated preceding, as well as after, symptom development. SUT gene expression was the least responsive to YCS development. The results are consistent with a reduction of photoassimilate movement through the phloem leading to sucrose build-up in the leaf.
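The TPM figure quoted in the abstract above can be illustrated with a minimal sketch. It assumes the usual definition of TPM (read counts divided by transcript length, then rescaled so the values sum to one million); the function names `tpm` and `top_share` and the toy numbers are hypothetical and are not taken from the study.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to Transcripts Per Million (TPM).

    counts     -- reads assigned to each transcript (toy values here)
    lengths_kb -- transcript lengths in kilobases, same order as counts
    """
    # Length-normalise first, so long transcripts are not over-weighted.
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Rescale so that the values sum to one million.
    return [r / total * 1e6 for r in rates]


def top_share(values, k):
    """Fraction of the summed values contributed by the k largest entries."""
    ordered = sorted(values, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical example: three transcripts with equal per-kilobase coverage.
example = tpm([10, 20, 30], [1.0, 2.0, 3.0])
```

A statement such as "five genes represented 90% of the TPM" corresponds to `top_share(values, 5)` being about 0.9 for the relevant set of transcripts.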
SMRT sequencing reveals differential patterns of methylation in two O111:H- STEC isolates from a hemolytic uremic syndrome outbreak in Australia.
In 1995 a severe haemolytic-uremic syndrome (HUS) outbreak in Adelaide occurred. A recent genomic analysis of Shiga toxigenic Escherichia coli (STEC) O111:H- strains 95JB1 and 95NR1 from this outbreak found that the more virulent isolate, 95NR1, harboured two additional copies of the Shiga toxin 2 (Stx2) genes encoded within prophage regions. The structure of the Stx2-converting prophages could not be fully resolved using short-read sequence data alone and it was not clear if there were other genomic differences between 95JB1 and 95NR1. In this study we have used Pacific Biosciences (PacBio) single molecule real-time (SMRT) sequencing to characterise the genome and methylome of 95JB1 and 95NR1. We completely resolved the structure of all prophages including two, tandemly inserted, Stx2-converting prophages in 95NR1 that were absent from 95JB1. Furthermore we defined all insertion sequences and found an additional IS1203 element in the chromosome of 95JB1. Our analysis of the methylome of 95NR1 and 95JB1 identified hemi-methylation of a novel motif (5′-CTGCm6AG-3′) in more than 4000 sites in the 95NR1 genome. These sites were entirely unmethylated in the 95JB1 genome, and included at least 177 potential promoter regions that could contribute to regulatory differences between the strains. IS1203 mediated deactivation of a novel type IIG methyltransferase in 95JB1 is the likely cause of the observed differential patterns of methylation between 95NR1 and 95JB1. This study demonstrates the capability of PacBio SMRT sequencing to resolve complex prophage regions and reveal the genetic and epigenetic heterogeneity within a clonal population of bacteria.
Normalization of cDNA is widely used to improve the coverage of rare transcripts in analysis of transcriptomes employing next-generation sequencing. Recently, long-read technology has been emerging as a powerful tool for sequencing and construction of transcriptomes, especially for complex genomes containing highly similar transcripts and transcript-spliced isoforms. Here, we analyzed the transcriptome of sugarcane, a plant with a highly polyploid genome, by PacBio isoform sequencing (Iso-Seq) of two different cDNA library preparations, with and without a normalization step. The results demonstrated that, while the two libraries included many of the same transcripts, many longer transcripts were removed and many new, generally shorter transcripts were detected by normalization. For the same input cDNA and the same data yield, the normalized library recovered more total transcript isoforms, predicted gene families and orthologous groups, resulting in a higher representation of the sugarcane transcriptome, compared to the non-normalized library. The non-normalized library, on the other hand, included a wider transcript length range with more long transcripts above ~1.25 kb, more transcript isoforms per gene family and more gene ontology terms per transcript. A large proportion of the unique transcripts, comprising ~52% of the normalized library, were expressed at a lower level than the unique transcripts from the non-normalized library, across the three tissue types tested (leaf, stalk and root). About 83% of the total 5,348 predicted long noncoding transcripts were derived from the normalized library, of which ~80% were derived from the lowly expressed fraction. Functional annotation of the unique transcripts suggested that each library enriched different functional transcript fractions. This demonstrated the complementarity of the two approaches in obtaining a complete transcriptome of a complex genome at the sequencing depth used in this study.
Traditionally derived from fossil fuels, biological production of propionic acid has recently gained interest. Propionibacterium species produce propionic acid as their main fermentation product. Production of other organic acids reduces propionic acid yield and productivity, pointing to by-products gene-knockout strategies as a logical solution to increase yield. However, removing by-product formation has seen limited success due to our inability to genetically engineer the best producing strains (i.e. Propionibacterium acidipropionici). To overcome this limitation, random mutagenesis continues to be the best path towards improving strains for biological propionic acid production. Recent advances in next generation sequencing opened new avenues to understand improved strains. In this work, we use genome shuffling on two wild type strains to generate a better propionic acid producing strain. Using next generation sequencing, we mapped the genomic changes leading to the improved phenotype. The best strain produced 25% more propionic acid than the wild type strain. Sequencing of the strains showed that genomic changes were restricted to single point mutations and gene duplications in well-conserved regions in the genomes. Such results confirm the involvement of gene conversion in genome shuffling as opposed to long genomic insertions. © 2016 The Authors. Biotechnology Journal published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Long-read sequencing of the coffee bean transcriptome reveals the diversity of full-length transcripts.
Polyploidization contributes to the complexity of gene expression, resulting in numerous related but different transcripts. This study explored the transcriptome diversity and complexity of the tetraploid Arabica coffee (Coffea arabica) bean. Long-read sequencing (LRS) by PacBio isoform sequencing (Iso-Seq) was used to obtain full-length transcripts without the difficulty and uncertainty of assembly required for reads from short-read technologies. The tetraploid transcriptome was annotated and compared with data from the sub-genome progenitors. Caffeine and sucrose genes were targeted for case analysis. An isoform-level tetraploid coffee bean reference transcriptome with 95 995 distinct transcripts (average 3236 bp) was obtained. A total of 88 715 sequences (92.42%) were annotated with BLASTx against NCBI non-redundant plant proteins, including 34 719 high-quality annotations. Further BLASTn analysis against NCBI non-redundant nucleotide sequences, Coffea canephora coding sequences with UTR, C. arabica ESTs, and Rfam resulted in 1213 sequences without hits, which were potential novel genes in coffee. Longer UTRs were captured, especially 5′ UTRs, facilitating the identification of upstream open reading frames. The LRS also revealed more and longer transcript variants in key caffeine and sucrose metabolism genes from this polyploid genome. Long sequences (>10 kb) were poorly annotated. LRS technology reveals the limitations of previous studies. It provides an important tool to produce a reference transcriptome including more of the diversity of full-length transcripts to help understand the biology and support the genetic improvement of polyploid species such as coffee. © The Authors 2017. Published by Oxford University Press.
A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing.
Despite the economic importance of sugarcane in sugar and bioenergy production, there is not yet a reference genome available. Most of the sugarcane transcriptomic studies have been based on Saccharum officinarum gene indices (SoGI), expressed sequence tags (ESTs) and de novo assembled transcript contigs from short-reads; hence knowledge of the sugarcane transcriptome is limited in relation to transcript length and number of transcript isoforms. The sugarcane transcriptome was sequenced using PacBio isoform sequencing (Iso-Seq) of a pooled RNA sample derived from leaf, internode and root tissues, of different developmental stages, from 22 varieties, to explore the potential for capturing full-length transcript isoforms. A total of 107,598 unique transcript isoforms were obtained, representing about 71% of the total number of predicted sugarcane genes. The majority of this dataset (92%) matched the plant protein database, while just over 2% was novel transcripts, and over 2% was putative long non-coding RNAs. About 56% and 23% of total sequences were annotated against the gene ontology and KEGG pathway databases, respectively. Comparison with de novo contigs from Illumina RNA-Sequencing (RNA-Seq) of the internode samples from the same experiment and public databases showed that the Iso-Seq method recovered more full-length transcript isoforms, had a higher N50 and average length of largest 1,000 proteins; whereas a greater representation of the gene content and RNA diversity was captured in RNA-Seq. Only 62% of PacBio transcript isoforms matched 67% of de novo contigs, while the non-matched proportions were attributed to the inclusion of leaf/root tissues and the normalization in PacBio, and the representation of more gene content and RNA classes in the de novo assembly, respectively.
About 69% of PacBio transcript isoforms and 41% of de novo contigs aligned with the sorghum genome, indicating the high conservation of orthologs in the genic regions of the two genomes. The transcriptome dataset should contribute to improved sugarcane gene models and sugarcane protein predictions; and will serve as a reference database for analysis of transcript expression in sugarcane.
About 64% of the total aboveground biomass in sugarcane production is from the culm, of which ~90% is present in fiber and sugars. Understanding the transcriptome in the sugarcane culm, and the transcripts that are associated with the accumulation of the sugar and fiber components, would facilitate the modification of biomass composition for enhanced biofuel and biomaterial production. The Sugarcane Iso-Seq Transcriptome (SUGIT) database was used as a reference for RNA-Seq analysis of variation in gene expression between young and mature tissues, and between 10 genotypes with varying fiber content. Global expression analysis suggests that each genotype displayed a unique expression pattern, possibly due to different chromosome combinations and maturation amongst these genotypes. Apart from direct sugar- and fiber-related transcripts, the differentially expressed (DE) transcripts in this study belonged to various supporting pathways that are not obviously involved in the accumulation of these major biomass components. The analysis revealed 1,649 DE transcripts between the young and mature tissues, while 555 DE transcripts were found between the low and high fiber genotypes. Of these, 151 and 23 transcripts, respectively, were directly involved in sugar and fiber accumulation. Most of the transcripts identified were up-regulated in the young tissues (2 to 22-fold, FDR adjusted p-value <0.05), which could be explained by the more active metabolism in the young tissues compared to the mature tissues in the sugarcane culm. The results of the analysis of the contrasting genotypes suggest that, due to the large number of genes contributing to these traits, some of the critical DE transcripts could display less than 2-fold differences in expression and might not be easily identified.
However, this transcript profiling analysis identified full-length candidate transcripts and pathways that were likely to determine the differences in sugar and fiber accumulation between tissue types and contrasting genotypes.
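The differential expression thresholds quoted in the abstract above (at least 2-fold change, FDR-adjusted p-value below 0.05) can be illustrated with a small sketch. The abstract does not say which FDR procedure was used; the Benjamini-Hochberg adjustment is assumed here purely for illustration, and the function names and toy values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method, see lead-in)."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(fold_changes, pvalues, min_fold=2.0, max_fdr=0.05):
    """Indices of transcripts passing both the fold-change and FDR cut-offs.

    fold_changes are ratios (not log2 values); a change in either
    direction of at least min_fold is accepted.
    """
    adjusted = bh_adjust(pvalues)
    hits = []
    for i, (fc, q) in enumerate(zip(fold_changes, adjusted)):
        magnitude = max(fc, 1.0 / fc) if fc > 0 else float("inf")
        if magnitude >= min_fold and q < max_fdr:
            hits.append(i)
    return hits


# Toy input: four transcripts with made-up fold changes and raw p-values.
selected = call_de([4.0, 1.5, 0.25, 3.0], [0.01, 0.04, 0.03, 0.005])
```

With a hard 2-fold cut-off of this kind, a transcript with a real but modest change (the second toy entry, 1.5-fold) is discarded even when its adjusted p-value is significant, which is the limitation the abstract notes for traits controlled by many genes.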
Despite the significance of chicken as a model organism, our understanding of the chicken transcriptome is limited compared to human. This issue is common to all non-human vertebrate annotations due to the difficulty in transcript identification from short read RNAseq data. While previous studies have used single molecule long read sequencing for transcript discovery, they did not perform RNA normalization and 5′-cap selection, which may have resulted in lower transcriptome coverage and truncated transcript sequences. We sequenced normalised chicken brain and embryo RNA libraries with Pacific Biosciences Iso-Seq. 5′-cap selection was performed on the embryo library to provide methodological comparison. From these Iso-Seq sequencing projects, we have identified 60 k transcripts and 29 k genes within the chicken transcriptome. Of these, more than 20 k are novel lncRNA transcripts with ~3 k classified as sense exonic overlapping lncRNA, which is a class that is underrepresented in many vertebrate annotations. The relative proportion of alternative transcription events revealed striking similarities between the chicken and human transcriptomes while also providing explanations for previously observed genomic differences. Our results indicate that the chicken transcriptome is similar in complexity to the human transcriptome, and provide insights into other vertebrate biology. Our methodology demonstrates the potential of Iso-Seq sequencing to rapidly expand our knowledge of transcriptomics.
Arabica coffee (Coffea arabica) has a small gene pool limiting genetic improvement. Selection for caffeine content within this gene pool would be assisted by identification of the genes controlling this important trait. Sequencing of DNA bulks from 18 genotypes with extreme high- or low-caffeine content from a population of 232 genotypes was used to identify linked polymorphisms. To obtain a reference genome, a whole genome assembly of arabica coffee (variety K7) was achieved by sequencing using short-read (Illumina) and long-read (PacBio) technology. Assembly was performed using a range of assembly tools resulting in 76 409 scaffolds with a scaffold N50 of 54 544 bp and a total scaffold length of 1448 Mb. Validation of the genome assembly using different tools showed high completeness of the genome. More than 99% of transcriptome sequences mapped to the C. arabica draft genome, and 89% of BUSCOs were present. Annotation of the assembled genome using AUGUSTUS yielded 99 829 gene models. Using the draft arabica genome as reference in mapping and variant calling allowed the detection of 1444 nonsynonymous single nucleotide polymorphisms (SNPs) associated with caffeine content. Based on Kyoto Encyclopedia of Genes and Genomes pathway-based analysis, 65 caffeine-associated SNPs were discovered, among which 11 SNPs were associated with genes encoding enzymes involved in the conversion of substrates, which participate in the caffeine biosynthesis pathways. This analysis demonstrated the complex genetic control of this key trait in coffee. © 2018 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
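The scaffold N50 reported in the abstract above is a standard assembly contiguity statistic: the length L such that scaffolds of length L or longer together contain at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are hypothetical and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Accumulate from the longest scaffold down until half the
    # assembly is covered; the scaffold that crosses the line is the N50.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Toy scaffold lengths in bp (made up for illustration).
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Because N50 depends only on the length distribution, it describes contiguity and not correctness or completeness, which is why the abstract reports transcript mapping rates and BUSCO presence alongside it.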
Productivity of ruminant livestock depends on the rumen microbiota, which ferment indigestible plant polysaccharides into nutrients used for growth. Understanding the functions carried out by the rumen microbiota is important for reducing greenhouse gas production by ruminants and for developing biofuels from lignocellulose. We present 410 cultured bacteria and archaea, together with their reference genomes, representing every cultivated rumen-associated archaeal and bacterial family. We evaluate polysaccharide degradation, short-chain fatty acid production and methanogenesis pathways, and assign specific taxa to functions. A total of 336 organisms were present in available rumen metagenomic data sets, and 134 were present in human gut microbiome data sets. Comparison with the human microbiome revealed rumen-specific enrichment for genes encoding de novo synthesis of vitamin B12, ongoing evolution by gene loss and potential vertical inheritance of the rumen microbiome based on underrepresentation of markers of environmental stress. We estimate that our Hungate genome resource represents ~75% of the genus-level bacterial and archaeal taxa present in the rumen.
The human transcriptome is so large, diverse, and dynamic that, even after a decade of investigation by RNA sequencing (RNA-seq), we have yet to resolve its true dimensions. RNA-seq suffers from an expression-dependent bias that impedes characterization of low-abundance transcripts. We performed targeted single-molecule and short-read RNA-seq to survey the transcriptional landscape of a single human chromosome (Hsa21) at unprecedented resolution. Our analysis reaches the lower limits of the transcriptome, identifying a fundamental distinction between protein-coding and noncoding gene content: almost every noncoding exon undergoes alternative splicing, producing a seemingly limitless variety of isoforms. Analysis of syntenic regions of the mouse genome shows that few noncoding exons are shared between human and mouse, yet human splicing profiles are recapitulated on Hsa21 in mouse cells, indicative of regulation by a deeply conserved splicing code. We propose that noncoding exons are functionally modular, with alternative splicing generating an enormous repertoire of potential regulatory RNAs and a rich transcriptional reservoir for gene evolution. Crown Copyright © 2017. Published by Elsevier Inc. All rights reserved.
De novo assembly and characterizing of the culm-derived meta-transcriptome from the polyploid sugarcane genome based on coding transcripts
Sugarcane biomass has been used for sugar, bioenergy and biomaterial production. The majority of the sugarcane biomass comes from the culm, which makes it important to understand the genetic control of biomass production in this part of the plant. A meta-transcriptome of the culm was obtained in an earlier study by using about one billion paired-end (150 bp) reads of deep RNA sequencing of samples from 20 diverse sugarcane genotypes and combining de novo assemblies from different assemblers and different settings. Although many genes could be recovered, this resulted in a large combined assembly which created the need for clustering to reduce transcript redundancy while maintaining gene content. Here, we present a comprehensive analysis of the effect of different assembly settings and clustering methods on de novo assembly, annotation and transcript profiling focusing especially on the coding transcripts from the highly polyploid sugarcane genome. The new coding sequence-based transcript clustering resulted in a better representation of transcripts compared to the earlier approach, having 121,987 contigs, which included 78,052 main and 43,935 alternative transcripts. About 73%, 67%, 61% and 10% of the transcriptome was annotated against the NCBI NR protein database, GO terms, orthologous groups and KEGG orthologies, respectively. Using this set for a differential gene expression analysis between the young and mature sugarcane culm tissues, a total of 822 transcripts were found to be differentially expressed, including key transcripts involved in sugar/fiber accumulation in sugarcane. In the context of the lack of a whole genome sequence for sugarcane, the availability of a well annotated culm-derived meta-transcriptome through deep sequencing provides useful information on coding genes specific to the sugarcane culm and will certainly contribute to understanding the process of carbon partitioning, and biomass accumulation in the sugarcane culm.
Over the past decade, high-throughput short-read 16S rRNA gene amplicon sequencing has eclipsed clone-dependent long-read Sanger sequencing for microbial community profiling. The transition to new technologies has provided more quantitative information at the expense of taxonomic resolution with implications for inferring metabolic traits in various ecosystems. We applied single-molecule real-time sequencing for microbial community profiling, generating full-length 16S rRNA gene sequences at high throughput, which we propose to name PhyloTags. We benchmarked and validated this approach using a defined microbial community. When further applied to samples from the water column of meromictic Sakinaw Lake, we show that while community structures at the phylum level are comparable between PhyloTags and Illumina V4 16S rRNA gene sequences (iTags), variance increases with community complexity at greater water depths. PhyloTags moreover allowed less ambiguous classification. Last, a platform-independent comparison of PhyloTags and in silico generated partial 16S rRNA gene sequences demonstrated significant differences in community structure and phylogenetic resolution across multiple taxonomic levels, including a severe underestimation in the abundance of specific microbial genera involved in nitrogen and methane cycling across the Lake’s water column. Thus, PhyloTags provide a reliable adjunct or alternative to cost-effective iTags, enabling more accurate phylogenetic resolution of microbial communities and predictions on their metabolic potential.