Eukaryotes contain a diverse tapestry of specialized metabolites, many of which are of significant pharmaceutical and industrial importance to humans. Nevertheless, exploration of specialized metabolic pathways underlying specific chemical traits in nonmodel eukaryotic organisms has been technically challenging and has historically lagged behind that of the bacterial systems. Recent advances in genomics, metabolomics, phylogenomics, and synthetic biology now enable a new workflow for interrogating unknown specialized metabolic systems in nonmodel eukaryotic hosts with greater efficiency and mechanistic depth. This chapter delineates such a workflow by providing a collection of state-of-the-art approaches and tools, ranging from multiomics-guided candidate gene identification to in vitro and in vivo functional and structural characterization of specialized metabolic enzymes. As already demonstrated by several recent studies, this new workflow opens up a gateway into the largely untapped world of natural product biochemistry in eukaryotes. © 2016 Elsevier Inc. All rights reserved.
PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.
High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), in a single-threading running manner. PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.
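To make the idea of alignment-free k-mer features concrete, the following is a minimal sketch of how a transcript could be turned into a fixed-length vector of k-mer frequencies that a classifier such as an SVM might consume. It is a generic illustration, not PLEK's actual "improved" k-mer scheme: the function name `kmer_frequencies`, the choice of `k_max`, and the plain per-k normalization are all assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(seq, k_max=3):
    """Return a flat list of sliding-window k-mer frequencies for k = 1..k_max.

    Hypothetical helper for illustration; the normalization is a simple
    per-k relative frequency, not the weighting used by any published tool.
    """
    seq = seq.upper().replace("U", "T")
    features = []
    for k in range(1, k_max + 1):
        # Enumerate all 4**k possible k-mers in a fixed (lexicographic) order.
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            word = seq[i:i + k]
            if word in counts:  # skip windows containing N or other ambiguity codes
                counts[word] += 1
        total = sum(counts.values())
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features


vec = kmer_frequencies("ACGTACGT", k_max=2)
print(len(vec))  # 4 mononucleotide + 16 dinucleotide features
print(vec[:4])   # mononucleotide frequencies in A, C, G, T order
```

Because the vector length depends only on `k_max`, transcripts of any length map to the same feature space, which is what allows classification without a reference genome or alignment.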
Alternative RNA splicing is a known phenomenon, but we still do not have a complete catalog of isoforms that explain variability in the human transcriptome. We have made significant progress in developing methods to study variability of the transcriptome, but we are far from having a complete picture of the transcriptome. The initial methods to study gene expression were based on cloning of cDNAs and Sanger sequencing. The strategy was labor-intensive and expensive. With the development of microarrays, different methods based on exon arrays and tiling arrays provided valuable information about RNA expression. However, the microarray presented significant limitations. Most of the limitations became apparent by 2005, but it was not until 2008 that an alternative method to study the transcriptome was developed. RNA Sequencing using next-generation sequencing (RNA-Seq) quickly became the technology of choice for gene expression profiling. Recently, the precision and sensitivity of RNA-Seq have come into question, especially for transcriptome reconstruction. This chapter will describe a relatively new method, “Isoform Sequencing” (Iso-Seq). Iso-Seq was developed by Pacific Biosciences (PacBio), and it is capable of identifying new isoforms with extraordinary precision due to its long-read technology. The technique to create libraries is straightforward, and the PacBio RS II instrument generates the information in hours. The bioinformatics analysis is performed using the freely available SMRT® Portal software. The SMRT Portal is easy to use and capable of performing all the steps necessary to analyze the raw data and to generate high-quality full-length isoforms. For the universal acceptance of the Iso-Seq method, the capacity of the SMRT Cells needs to improve at least 10- to 100-fold to make the system affordable and attractive to users.
Genes in prokaryotic genomes are often arranged into clusters and co-transcribed into polycistronic RNAs. Isolated examples of polycistronic RNAs were also reported in some higher eukaryotes but their presence was generally considered rare. Here we developed a long-read sequencing strategy to identify polycistronic transcripts in several mushroom forming fungal species including Plicaturopsis crispa, Phanerochaete chrysosporium, Trametes versicolor, and Gloeophyllum trabeum. We found genome-wide prevalence of polycistronic transcription in these Agaricomycetes, involving up to 8% of the transcribed genes. Unlike polycistronic mRNAs in prokaryotes, these co-transcribed genes are also independently transcribed. We show that polycistronic transcription may interfere with expression of the downstream tandem gene. Further comparative genomic analysis indicates that polycistronic transcription is conserved among a wide range of mushroom forming fungi. In summary, our study revealed, for the first time, the genome prevalence of polycistronic transcription in a phylogenetic range of higher fungi. Furthermore, we systematically show that our long-read sequencing approach and combined bioinformatics pipeline is a generic powerful tool for precise characterization of complex transcriptomes that enables identification of mRNA isoforms not recovered via short-read assembly.
Translational regulation is mediated through the interaction between diffusible trans-factors and cis-elements residing within mRNA transcripts. In contrast to extensively studied transcriptional regulation, cis-regulation on translation remains underexplored. Using deep sequencing-based transcriptome and polysome profiling, we globally profiled allele-specific translational efficiency for the first time in an F1 hybrid mouse. Out of 7,156 genes with reliable quantification of both alleles, we found 1,008 (14.1%) exhibiting significant allelic divergence in translational efficiency. Systematic analysis of sequence features of the genes with biased allelic translation revealed that local RNA secondary structure surrounding the start codon and proximal out-of-frame upstream AUGs could affect translational efficiency. Finally, we observed that the cis-effect was quantitatively comparable between transcriptional and translational regulation. Such effects in the two regulatory processes were more frequently compensatory, suggesting that the regulation at the two levels could be coordinated in maintaining robustness of protein expression. © 2015 The Authors. Published under the terms of the CC BY 4.0 license.
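As a worked illustration of what "allelic divergence in translational efficiency" can mean numerically, the sketch below computes a log2 ratio of per-allele translational efficiency (TE) from read counts. It assumes TE is estimated as polysome-associated reads divided by total mRNA reads for the same allele; that definition, the function name `allelic_te_log2_ratio`, the pseudocount, and the example counts are assumptions for this example, and no significance testing is performed.

```python
import math


def allelic_te_log2_ratio(rna_a, rna_b, poly_a, poly_b, pseudocount=0.5):
    """log2 of (TE of allele A / TE of allele B), with TE = polysome / total mRNA.

    Hypothetical helper: the pseudocount only guards against zero counts and
    is not taken from the study.
    """
    te_a = (poly_a + pseudocount) / (rna_a + pseudocount)
    te_b = (poly_b + pseudocount) / (rna_b + pseudocount)
    return math.log2(te_a / te_b)


# Equal mRNA abundance, but allele A is twice as abundant on polysomes.
ratio = allelic_te_log2_ratio(100, 100, 200, 100, pseudocount=0)
print(ratio)

# Fraction of genes reported with significant allelic divergence in the abstract.
percent_divergent = round(1008 / 7156 * 100, 1)
print(percent_divergent)
```

A positive value indicates that allele A is translated more efficiently than allele B; a compensatory pattern would show the opposite sign to the allelic difference in mRNA abundance, `log2(rna_a / rna_b)`.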
Global RNA studies have become central to understanding biological processes, but methods such as microarrays and short-read sequencing are unable to describe an entire RNA molecule from 5′ to 3′ end. Here we use single-molecule long-read sequencing technology from Pacific Biosciences to sequence the polyadenylated RNA complement of a pooled set of 20 human organs and tissues without the need for fragmentation or amplification. We show that full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5′ ends. For longer RNA molecules more 5′ nucleotides are missing, but complete intron structures are often preserved. In total, we identify ~14,000 spliced GENCODE genes. High-confidence mappings are consistent with GENCODE annotations, but >10% of the alignments represent intron structures that were not previously annotated. As a group, transcripts mapping to unannotated regions have features of long, noncoding RNAs. Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.
Pre-mRNA splicing is accomplished by the spliceosome, a megadalton complex that assembles de novo on each intron. Because spliceosome assembly and catalysis occur cotranscriptionally, we hypothesized that introns are removed in the order of their transcription in genomes dominated by constitutive splicing. Remarkably little is known about splicing order and the regulatory potential of nascent transcript remodeling by splicing, due to the limitations of existing methods that focus on analysis of mature splicing products (mRNAs) rather than substrates and intermediates. Here, we overcome this obstacle through long-read RNA sequencing of nascent, multi-intron transcripts in the fission yeast Schizosaccharomyces pombe. Most multi-intron transcripts were fully spliced, consistent with rapid cotranscriptional splicing. However, an unexpectedly high proportion of transcripts were either fully spliced or fully unspliced, suggesting that splicing of any given intron is dependent on the splicing status of other introns in the transcript. Supporting this, mild inhibition of splicing by a temperature-sensitive mutation in prp2, the homolog of vertebrate U2AF65, increased the frequency of fully unspliced transcripts. Importantly, fully unspliced transcripts displayed transcriptional read-through at the polyA site and were degraded cotranscriptionally by the nuclear exosome. Finally, we show that cellular mRNA levels were reduced in genes with a high number of unspliced nascent transcripts during caffeine treatment, showing regulatory significance of cotranscriptional splicing. Therefore, overall splicing of individual nascent transcripts, 3′ end formation, and mRNA half-life depend on the splicing status of neighboring introns, suggesting crosstalk among spliceosomes and the polyA cleavage machinery during transcription elongation. © 2018 Herzel et al.; Published by Cold Spring Harbor Laboratory Press.
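The reasoning that an excess of "all-or-none" transcripts implies dependence between introns can be illustrated with a small calculation: compare the observed fraction of reads that are fully spliced or fully unspliced with the fraction expected if each intron were spliced independently at its own observed frequency. This is a sketch under stated assumptions, not the authors' analysis; the function name `all_or_none_fractions` and the toy read set are invented for illustration, and each read is assumed to cover the same set of two or more introns.

```python
def all_or_none_fractions(reads):
    """Return (observed, expected) fractions of all-or-none reads.

    reads: list of equal-length tuples, one per long read, where 1 means the
    intron is spliced and 0 means it is retained. The expected value assumes
    introns are spliced independently of one another.
    """
    n = len(reads)
    n_introns = len(reads[0])
    # Per-intron splicing frequency across all reads.
    p = [sum(r[i] for r in reads) / n for i in range(n_introns)]
    # A read is all-or-none when every intron has the same status.
    observed = sum(1 for r in reads if len(set(r)) == 1) / n
    exp_all_spliced = 1.0
    exp_all_unspliced = 1.0
    for p_i in p:
        exp_all_spliced *= p_i
        exp_all_unspliced *= 1.0 - p_i
    return observed, exp_all_spliced + exp_all_unspliced


# Toy data for a two-intron gene: 6 fully spliced, 3 fully unspliced, 1 mixed.
toy_reads = [(1, 1)] * 6 + [(0, 0)] * 3 + [(1, 0)]
observed, expected = all_or_none_fractions(toy_reads)
print(observed, expected)
```

When the observed fraction clearly exceeds the expected one, as in this toy data, the splicing status of one intron carries information about the others, which is the kind of pattern the abstract describes.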
Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis).
Moso bamboo (Phyllostachys edulis) represents one of the fastest-spreading plants in the world, due in part to its well-developed rhizome system. However, the post-transcriptional mechanism for the development of the rhizome system in bamboo has not been comprehensively studied. We therefore used a combination of single-molecule long-read sequencing technology and polyadenylation site sequencing (PAS-seq) to re-annotate the bamboo genome, and identify genome-wide alternative splicing (AS) and alternative polyadenylation (APA) in the rhizome system. In total, 145 522 mapped full-length non-chimeric (FLNC) reads were analyzed, resulting in the correction of 2241 mis-annotated genes and the identification of 8091 previously unannotated loci. Notably, more than 42 280 distinct splicing isoforms were derived from 128 667 intron-containing FLNC reads, including a large number of AS events associated with rhizome systems. In addition, we characterized 25 069 polyadenylation sites from 11 450 genes, 6311 of which have APA sites. Further analysis of intronic polyadenylation revealed that LTR/Gypsy and LTR/Copia were two major transposable elements within the intronic polyadenylation region. Furthermore, this study provided a quantitative atlas of poly(A) usage. Several hundred differential poly(A) sites in the rhizome-root system were identified. Taken together, these results suggest that post-transcriptional regulation may potentially have a vital role in the underground rhizome-root system. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.
The single molecule, real time (SMRT) sequencing technology of Pacific Biosciences enables the acquisition of transcripts from end to end due to its ability to produce extraordinarily long reads (>10 kb). This new method of transcriptome sequencing has been applied to several projects on humans and model organisms. However, the raw data from SMRT sequencing are of relatively low quality, with a random error rate of approximately 15%, for which error correction using next-generation sequencing (NGS) short reads is typically necessary. Few tools have been designed that apply a hybrid sequencing approach that combines NGS and SMRT data, and the most popular existing tool for error correction, LSC, has computing resource requirements that are too intensive for most laboratory and research groups. These shortcomings severely limit the application of SMRT long reads for transcriptome analysis. Here, we report an improved tool (LSCplus) for error correction with the LSC program as a reference. LSCplus overcomes the disadvantage of LSC’s time consumption and improves quality. Only 1/3-1/4 of the time and 1/20-1/25 of the error correction time is required using LSCplus compared with that required for using LSC. LSCplus is freely available at http://www.herbbol.org:8001/lscplus/. Sample calculations are provided illustrating the precision and efficiency of this method regarding error correction and isoform detection.
Alternative splicing increases the diversity of transcriptomes and proteomes in metazoans. The extent to which alternative splicing is active and functional in unicellular organisms is less understood. Here, we exploit a single-molecule long-read sequencing technique and develop an open-source software program called SpliceHunter to characterize the transcriptome in the meiosis of fission yeast. We reveal 14,353 alternative splicing events in 17,669 novel isoforms at different stages of meiosis, including antisense and read-through transcripts. Intron retention is the major type of alternative splicing, followed by alternate “intron in exon.” Seven hundred seventy novel transcription units are detected; 53 of the predicted proteins show homology in other species and form theoretical stable structures. We report the complexity of alternative splicing along isoforms, including 683 intra-molecularly co-associated intron pairs. We compare the dynamics of novel isoforms based on the number of supporting full-length reads with those of annotated isoforms and explore the translational capacity and quality of novel isoforms. The evaluation of these factors indicates that the majority of novel isoforms are unlikely to be both condition-specific and translatable but consistent with the possibility of biologically functional novel isoforms. Moreover, the co-option of these unusual transcripts into newly born genes seems likely. Together, the results of this study highlight the diversity and dynamics at the isoform level in the sexual development of fission yeast. © 2017 Kuang et al.; Published by Cold Spring Harbor Laboratory Press.
Analyses of alternative polyadenylation: from old school biochemistry to high-throughput technologies.
Alternations in usage of polyadenylation sites during transcription termination yield transcript isoforms from a gene. Recent findings of transcriptome-wide alternative polyadenylation (APA) as a molecular response to changes in biology position APA not only as a molecular event of early transcriptional termination but also as a cellular regulatory step affecting various biological pathways. With the development of high-throughput profiling technologies at a single nucleotide level and their applications targeted to the 3′-end of mRNAs, dynamics in the landscape of mRNA 3′-ends are measurable at a global scale. In this review, methods and technologies that have been adopted to study APA events are discussed. In addition, various bioinformatics algorithms for APA isoform analysis using publicly available RNA-seq datasets are introduced. [BMB Reports 2017; 50(4): 201-207].
This study was aimed at generating the full-length transcriptome of the flea beetle Agasicles hygrophila (Selman and Vogt) using single-molecule real-time (SMRT) sequencing. Four developmental stages of A. hygrophila, including eggs, larvae, pupae, and adults, were harvested for isolating total RNA. The mixed samples were used for SMRT sequencing to generate the full-length transcriptome. Based on the obtained transcriptome data, alternative splicing event, simple sequence repeat (SSR) analysis, coding sequence prediction, transcript functional annotation, and lncRNA prediction were performed. A total of 9.45 Gb of clean reads were generated, including 335,045 reads of insert (ROI) and 158,085 full-length non-chimeric (FLNC) reads. Transcript clustering analysis of FLNC reads identified 40,004 consensus isoforms, including 31,015 high-quality ones. After removing redundant reads, 28,982 transcripts were obtained. A total of 145 alternative splicing events were predicted. Additionally, 12,753 SSRs and 16,205 coding sequences were identified based on SSR analysis. Furthermore, 24,031 transcripts were annotated in eight functional databases, and 4,198 lncRNAs were predicted. This is the first study to perform SMRT sequencing of the full-length transcriptome of A. hygrophila. The obtained transcriptome may facilitate further exploration of the genetic data of A. hygrophila and uncover the interactions between this insect and the ecosystem.
Protein-coding genes in eukaryotes are transcribed by RNA polymerase II (Pol II) and introns are removed from pre-mRNA by the spliceosome. Understanding the time lag between Pol II progression and splicing could provide mechanistic insights into the regulation of gene expression. Here, we present two single-molecule nascent RNA sequencing methods that directly determine the progress of splicing catalysis as a function of Pol II position. Endogenous genes were analyzed on a global scale in budding yeast. We show that splicing is 50% complete when Pol II is only 45 nt downstream of introns, with the first spliced products observed as introns emerge from Pol II. Perturbations that slow the rate of spliceosome assembly or speed up the rate of transcription caused splicing delays, showing that regulation of both processes determines in vivo splicing profiles. We propose that matched rates streamline the gene expression pathway, while allowing regulation through kinetic competition. Copyright © 2016 Elsevier Inc. All rights reserved.
Complete genome sequence and analysis of the industrial Saccharomyces cerevisiae strain N85 used in Chinese rice wine production.
Chinese rice wine is a popular traditional alcoholic beverage in China, while its brewing processes have rarely been explored. We herein report the first gapless, near-finished genome sequence of the yeast strain Saccharomyces cerevisiae N85 for Chinese rice wine production. Several assembly methods were used to integrate Pacific Biosciences (PacBio) and Illumina sequencing data to achieve high-quality genome sequencing of the strain. The genome encodes more than 6,000 predicted proteins, and 238 long non-coding RNAs, which are validated by RNA-sequencing data. Moreover, our annotation predicts 171 novel genes that are not present in the reference S288c genome. We also identified 65,902 single nucleotide polymorphisms and small indels, many of which are located within genic regions. Dozens of larger copy-number variations and translocations were detected, mainly enriched in the subtelomeres, suggesting these regions may be related to genomic evolution. This study will serve as a milestone in the study of Chinese rice wine and related beverages in China and in other countries. It will help to develop more scientific and modern fermentation processes of Chinese rice wine, and explore metabolism pathways of desired and harmful components in Chinese rice wine to improve its taste and nutritional value. © The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
The basidiomycete yeast Rhodosporidium toruloides (also known as Rhodotorula toruloides) accumulates high concentrations of lipids and carotenoids from diverse carbon sources. It has great potential as a model for the cellular biology of lipid droplets and for sustainable chemical production. We developed a method for high-throughput genetics (RB-TDNAseq), using sequence-barcoded Agrobacterium tumefaciens T-DNA insertions. We identified 1,337 putative essential genes with low T-DNA insertion rates. We functionally profiled genes required for fatty acid catabolism and lipid accumulation, validating results with 35 targeted deletion strains. We identified a high-confidence set of 150 genes affecting lipid accumulation, including genes with predicted function in signaling cascades, gene expression, protein modification and vesicular trafficking, autophagy, amino acid synthesis and tRNA modification, and genes of unknown function. These results greatly advance our understanding of lipid metabolism in this oleaginous species and demonstrate a general approach for barcoded mutagenesis that should enable functional genomics in diverse fungi.