September 22, 2019  |  

Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon

A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly-similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.


September 22, 2019  |  

Construction of a draft reference transcripts of onion (Allium cepa) using long-read sequencing

To obtain intact and full-length RNA transcripts of onion (Allium cepa), long-read sequencing technology was first applied. Total RNAs extracted from four tissues (flowers, leaves, bulbs and roots) of red-purple and yellow onions (A. cepa) were sequenced using long-read sequencing (RSII platform, P4-C2 chemistry). A total of 99,247 polished high-quality isoforms were produced by sequence correction processes of consensus calling, quality filtering, orientation verification, misread-nucleotide correction and dot-matrix view. The dot-matrix view was subsequently used to identify artificial inverted repeats (IRs), and 421 IRs were removed as a result. The remaining 98,826 isoforms were condensed to 35,505 by removing redundant isoforms. To assess the completeness of the 35,505 isoforms, the ratio of full-length isoforms, short-read mapping to the isoforms, and differentially expressed genes among the four tissues were analyzed along with the gene ontology across the tissues. As a result, the 35,505 isoforms were verified as a collection of isoforms with high completeness, and designated as draft reference transcripts (DRTs, ver 1.0) constructed by long-read sequencing.


September 22, 2019  |  

Emergence, retention and selection: A trilogy of origination for functional de novo proteins from ancestral lncRNAs in primates.

While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific de novo genes and report a mechanism for the origination of functional de novo proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly originated de novo proteins is also not beyond anticipation under neutral expectation, as they generally have a longer theoretical lifespan than their current age, because their GC-rich sequences enable stable ORFs with a lower chance of nonsense mutations. Interestingly, although the emergence and retention of these de novo genes are likely driven by neutral forces, a population genetics study in 67 human individuals and 82 macaque animals revealed signatures of purifying selection on these genes specifically in the human population, indicating that a proportion of these newly originated proteins are already functional in humans. We thus propose a mechanism for the creation of functional de novo proteins from ancestral lncRNAs during primate evolution, which may contribute to human-specific genetic novelties by taking advantage of existing genomic contexts.


September 22, 2019  |  

De novo assembly and characterizing of the culm-derived meta-transcriptome from the polyploid sugarcane genome based on coding transcripts

Sugarcane biomass has been used for sugar, bioenergy and biomaterial production. The majority of the sugarcane biomass comes from the culm, which makes it important to understand the genetic control of biomass production in this part of the plant. A meta-transcriptome of the culm was obtained in an earlier study by using about one billion paired-end (150 bp) reads of deep RNA sequencing of samples from 20 diverse sugarcane genotypes and combining de novo assemblies from different assemblers and different settings. Although many genes could be recovered, this resulted in a large combined assembly which created the need for clustering to reduce transcript redundancy while maintaining gene content. Here, we present a comprehensive analysis of the effect of different assembly settings and clustering methods on de novo assembly, annotation and transcript profiling focusing especially on the coding transcripts from the highly polyploid sugarcane genome. The new coding sequence-based transcript clustering resulted in a better representation of transcripts compared to the earlier approach, having 121,987 contigs, which included 78,052 main and 43,935 alternative transcripts. About 73%, 67%, 61% and 10% of the transcriptome was annotated against the NCBI NR protein database, GO terms, orthologous groups and KEGG orthologies, respectively. Using this set for a differential gene expression analysis between the young and mature sugarcane culm tissues, a total of 822 transcripts were found to be differentially expressed, including key transcripts involved in sugar/fiber accumulation in sugarcane. In the context of the lack of a whole genome sequence for sugarcane, the availability of a well annotated culm-derived meta-transcriptome through deep sequencing provides useful information on coding genes specific to the sugarcane culm and will certainly contribute to understanding the process of carbon partitioning, and biomass accumulation in the sugarcane culm.


September 22, 2019  |  

Single-molecule real-time transcript sequencing facilitates common wheat genome annotation and grain transcriptome research.

The large and complex hexaploid genome has greatly hindered genomics studies of common wheat (Triticum aestivum, AABBDD). Here, we investigated transcripts in common wheat developing caryopses using the emerging single-molecule real-time (SMRT) sequencing technology PacBio RSII, and assessed the resultant data for improving common wheat genome annotation and grain transcriptome research. We obtained 197,709 full-length non-chimeric (FLNC) reads, 74.6% of which were estimated to carry a complete open reading frame. A total of 91,881 high-quality FLNC reads were identified and mapped to 16,188 chromosomal loci, corresponding to 13,162 known genes and 3,026 new genes not annotated previously. Although some FLNC reads could not be unambiguously mapped to the current draft genome sequence, many of them are likely useful for studying highly similar homoeologous or paralogous loci or for improving chromosomal contig assembly in further research. The 91,881 high-quality FLNC reads represented 22,768 unique transcripts, 9,591 of which were newly discovered. We found 180 transcripts each spanning two or three previously annotated adjacent loci, suggesting that they should be merged to form correct gene models. Finally, our data facilitated the identification of 6,030 genes differentially regulated during caryopsis development, and full-length transcripts for 72 transcribed gluten gene members that are important for the end-use quality control of common wheat. Our work demonstrated the value of PacBio transcript sequencing for improving common wheat genome annotation through uncovering the loci and full-length transcripts not discovered previously. The resource obtained may aid further structural genomics and grain transcriptome studies of common wheat.


September 22, 2019  |  

Two novel lncRNAs discovered in human mitochondrial DNA using PacBio full-length transcriptome data.

In this study, we established a general framework to use PacBio full-length transcriptome sequencing for the investigation of mitochondrial RNAs. As a result, we produced the first full-length human mitochondrial transcriptome using public PacBio data and characterized the human mitochondrial genome with more comprehensive and accurate information. Other results included determination of the H-strand primary transcript, identification of the ND5/ND6AS/tRNAGluAS transcript, discovery of palindrome small RNAs (psRNAs) and construction of the “mitochondrial cleavage” model. These results, reported for the first time in this study, fundamentally changed annotations of the human mitochondrial genome and enriched knowledge in the field of animal mitochondrial studies. The most important finding was that two novel long non-coding RNAs (lncRNAs), MDL1 and MDL1AS, exist ubiquitously in animal mitochondrial genomes. Copyright © 2017. Published by Elsevier B.V.


September 22, 2019  |  

Analyses of alternative polyadenylation: from old school biochemistry to high-throughput technologies.

Alternations in usage of polyadenylation sites during transcription termination yield transcript isoforms from a gene. Recent findings of transcriptome-wide alternative polyadenylation (APA) as a molecular response to changes in biology position APA not only as a molecular event of early transcriptional termination but also as a cellular regulatory step affecting various biological pathways. With the development of high-throughput profiling technologies at single-nucleotide resolution and their applications targeted to the 3′-end of mRNAs, dynamics in the landscape of mRNA 3′-ends are measurable at a global scale. In this review, methods and technologies that have been adopted to study APA events are discussed. In addition, various bioinformatics algorithms for APA isoform analysis using publicly available RNA-seq datasets are introduced. [BMB Reports 2017; 50(4): 201-207].


September 22, 2019  |  

The full transcription map of mouse papillomavirus type 1 (MmuPV1) in mouse wart tissues.

Mouse papillomavirus type 1 (MmuPV1) provides, for the first time, the opportunity to study infection and pathogenesis of papillomaviruses in the context of laboratory mice. In this report, we define the transcriptome of MmuPV1 genome present in papillomas arising in experimentally infected mice using a combination of RNA-seq, PacBio Iso-seq, 5′ RACE, 3′ RACE, primer-walking RT-PCR, RNase protection, Northern blot and in situ hybridization analyses. We demonstrate that the MmuPV1 genome is transcribed unidirectionally from five major promoters (P) or transcription start sites (TSS) and polyadenylates its transcripts at two major polyadenylation (pA) sites. We designate the P7503, P360 and P859 as “early” promoters because they give rise to transcripts mostly utilizing the polyadenylation signal at nt 3844 and therefore can only encode early genes, and P7107 and P533 as “late” promoters because they give rise to transcripts utilizing polyadenylation signals at either nt 3844 or nt 7047, the latter being able to encode late, capsid proteins. MmuPV1 genome contains five splice donor sites and three acceptor sites that produce thirty-six RNA isoforms deduced to express seven predicted early gene products (E6, E7, E1, E1^M1, E1^M2, E2 and E8^E2) and three predicted late gene products (E1^E4, L2 and L1). The majority of the viral early transcripts are spliced once from nt 757 to 3139, while viral late transcripts, which are predicted to encode L1, are spliced twice, first from nt 7243 to either nt 3139 (P7107) or nt 757 to 3139 (P533) and second from nt 3431 to nt 5372. Thirteen of these viral transcripts were detectable by Northern blot analysis, with the P533-derived late E1^E4 transcripts being the most abundant. The late transcripts could be detected in highly differentiated keratinocytes of MmuPV1-infected tissues as early as ten days after MmuPV1 inoculation and correlated with detection of L1 protein and viral DNA amplification. 
In mature warts, detection of L1 was also found in more poorly differentiated cells, as previously reported. Subclinical infections were also observed. The comprehensive transcription map of MmuPV1 generated in this study provides further evidence that MmuPV1 is similar to high-risk cutaneous beta human papillomaviruses. The knowledge revealed will facilitate the use of MmuPV1 as an animal virus model for understanding of human papillomavirus gene expression, pathogenesis and immunology.


September 22, 2019  |  

Transcriptome profiling using Illumina- and SMRT-based RNA-seq of hot pepper for in-depth understanding of genes involved in CMV infection.

Hot pepper (Capsicum annuum L.) is becoming an increasingly important vegetable crop in the world. Cucumber mosaic virus (CMV) is a destructive virus that can cause leaf distortion and fruit lesions, affecting pepper production. However, studies on the response to CMV infection in pepper at the transcriptional level are limited. In this study, the transcript profiles of pepper leaves after CMV infection were investigated using Illumina and single-molecule real-time (SMRT) RNA-sequencing (RNA-seq). A total of 2,143 differentially expressed genes (DEGs) were identified at five different stages. Gene ontology (GO) and KEGG analysis revealed that these DEGs were involved in the response to stress, defense response and plant-pathogen interaction pathways. Among these DEGs, several key genes that consistently appeared in studies of plant-pathogen interactions had increased transcript abundance after inoculation, including chitinase, pathogenesis-related (PR) protein, TMV resistance protein, WRKY transcription factor and jasmonate ZIM-domain protein. Four of these DEGs were further validated by quantitative real-time RT-PCR (qRT-PCR). Furthermore, a total of 73,597 alternative splicing (AS) events were identified in the pepper leaves after CMV infection, distributed in 12,615 genes. The intron retention of WRKY33 (Capana09g001251) might be involved in the regulation of CMV infection. Taken together, our study provides a transcriptome-wide insight into the molecular basis of resistance to CMV infection in pepper leaves and potential candidate genes for improving resistant cultivars. Copyright © 2018 Elsevier B.V. All rights reserved.


September 22, 2019  |  

Single-cell multiomics: multiple measurements from single cells.

Single-cell sequencing provides information that is not confounded by genotypic or phenotypic heterogeneity of bulk samples. Sequencing of one molecular type (RNA, methylated DNA or open chromatin) in a single cell, furthermore, provides insights into the cell’s phenotype and links to its genotype. Nevertheless, only by taking measurements of these phenotypes and genotypes from the same single cells can such inferences be made unambiguously. In this review, we survey the first experimental approaches that assay, in parallel, multiple molecular types from the same single cell, before considering the challenges and opportunities afforded by these and future technologies. Copyright © 2016. Published by Elsevier Ltd.


September 22, 2019  |  

Genome-wide analysis of complex wheat gliadins, the dominant carriers of celiac disease epitopes.

Gliadins, specified by six compound chromosomal loci (Gli-A1/B1/D1 and Gli-A2/B2/D2) in hexaploid bread wheat, are the dominant carriers of celiac disease (CD) epitopes. Because of their complexity, genome-wide characterization of gliadins is a strong challenge. Here, we approached this challenge by combining transcriptomic, proteomic and bioinformatic investigations. Through third-generation RNA sequencing, full-length transcripts were identified for 52 gliadin genes in the bread wheat cultivar Xiaoyan 81. Of them, 42 were active and predicted to encode 25 α-, 11 γ-, one δ- and five ω-gliadins. Comparative proteomic analysis between Xiaoyan 81 and six newly developed mutants each lacking one Gli locus indicated the accumulation of 38 gliadins in the mature grains. A novel group of α-gliadins (the CSTT group) was recognized to contain very few or no CD epitopes. The δ-gliadins identified here or previously did not carry CD epitopes. Finally, the mutant lacking Gli-D2 showed significant reductions in the most celiac-toxic α-gliadins and derivative CD epitopes. The insights and resources generated here should aid further studies on gliadin functions in CD and the breeding of healthier wheat.


September 22, 2019  |  

Global identification of the full-length transcripts and alternative splicing related to phenolic acid biosynthetic genes in Salvia miltiorrhiza.

Salvianolic acids are among the main bioactive components in Salvia miltiorrhiza, and their biosynthesis has attracted widespread interest. However, previous studies on the biosynthesis of phenolic acids using next-generation sequencing platforms are limited with regard to the assembly of full-length transcripts. Based on hybrid-seq (next-generation and single-molecule real-time sequencing) of the S. miltiorrhiza root transcriptome, we experimentally identified 15 full-length transcripts and four alternative splicing events of enzyme-coding genes involved in the biosynthesis of rosmarinic acid. Moreover, we demonstrate that lithospermic acid B accumulates in the phloem and xylem of roots, in agreement with the expression patterns of the identified key genes related to rosmarinic acid biosynthesis. According to co-expression patterns, we predicted that six candidate cytochrome P450s and five candidate laccases participate in the salvianolic acid pathway. Our results provide a valuable resource for further investigation into the synthetic biology of phenolic acids in S. miltiorrhiza.


September 22, 2019  |  

Single-cell RNAseq for the study of isoforms-how is that possible?

Single-cell RNAseq and alternative splicing studies have recently become two of the most prominent applications of RNAseq. However, the combination of both is still challenging, and few research efforts have been dedicated to the intersection between them. Cell-level insight on isoform expression is required to fully understand the biology of alternative splicing, but it is still an open question to what extent isoform expression analysis at the single-cell level is actually feasible. Here, we establish a set of four conditions that are required for a successful single-cell-level isoform study and evaluate how these conditions are met by these technologies in published research.


September 22, 2019  |  

ISOdb: A comprehensive database of full-length isoforms generated by Iso-Seq.

The accurate landscape of transcript isoforms plays an important role in the understanding of gene function and gene regulation. However, building complete transcripts is very challenging for short reads generated using next-generation sequencing. Fortunately, isoform sequencing (Iso-Seq) using single-molecule sequencing technologies, such as PacBio SMRT, provides long reads spanning entire transcript isoforms which do not require assembly. Therefore, we have developed ISOdb, a comprehensive resource database for hosting and carrying out an in-depth analysis of Iso-Seq datasets and visualising the full-length transcript isoforms. The current version of ISOdb has collected 93 publicly available Iso-Seq samples from eight species and presents the samples in two levels: (1) sample level, including metainformation, long read distribution, isoform numbers, and alternative splicing (AS) events of each sample; (2) gene level, including the total isoforms, novel isoform number, novel AS number, and isoform visualisation of each gene. In addition, ISOdb provides a user interface in the website for uploading sample information to facilitate the collection and analysis of researchers’ datasets. Currently, ISOdb is the first repository that offers comprehensive resources and convenient public access for hosting, analysing, and visualising Iso-Seq data, which is freely available.


September 22, 2019  |  

High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing.

Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

