Menu
September 22, 2019

Revealing missing human protein isoforms based on Ab initio prediction, RNA-seq and proteomics.

Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts were highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.


September 22, 2019

Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.


September 22, 2019

Shorter unreported sequences in a RACE-Seq study involving seven tissues confirms ~150 novel transcripts identified in MCF-7 cell line PacBio transcriptome, leaving ~100 non-redundant transcripts exclusive to the cancer cell line.

PacBio sequencing generates much longer reads compared to second-generation sequencing technologies, with a trade-off of lower throughput, higher error rate and more cost per base. The PacBio transcriptome of the breast cancer cell line MCF-7 was found to have ~300 transcripts un-annotated in the current GENCODE (v25) or RefSeq, and missing in the liver, heart and brain PacBio transcriptomes [1]. RACE-sequencing (RACE-seq [2]) extends a well-established method of characterizing cDNA molecules generated by rapid amplification of cDNA ends (RACE [3]) using high-throughput sequencing technologies, reducing costs compared to PacBio. Here, shorter fragments of ~150 transcripts were found to be present in seven tissues analyzed in a recent RACE-seq study (Accid:ERP012249) [4]. These transcripts were not among the ~2500 novel transcripts reported in that study, tested separately here using the genomic coordinates provided, although “all curated novel isoforms were incorporated into the human GENCODE set (v22)” in that study. Non-redundancy analysis of the exclusive transcripts identified one transcript mapping to Chr1 with seven different splice variants, and erroneously mapped to Chr15 (PAC clone 15q11-q13) from the Prader-Willi/Angelman Syndrome region (Accid:AC004137.1). Finally, there are ~100 non-redundant transcripts missing in the seven tissues, in addition to other three tissues analyzed previously. Their absence in GENCODE and RefSeq databases rule them out as commonly transcribed regions, further increasing their likelihood as biomarkers.


September 22, 2019

Integrative analysis of three RNA sequencing methods identifies mutually exclusive exons of MADS-box isoforms during early bud development in Picea abies.

Recent efforts to sequence the genomes and transcriptomes of several gymnosperm species have revealed an increased complexity in certain gene families in gymnosperms as compared to angiosperms. One example of this is the gymnosperm sister clade to angiosperm TM3-like MADS-box genes, which at least in the conifer lineage has expanded in number of genes. We have previously identified a member of this sub-clade, the conifer gene DEFICIENS AGAMOUS LIKE 19 (DAL19), as being specifically upregulated in cone-setting shoots. Here, we show through Sanger sequencing of mRNA-derived cDNA and mapping to assembled conifer genomic sequences that DAL19 produces six mature mRNA splice variants in Picea abies. These splice variants use alternate first and last exons, while their four central exons constitute a core region present in all six transcripts. Thus, they are likely to be transcript isoforms. Quantitative Real-Time PCR revealed that two mutually exclusive first DAL19 exons are differentially expressed across meristems that will form either male or female cones, or vegetative shoots. Furthermore, mRNA in situ hybridization revealed that two mutually exclusive last DAL19 exons were expressed in a cell-specific pattern within bud meristems. Based on these findings in DAL19, we developed a sensitive approach to transcript isoform assembly from short-read sequencing of mRNA. We applied this method to 42 putative MADS-box core regions in P. abies, from which we assembled 1084 putative transcripts. We manually curated these transcripts to arrive at 933 assembled transcript isoforms of 38 putative MADS-box genes. 152 of these isoforms, which we assign to 28 putative MADS-box genes, were differentially expressed across eight female, male, and vegetative buds. We further provide evidence of the expression of 16 out of the 38 putative MADS-box genes by mapping PacBio Iso-Seq circular consensus reads derived from pooled sample sequencing to assembled transcripts. In summary, our analyses reveal the use of mutually exclusive exons of MADS-box gene isoforms during early bud development in P. abies, and we find that the large number of identified MADS-box transcripts in P. abies results not only from expansion of the gene family through gene duplication events but also from the generation of numerous splice variants.


September 22, 2019

Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA processing.

The multifaceted control of gene expression requires tight coordination of regulatory mechanisms at transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription initiation, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing.In MCF-7 breast cancer cells, we find 2700 genes with interdependent alternative transcription initiation, splicing and polyadenylation events, both in proximal and distant parts of mRNA molecules, including examples of coupling between transcription start sites and polyadenylation sites. The analysis of three human primary tissues (brain, heart and liver) reveals similar patterns of interdependency between transcription initiation and mRNA processing events. We predict thousands of novel open reading frames from full-length mRNA sequences and obtained evidence for their translation by shotgun proteomics. The mapping database rescues 358 previously unassigned peptides and improves the assignment of others. By recognizing sample-specific amino-acid changes and novel splicing patterns, full-length mRNA sequencing improves proteogenomics analysis of MCF-7 cells.Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provides a basis to reveal largely unresolved mechanisms that coordinate transcription initiation and mRNA processing.


September 22, 2019

Extensive allele-specific translational regulation in hybrid mice.

Translational regulation is mediated through the interaction between diffusible trans-factors and cis-elements residing within mRNA transcripts. In contrast to extensively studied transcriptional regulation, cis-regulation on translation remains underexplored. Using deep sequencing-based transcriptome and polysome profiling, we globally profiled allele-specific translational efficiency for the first time in an F1 hybrid mouse. Out of 7,156 genes with reliable quantification of both alleles, we found 1,008 (14.1%) exhibiting significant allelic divergence in translational efficiency. Systematic analysis of sequence features of the genes with biased allelic translation revealed that local RNA secondary structure surrounding the start codon and proximal out-of-frame upstream AUGs could affect translational efficiency. Finally, we observed that the cis-effect was quantitatively comparable between transcriptional and translational regulation. Such effects in the two regulatory processes were more frequently compensatory, suggesting that the regulation at the two levels could be coordinated in maintaining robustness of protein expression. © 2015 The Authors. Published under the terms of the CC BY 4.0 license.


September 22, 2019

Genome-wide characterization of human L1 antisense promoter-driven transcripts.

Long INterspersed Element-1 (LINE-1 or L1) is the only autonomously active, transposable element in the human genome. L1 sequences comprise approximately 17 % of the human genome, but only the evolutionarily recent, human-specific subfamily is retrotransposition competent. The L1 promoter has a bidirectional orientation containing a sense promoter that drives the transcription of two proteins required for retrotransposition and an antisense promoter. The L1 antisense promoter can drive transcription of chimeric transcripts: 5′ L1 antisense sequences spliced to the exons of neighboring genes.The impact of L1 antisense promoter activity on cellular transcriptomes is poorly understood. To investigate this, we analyzed GenBank ESTs for messenger RNAs that initiate in the L1 antisense promoter. We identified 988 putative L1 antisense chimeric transcripts, 911 of which have not been previously reported. These appear to be alternative genic transcripts, sense-oriented with respect to gene and initiating near, but typically downstream of, the gene transcriptional start site. In multiple cell lines, L1 antisense promoters display enrichment for YY1 transcription factor and histone modifications associated with active promoters. Global run-on sequencing data support the activity of the L1 antisense promoter. We independently detected 124 L1 antisense chimeric transcripts using long read Pacific Biosciences RNA-seq data. Furthermore, we validated four chimeric transcripts by quantitative RT-PCR and Sanger sequencing and demonstrated that they are readily detectable in many normal human tissues.We present a comprehensive characterization of human L1 antisense promoter-driven transcripts and provide substantial evidence that they are transcribed in a variety of human cell-types. Our findings reveal a new wide-reaching aspect of L1 biology by identifying antisense transcripts affecting as many as 4 % of all human genes.


September 22, 2019

A single-molecule long-read survey of the human transcriptome.

Global RNA studies have become central to understanding biological processes, but methods such as microarrays and short-read sequencing are unable to describe an entire RNA molecule from 5′ to 3′ end. Here we use single-molecule long-read sequencing technology from Pacific Biosciences to sequence the polyadenylated RNA complement of a pooled set of 20 human organs and tissues without the need for fragmentation or amplification. We show that full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5′ ends. For longer RNA molecules more 5′ nucleotides are missing, but complete intron structures are often preserved. In total, we identify ~14,000 spliced GENCODE genes. High-confidence mappings are consistent with GENCODE annotations, but >10% of the alignments represent intron structures that were not previously annotated. As a group, transcripts mapping to unannotated regions have features of long, noncoding RNAs. Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.


September 22, 2019

Proteogenomic analysis reveals alternative splicing and translation as part of the abscisic acid response in Arabidopsis seedlings.

In eukaryotes, mechanisms such as alternative splicing (AS) and alternative translation initiation (ATI) contribute to organismal protein diversity. Specifically, splicing factors play crucial roles in responses to environment and development cues; however, the underlying mechanisms are not well investigated in plants. Here, we report the parallel employment of short-read RNA sequencing, single molecule long-read sequencing and proteomic identification to unravel AS isoforms and previously unannotated proteins in response to abscisic acid (ABA) treatment. Combining the data from the two sequencing methods, approximately 83.4% of intron-containing genes were alternatively spliced. Two AS types, which are referred to as alternative first exon (AFE) and alternative last exon (ALE), were more abundant than intron retention (IR); however, by contrast to AS events detected under normal conditions, differentially expressed AS isoforms were more likely to be translated. ABA extensively affects the AS pattern, indicated by the increasing number of non-conventional splicing sites. This work also identified thousands of unannotated peptides and proteins by ATI based on mass spectrometry and a virtual peptide library deduced from both strands of coding regions within the Arabidopsis genome. The results enhance our understanding of AS and alternative translation mechanisms under normal conditions, and in response to ABA treatment.© 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.


September 22, 2019

High-resolution expression map of the Arabidopsis root reveals alternative splicing and lincRNA regulation.

The extent to which alternative splicing and long intergenic noncoding RNAs (lincRNAs) contribute to the specialized functions of cells within an organ is poorly understood. We generated a comprehensive dataset of gene expression from individual cell types of the Arabidopsis root. Comparisons across cell types revealed that alternative splicing tends to remove parts of coding regions from a longer, major isoform, providing evidence for a progressive mechanism of splicing. Cell-type-specific intron retention suggested a possible origin for this common form of alternative splicing. Coordinated alternative splicing across developmental stages pointed to a role in regulating differentiation. Consistent with this hypothesis, distinct isoforms of a transcription factor were shown to control developmental transitions. lincRNAs were generally lowly expressed at the level of individual cell types, but co-expression clusters provided clues as to their function. Our results highlight insights gained from analysis of expression at the level of individual cell types. Copyright © 2016 Elsevier Inc. All rights reserved.


September 22, 2019

Next generation sequencing technology: Advances and applications.

Impressive progress has been made in the field of Next Generation Sequencing (NGS). Through advancements in the fields of molecular biology and technical engineering, parallelization of the sequencing reaction has profoundly increased the total number of produced sequence reads per run. Current sequencing platforms allow for a previously unprecedented view into complex mixtures of RNA and DNA samples. NGS is currently evolving into a molecular microscope finding its way into virtually every fields of biomedical research. In this chapter we review the technical background of the different commercially available NGS platforms with respect to template generation and the sequencing reaction and take a small step towards what the upcoming NGS technologies will bring. We close with an overview of different implementations of NGS into biomedical research. This article is part of a Special Issue entitled: From Genome to Function. Copyright © 2014 Elsevier B.V. All rights reserved.


September 22, 2019

Hybrid sequencing of full-length cDNA transcripts of stems and leaves in Dendrobium officinale.

Dendrobium officinale is an extremely valuable orchid used in traditional Chinese medicine, so sought after that it has a higher market value than gold. Although the expression profiles of some genes involved in the polysaccharide synthesis have previously been investigated, little research has been carried out on their alternatively spliced isoforms in D. officinale. In addition, information regarding the translocation of sugars from leaves to stems in D. officinale also remains limited. We analyzed the polysaccharide content of D. officinale leaves and stems, and completed in-depth transcriptome sequencing of these two diverse tissue types using second-generation sequencing (SGS) and single-molecule real-time (SMRT) sequencing technology. The results of this study yielded a digital inventory of gene and mRNA isoform expressions. A comparative analysis of both transcriptomes uncovered a total of 1414 differentially expressed genes, including 844 that were up-regulated and 570 that were down-regulated in stems. Of these genes, one sugars will eventually be exported transporter (SWEET) and one sucrose transporter (SUT) are expressed to a greater extent in D. officinale stems than in leaves. Two glycosyltransferase (GT) and four cellulose synthase (Ces) genes undergo a distinct degree of alternative splicing. In the stems, the content of polysaccharides is twice as much as that in the leaves. The differentially expressed GT and transcription factor (TF) genes will be the focus of further study. The genes DoSWEET4 and DoSUT1 are significantly expressed in the stem, and are likely to be involved in sugar loading in the phloem.


September 22, 2019

Characterization of the human ESC transcriptome by hybrid sequencing.

Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.


September 22, 2019

Bypassing the Restriction System To Improve Transformation of Staphylococcus epidermidis.

Staphylococcus epidermidis is the leading cause of infections on indwelling medical devices worldwide. Intrinsic antibiotic resistance and vigorous biofilm production have rendered these infections difficult to treat and, in some cases, require the removal of the offending medical prosthesis. With the exception of two widely passaged isolates, RP62A and 1457, the pathogenesis of infections caused by clinical S. epidermidis strains is poorly understood due to the strong genetic barrier that precludes the efficient transformation of foreign DNA into clinical isolates. The difficulty in transforming clinical S. epidermidis isolates is primarily due to the type I and IV restriction-modification systems, which act as genetic barriers. Here, we show that efficient plasmid transformation of clinical S. epidermidis isolates from clonal complexes 2, 10, and 89 can be realized by employing a plasmid artificial modification (PAM) in Escherichia coli DC10B containing a ?dcm mutation. This transformative technique should facilitate our ability to genetically modify clinical isolates of S. epidermidis and hence improve our understanding of their pathogenesis in human infections.IMPORTANCEStaphylococcus epidermidis is a source of considerable morbidity worldwide. The underlying mechanisms contributing to the commensal and pathogenic lifestyles of S. epidermidis are poorly understood. Genetic manipulations of clinically relevant strains of S. epidermidis are largely prohibited due to the presence of a strong restriction barrier. With the introductions of the tools presented here, genetic manipulation of clinically relevant S. epidermidis isolates has now become possible, thus improving our understanding of S. epidermidis as a pathogen. Copyright © 2017 American Society for Microbiology.


September 22, 2019

Multi-platform analysis reveals a complex transcriptome architecture of a circovirus.

In this study, we used Pacific Biosciences RS II long-read and Illumina HiScanSQ short-read sequencing technologies for the characterization of porcine circovirus type 1 (PCV-1) transcripts. Our aim was to identify novel RNA molecules and transcript isoforms, as well as to determine the exact 5′- and 3′-end sequences of previously described transcripts with single base-pair accuracy. We discovered a novel 3′-UTR length isoform of the Cap transcript, and a non-spliced Cap transcript variant. Additionally, our analysis has revealed a 3′-UTR isoform of Rep and two 5′-UTR isoforms of Rep’ transcripts, and a novel splice variant of the longer Rep’ transcript. We also explored two novel long transcripts, one with a previously identified splice site, and a formerly undetected mRNA of ORF3. Altogether, our methods have identified nine novel RNA molecules, doubling the size of PCV-1 transcriptome that had been known before. Additionally, our investigations revealed an intricate pattern of transcript overlapping, which might produce transcriptional interference between the transcriptional machineries of adjacent genes, and thereby may potentially play a role in the regulation of gene expression in circoviruses. Copyright © 2017 Elsevier B.V. All rights reserved.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.