Accurate transcript structure and abundance inference from RNA sequencing (RNA-seq) data is foundational for molecular discovery. Here we present TACO, a computational method to reconstruct a consensus transcriptome from multiple RNA-seq data sets. TACO employs novel change-point detection to demarcate transcript start and end sites, leading to improved reconstruction accuracy compared with other tools in its class. The tool is available at http://tacorna.github.io and can be readily incorporated into RNA-seq analysis workflows.
Transcript prediction can be modeled as a graph problem where exons are modeled as nodes and reads spanning two or more exons are modeled as exon chains. Pacific Biosciences third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long-read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy.The simulated data and in-house scripts used for this article are available at http://www.cs.helsinki.fi/group/gsa/exon-chains/exon-chains-bib.tar.bz2.
Studies indicate that more than 90% of human genes are alternatively spliced, suggesting the complexity of the transcriptome assembly and analysis. The splicing process is often disrupted, resulting in both functional and non-functional end-products (Sveen et al. 2016) in many cancers. Harnessing the immune system to fight against malignant cancers carrying aberrantly mutated or spliced products is becoming a promising approach to cancer therapy. Advances in immune checkpoint blockade have elicited adaptive immune responses with promising clinical responses to treatments against human malignancies (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Emerging data suggest that recognition of patient-specific mutation-associated cancer antigens (i.e. from alternative splicing isoforms) may allow scientists to dissect the immune response in the activity of clinical immunotherapies (Schumacher and Schreiber 2015). The advent of high-throughput sequencing technology has provided a comprehensive view of both splicing aberrations and somatic mutations across a range of human malignancies, allowing for a deeper understanding of the interplay of various disease mechanisms. Meanwhile, studies show that the number of transcript isoforms reported to date may be limited by the short-read sequencing due to the inherit limitation of transcriptome reconstruction algorithms, whereas long-read sequencing is able to significantly improve the detection of alternative splicing variants since there is no need to assemble full-length transcripts from short reads. The analysis of these high-throughput long-read sequencing data may permit a systematic view of tumor specific peptide epitopes (also known as neoantigens) that could serve as targets for immunotherapy (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Currently, there is no software pipeline available that can efficiently produce mutation-associated cancer antigens from raw high-throughput sequencing data on patient tumor DNA (The Problem with Neoantigen Prediction 2017). In addressing this issue, we introduce a R package that allows the discoveries of peptide epitope candidates, which are the tumor-specific peptide fragments containing potential functional neoantigens. These peptide epitopes consist of structure variants including insertion, deletions, alternative sequences, and peptides from nonsynonymous mutations. Analysis of these precursor candidates with widely used tools such as netMHC allows for the accurate in-silico prediction of neoantigens. The pipeline named neoantigeR is currently hosted in https://github.com/ICBI/neoantigeR.
Exploring the genome and transcriptome of the cave nectar bat Eonycteris spelaea with PacBio long-read sequencing.
In the past two decades, bats have emerged as an important model system to study host-pathogen interactions. More recently, it has been shown that bats may also serve as a new and excellent model to study aging, inflammation, and cancer, among other important biological processes. The cave nectar bat or lesser dawn bat (Eonycteris spelaea) is known to be a reservoir for several viruses and intracellular bacteria. It is widely distributed throughout the tropics and subtropics from India to Southeast Asia and pollinates several plant species, including the culturally and economically important durian in the region. Here, we report the whole-genome and transcriptome sequencing, followed by subsequent de novo assembly, of the E. spelaea genome solely using the Pacific Biosciences (PacBio) long-read sequencing platform.The newly assembled E. spelaea genome is 1.97 Gb in length and consists of 4,470 sequences with a contig N50 of 8.0 Mb. Identified repeat elements covered 34.65% of the genome, and 20,640 unique protein-coding genes with 39,526 transcripts were annotated.We demonstrated that the PacBio long-read sequencing platform alone is sufficient to generate a comprehensive de novo assembled genome and transcriptome of an important bat species. These results will provide useful insights and act as a resource to expand our understanding of bat evolution, ecology, physiology, immunology, viral infection, and transmission dynamics.
Background: Long-read RNA sequencing, such as Pacific Biosciences Iso-Seq method, enables generation of sequencing reads that are 10 kilobases or even longer. These reads are ideal for discovering splice junctions and resolving full-length gene transcripts without time-consuming and error-prone techniques such as transcript assembly and junction inference. Results: Iso-Seq Browser is a Web-based visual analytics tool for long-read RNA sequencing data produced by Pacific Biosciences isoform sequencing (Iso-Seq) techniques. Key features of the Iso-Seq Browser are: 1) an exon-only web-based interface with zooming and exon highlighting for exploring reference gene transcripts and novel gene isoforms, 2) automated grouping of transcripts and isoforms by similarity, 3) many customization features for data exploration and creating publication ready figures, and 4) exporting selected isoforms into fasta files for further analysis. Iso-Seq Browser is written in Python using several scientific libraries. The application and analyses described in this paper are freely available to both academic and commercial users at https://github.com/goeckslab/isoseq-browser Conclusions: Iso-Seq Browser enables interactive genome-wide visual analysis of long RNA sequence reads. Through visualization, highlighting, clustering, and filtering of gene isoforms, ISB makes it simple to identify novel isoforms and novel isoform features such as exons, introns and untranslated regions.
Dynamic transcriptome profiling dataset of vaccinia virus obtained from long-read sequencing techniques.
Poxviruses are large DNA viruses that infect humans and animals. Vaccinia virus (VACV) has been applied as a live vaccine for immunization against smallpox, which was eradicated by 1980 as a result of worldwide vaccination. VACV is the prototype of poxviruses in the investigation of the molecular pathogenesis of the virus. Short-read sequencing methods have revolutionized transcriptomics; however, they are not efficient in distinguishing between the RNA isoforms and transcript overlaps. Long-read sequencing (LRS) is much better suited to solve these problems and also allow direct RNA sequencing. Despite the scientific relevance of VACV, no LRS data have been generated for the viral transcriptome to date.For the deep characterization of the VACV RNA profile, various LRS platforms and library preparation approaches were applied. The raw reads were mapped to the VACV reference genome and also to the host (Chlorocebus sabaeus) genome. In this study, we applied the Pacific Biosciences RSII and Sequel platforms, which altogether resulted in 937,531 mapped reads of inserts (1.42 Gb), while we obtained 2,160,348 aligned reads (1.75 Gb) from the different library preparation methods using the MinION device from Oxford Nanopore Technologies.By applying cutting-edge technologies, we were able to generate a large dataset that can serve as a valuable resource for the investigation of the dynamic VACV transcriptome, the virus-host interactions, and RNA base modifications. These data can provide useful information for novel gene annotations in the VACV genome. Our dataset can also be used to analyze the currently available LRS platforms, library preparation methods, and bioinformatics pipelines.
Androgen receptor variant AR-V9 is co-expressed with AR-V7 in prostate cancer metastases and predicts abiraterone resistance.
Purpose: Androgen receptor (AR) variant AR-V7 is a ligand-independent transcription factor that promotes prostate cancer resistance to AR-targeted therapies. Accordingly, efforts are underway to develop strategies for monitoring and inhibiting AR-V7 in castration-resistant prostate cancer (CRPC). The purpose of this study was to understand whether other AR variants may be co-expressed with AR-V7 and promote resistance to AR-targeted therapies. Experimental Design: We utilized complementary short- and long-read sequencing of intact AR mRNA isoforms to characterize AR expression in CRPC models. Co-expression of AR-V7 and AR-V9 mRNA in CRPC metastases and circulating tumor cells was assessed by RNA-seq and RT-PCR, respectively. Expression of AR-V9 protein in CRPC models was evaluated with polyclonal antisera. Multivariate analysis was performed to test whether AR variant mRNA expression in metastatic tissues was associated with a 12-week progression-free survival endpoint in a prospective clinical trial of 78 CRPC-stage patients initiating therapy with the androgen synthesis inhibitor, abiraterone acetate. Results: AR-V9 was frequently co-expressed with AR-V7. Both AR variant species were found to share a common 3′ terminal cryptic exon, which rendered AR-V9 susceptible to experimental manipulations that were previously-thought to target AR-V7 uniquely. AR-V9 promoted ligand-independent growth of prostate cancer cells. High AR-V9 mRNA expression in CRPC metastases was predictive of primary resistance to abiraterone acetate (HR = 4.0, 95% CI = 1.31-12.2, P = 0.02). Conclusions: AR-V9 may be an important component of therapeutic resistance in CRPC. Copyright ©2017, American Association for Cancer Research.
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop BIISQ, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. BIISQ does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. BIISQ shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios.
Meiotic drivers are selfish genes that bias their transmission into gametes, defying Mendelian inheritance. Despite the significant impact of these genomic parasites on evolution and infertility, few meiotic drive loci have been identified or mechanistically characterized. Here, we demonstrate a complex landscape of meiotic drive genes on chromosome 3 of the fission yeasts Schizosaccharomyces kambucha and S. pombe. We identify S. kambucha wtf4 as one of these genes that acts to kill gametes (known as spores in yeast) that do not inherit the gene from heterozygotes. wtf4 utilizes dual, overlapping transcripts to encode both a gamete-killing poison and an antidote to the poison. To enact drive, all gametes are poisoned, whereas only those that inherit wtf4 are rescued by the antidote. Our work suggests that the wtf multigene family proliferated due to meiotic drive and highlights the power of selfish genes to shape genomes, even while imposing tremendous costs to fertility.
Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis.
RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, hundreds of analysis tools have been developed since it was debuted. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome.RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.
Long-read sequencing (LRS) techniques are very recent advancements, but they have already been used for transcriptome research in all of the three subfamilies of herpesviruses. These techniques have multiplied the number of known transcripts in each of the examined viruses. Meanwhile, they have revealed a so far hidden complexity of the herpesvirus transcriptome with the discovery of a large number of novel RNA molecules, including coding and non-coding RNAs, as well as transcript isoforms, and polycistronic RNAs. Additionally, LRS techniques have uncovered an intricate meshwork of transcriptional overlaps between adjacent and distally located genes. Here, we review the contribution of LRS to herpesvirus transcriptomics and present the complexity revealed by this technology, while also discussing the functional significance of this phenomenon.
In mammals, GPIHBP1 is absolutely essential for transporting lipoprotein lipase (LPL) to the lumen of capillaries, where it hydrolyzes the triglycerides in triglyceride-rich lipoproteins. In all lower vertebrate species (e.g., birds, amphibians, reptiles, fish), a gene for LPL can be found easily, but a gene for GPIHBP1 has never been found. The obvious question is whether the LPL in lower vertebrates is able to reach the capillary lumen. Using purified antibodies against chicken LPL, we showed that LPL is present on capillary endothelial cells of chicken heart and adipose tissue, colocalizing with von Willebrand factor. When the antibodies against chicken LPL were injected intravenously into chickens, they bound to LPL on the luminal surface of capillaries in heart and adipose tissue. LPL was released rapidly from chicken hearts with an infusion of heparin, consistent with LPL being located inside blood vessels. Remarkably, chicken LPL bound in a specific fashion to mammalian GPIHBP1. However, we could not identify a gene for GPIHBP1 in the chicken genome, nor could we identify a transcript for GPIHBP1 in a large chicken RNA-seq data set. We conclude that LPL reaches the capillary lumen in chickens – as it does in mammals – despite an apparent absence of GPIHBP1.
Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics.
Short read massive parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio’s single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we capsulize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing.© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
RNA-sequencing has revolutionized transcriptomics and the way we measure gene expression (Wang et al., 2009). As of today, short-read RNA sequencing is more widely used, and due to its low price and high throughput, is the preferred tool for the quantitative analysis of gene expression. However, the annotation of transcript isoforms is rather difficult using only short-read sequencing data, because the reads are shorter than most transcripts (Steijger et al., 2013). Long-read sequencing, on the other hand, can provide full contig information about transcripts, including exon-connectivity, and its merits in transcriptome profiling are being increasingly acknowledged (Sharon et al., 2013; Abdel-Ghany et al., 2016; Wang et al., 2016; Kuo et al., 2017). Due to the relatively low throughput of current long-read sequencing technologies, they can only characterize smaller transcriptomes in high-depth (Weirather et al., 2017). The Human cytomegalovirus (HCMV) is a ubiquitous betaherpesvirus, which can cause mononucleosis-like symptoms in adults (Cohen and Corey, 1985), and severe life-threatening infections in newborns (Wen et al., 2002). Latent HCMV infection has recently been implicated to affect cancer formation (Dziurzynski et al., 2012; Jin et al., 2014). Examining the transcriptome of the virus can go a long way in helping understand its molecular biology. Short-read RNA sequencing studies have discovered splice junctions and non-coding transcripts (Gatherer et al., 2011) and have shown that the most abundant HCMV transcripts are similarly expressed in different cell types (Cheng et al., 2017). Our long-read RNA sequencing experiments using the Pacific Biosciences (PacBio) RSII platform revealed a great number of transcript isoforms, polycistronic RNAs and transcriptional overlaps (Balázs et al., 2017a).
Genes in prokaryotic genomes are often arranged into clusters and co-transcribed into polycistronic RNAs. Isolated examples of polycistronic RNAs were also reported in some higher eukaryotes but their presence was generally considered rare. Here we developed a long-read sequencing strategy to identify polycistronic transcripts in several mushroom forming fungal species including Plicaturopsis crispa, Phanerochaete chrysosporium, Trametes versicolor, and Gloeophyllum trabeum. We found genome-wide prevalence of polycistronic transcription in these Agaricomycetes, involving up to 8% of the transcribed genes. Unlike polycistronic mRNAs in prokaryotes, these co-transcribed genes are also independently transcribed. We show that polycistronic transcription may interfere with expression of the downstream tandem gene. Further comparative genomic analysis indicates that polycistronic transcription is conserved among a wide range of mushroom forming fungi. In summary, our study revealed, for the first time, the genome prevalence of polycistronic transcription in a phylogenetic range of higher fungi. Furthermore, we systematically show that our long-read sequencing approach and combined bioinformatics pipeline is a generic powerful tool for precise characterization of complex transcriptomes that enables identification of mRNA isoforms not recovered via short-read assembly.