Human embryonic stem cells (hESCs) are in vitro derivatives of the inner cell mass of the blastocyst and are characterized by an undifferentiated, pluripotent state that can be maintained indefinitely. hESCs provide a unique opportunity both to dissect the molecular mechanisms that underlie the maintenance of pluripotency and to model the initiation of differentiation and cell commitment within the developing embryo. To fully understand these mechanisms, it is necessary to accurately identify the specific transcriptome of hESCs. Many distinct gene annotation methods, such as cDNA and EST sequencing and RNA-Seq, have been used to identify the transcriptome of hESCs. Recently, we developed a new tool (IDP) that integrates hybrid sequencing data to characterize a more reliable and comprehensive hESC transcriptome, including many novel transcripts. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.
The performance of RNA sequencing (RNA-seq) aligners and assemblers varies greatly across different organisms and experiments, and often the optimal approach is not known beforehand. Here, we show that the accuracy of transcript reconstruction can be boosted by combining multiple methods, and we present a novel algorithm to integrate multiple RNA-seq assemblies into a coherent transcript annotation. Our algorithm can remove redundancies and select the best transcript models according to user-specified metrics, while solving common artifacts such as erroneous transcript chimerisms. We have implemented this method in an open-source Python3 and Cython program, Mikado, available on GitHub.
Long-read sequencing uncovers transcript features missed by short-read methods.
Transcriptional adaptations during long-term persistence of Staphylococcus aureus in the airways of a cystic fibrosis patient.
The lungs of cystic fibrosis (CF) patients are often colonized and/or infected by Staphylococcus aureus for years, mostly by one predominant clone. For long-term survival in this environment, S. aureus needs to adapt during its interactions with host factors, antibiotics, and other pathogens. Here, we study long-term transcriptional as well as genomic adaptations of an isogenic pair of S. aureus isolates from a single patient using RNA sequencing (RNA-Seq) and whole genome sequencing (WGS). Mimicking in vivo conditions, we cultivated the S. aureus isolates using artificial sputum medium before harvesting RNA for subsequent analysis. We confirmed our RNA-Seq data using quantitative real-time (qRT)-PCR and additionally investigated intermediate isolates from the same patient representing in total 13.2 years of persistence in the CF airways. Comparative RNA-Seq analysis of the first and the last (“late”) isolate revealed significant differences in the late isolate after 13.2 years of persistence. Of the 2545 genes expressed in both isolates that were cultivated aerobically, 256 genes were up- and 161 were down-regulated with a minimum 2-fold change (2f). Focusing on 25 highly (≥8f) up- (n=9) or down- (n=16) regulated genes, we identified several genes encoding virulence factors involved in immune evasion, bacterial spread or secretion (e.g. spa, sak, and esxA). Moreover, these genes displayed similar expression trends under aerobic, microaerophilic and anaerobic conditions. Further qRT-PCR experiments on highly up- or down-regulated genes in intermediate S. aureus isolates revealed different gene expression patterns over the years. Sequencing analysis of the differentially expressed genes and their upstream regions in the late S. aureus isolate revealed only a few genomic alterations. Comparative transcriptomic analysis revealed adaptive changes affecting mainly genes involved in host-pathogen interaction.
Although the underlying mechanisms were not known, our results suggest adaptive processes beyond genomic mutations triggered by local factors rather than by activation of global regulators. Copyright © 2014 The Authors. Published by Elsevier GmbH. All rights reserved.
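The minimum fold-change criterion used in the abstract above (2-fold for regulated genes, 8-fold for highly regulated genes) can be illustrated with a small Python sketch. The function name, the pseudocount, and the example expression values are assumptions made for illustration; the abstract does not describe how the authors computed or tested fold changes.

```python
import math


def classify_fold_change(early, late, min_fold=2.0, pseudocount=1.0):
    """Label a gene 'up', 'down' or 'unchanged' between two isolates.

    early, late: normalized expression values (hypothetical units).
    pseudocount: assumed small offset that avoids division by zero.
    """
    ratio = (late + pseudocount) / (early + pseudocount)
    log2_ratio = math.log2(ratio)
    threshold = math.log2(min_fold)
    if log2_ratio >= threshold:
        return "up"
    if log2_ratio <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical expression values, not data from the study.
print(classify_fold_change(99, 199))              # up (exactly 2-fold)
print(classify_fold_change(199, 99))              # down
print(classify_fold_change(100, 120))             # unchanged
print(classify_fold_change(9, 79, min_fold=8.0))  # up (exactly 8-fold)
```

Working on the log2 scale makes the up and down thresholds symmetric, which is why a single `min_fold` parameter covers both directions.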
Comparison of the mitochondrial genomes and steady state transcriptomes of two strains of the trypanosomatid parasite, Leishmania tarentolae.
U-insertion/deletion RNA editing is a post-transcriptional mitochondrial RNA modification phenomenon required for viability of trypanosomatid parasites. Small guide RNAs encoded mainly by the thousands of catenated minicircles contain the information for this editing. We analyzed by NGS technology the mitochondrial genomes and transcriptomes of two strains, the old lab UC strain and the recently isolated LEM125 strain. PacBio sequencing provided complete minicircle sequences which avoided the assembly problem of short reads caused by the conserved regions. Minicircles were identified by a characteristic size, the presence of three short conserved sequences, a region of inherently bent DNA and the presence of single gRNA genes at a fairly defined location. The LEM125 strain contained over 114 minicircles encoding different gRNAs and the UC strain only ~24 minicircles. Some LEM125 minicircles contained no identifiable gRNAs. Approximate copy numbers of the different minicircle classes in the network were determined by the number of PacBio CCS reads that assembled to each class. Mitochondrial RNA libraries from both strains were mapped against the minicircle and maxicircle sequences. Small RNA reads mapped to the putative gRNA genes but also to multiple regions outside the genes on both strands and large RNA reads mapped in many cases over almost the entire minicircle on both strands. These data suggest that minicircle transcription is complete and bidirectional, with 3′ processing yielding the mature gRNAs. Steady state RNAs in varying abundances are derived from all maxicircle genes, including portions of the repetitive divergent region. The relative extents of editing in both strains correlated with the presence of a cascade of cognate gRNAs. These data should provide the foundation for a deeper understanding of this dynamic genetic system as well as the evolutionary variation of editing in different strains.
HIV-1 infection of primary CD4(+) T cells regulates the expression of specific HERV-K (HML-2) elements.
Endogenous retroviruses (ERVs) occupy extensive regions of the human genome. Although many of these retroviral elements have lost their ability to replicate, those whose insertion took place more recently, such as the HML-2 group of HERV-K elements, still retain intact open reading frames and the capacity to produce certain viral RNA and/or proteins. Transcription of these ERVs is, however, tightly regulated by dedicated epigenetic control mechanisms. Nonetheless, it has been reported that some pathologic states, such as viral infections and certain cancers, coincide with ERV expression, suggesting that transcriptional reawakening is possible. HML-2 elements are reportedly induced during HIV-1 infection, but the conserved nature of these elements has, until recently, rendered their expression profiling problematic. Here, we provide comprehensive HERV-K HML-2 expression profiles specific for productively HIV-1 infected primary human CD4(+) T cells. We combined enrichment of HIV-1 infected cells using a reporter virus expressing a surface reporter for gentle and efficient purification with long-read Single Molecule Real-Time sequencing. We show that three HML-2 proviruses, 6q25.1, 8q24.3, and 19q13.42, are up-regulated on average between 3- and 5-fold in HIV-1 infected CD4(+) T cells. One provirus, HML-2 12q24.33, in contrast, was repressed in the presence of active HIV replication. In conclusion, this report identifies the HERV-K HML-2 loci whose expression profiles differ upon HIV-1 infection in primary human CD4(+) T cells. These data will help pave the way for further studies on the influence of endogenous retroviruses on HIV-1 replication. Importance: Endogenous retroviruses inhabit large portions of our genome. Although they are mainly inert, some of the evolutionarily younger members maintain the ability to express both RNA and proteins.
We have developed an approach based on long-read SMRT sequencing that allows us to obtain detailed and accurate HERV-K HML-2 expression profiles. We have now applied this approach to study HERV-K expression in the presence and absence of productive HIV-1 infection of primary human CD4(+) T cells. In addition to using SMRT sequencing, our strategy also includes the magnetic selection of the infected cells so that levels of background expression due to uninfected cells are kept at a minimum. The results in this manuscript provide the blueprint for in-depth studies of the interactions of the authentic upregulated HERV-K HML-2 elements and HIV-1. Copyright © 2017 American Society for Microbiology.
Global transcript structure resolution of high gene density genomes through multi-platform data integration.
Annotation of herpesvirus genomes has traditionally been undertaken through the detection of open reading frames and other genomic motifs, supplemented with sequencing of individual cDNAs. Second generation sequencing and high-density microarray studies have revealed vastly greater herpesvirus transcriptome complexity than is captured by existing annotation. The pervasive nature of overlapping transcription throughout herpesvirus genomes, however, poses substantial problems in resolving transcript structures using these methods alone. We present an approach that combines the unique attributes of Pacific Biosciences Iso-Seq long-read, Illumina short-read and deepCAGE (Cap Analysis of Gene Expression) sequencing to globally resolve polyadenylated isoform structures in replicating Epstein-Barr virus (EBV). Our method, Transcriptome Resolution through Integration of Multi-platform Data (TRIMD), identifies nearly 300 novel EBV transcripts, quadrupling the size of the annotated viral transcriptome. These findings illustrate an array of mechanisms through which EBV achieves functional diversity in its relatively small, compact genome including programmed alternative splicing (e.g. across the IR1 repeats), alternative promoter usage by LMP2 and other latency-associated transcripts, intergenic splicing at the BZLF2 locus, and antisense transcription and pervasive readthrough transcription throughout the genome. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Alternative transcription start sites (TSSs) have been extensively studied genome-wide for many cell types and have been shown to be important during development and to regulate transcript abundance between cell types. Likewise, single-cell gene expression has been extensively studied for many cell types. However, how single cells use TSSs has not yet been examined. In particular, it is unknown whether alternative TSSs are independently expressed, or whether they are co-activated or even mutually exclusive in single cells. Here, we use a previously published single-cell RNA-seq dataset, comprising thousands of cells, to study alternative TSS usage. We find that alternative TSS usage is a regulated process, and the correlation between two TSSs expressed in single cells of the same cell type is surprisingly high. Our findings indicate that TSSs generally are regulated by common factors rather than being independently regulated or stochastically expressed. © 2017 The Authors. Published under the terms of the CC BY 4.0 license.
Studies indicate that more than 90% of human genes are alternatively spliced, underscoring the complexity of transcriptome assembly and analysis. In many cancers, the splicing process is often disrupted, resulting in both functional and non-functional end-products (Sveen et al. 2016). Harnessing the immune system to fight against malignant cancers carrying aberrantly mutated or spliced products is becoming a promising approach to cancer therapy. Advances in immune checkpoint blockade have elicited adaptive immune responses with promising clinical responses to treatments against human malignancies (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Emerging data suggest that recognition of patient-specific mutation-associated cancer antigens (i.e. from alternative splicing isoforms) may allow scientists to dissect the immune response in the activity of clinical immunotherapies (Schumacher and Schreiber 2015). The advent of high-throughput sequencing technology has provided a comprehensive view of both splicing aberrations and somatic mutations across a range of human malignancies, allowing for a deeper understanding of the interplay of various disease mechanisms. Meanwhile, studies show that the number of transcript isoforms reported to date may be limited by short-read sequencing due to the inherent limitations of transcriptome reconstruction algorithms, whereas long-read sequencing is able to significantly improve the detection of alternative splicing variants since there is no need to assemble full-length transcripts from short reads. The analysis of these high-throughput long-read sequencing data may permit a systematic view of tumor-specific peptide epitopes (also known as neoantigens) that could serve as targets for immunotherapy (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017).
Currently, there is no software pipeline available that can efficiently produce mutation-associated cancer antigens from raw high-throughput sequencing data on patient tumor DNA (The Problem with Neoantigen Prediction 2017). To address this issue, we introduce an R package that allows the discovery of peptide epitope candidates, which are the tumor-specific peptide fragments containing potential functional neoantigens. These peptide epitopes consist of structural variants including insertions, deletions, alternative sequences, and peptides from nonsynonymous mutations. Analysis of these precursor candidates with widely used tools such as netMHC allows for the accurate in-silico prediction of neoantigens. The pipeline, named neoantigeR, is currently hosted at https://github.com/ICBI/neoantigeR.
The genome of an underwater architect, the caddisfly Stenopsyche tienmushanensis Hwang (Insecta: Trichoptera).
Caddisflies (Insecta: Trichoptera) are a highly adapted freshwater group of insects split from a common ancestor with Lepidoptera. They are the most diverse (>16,000 species) of the strictly aquatic insect orders and are widely employed as bio-indicators in water quality assessment and monitoring. Among the numerous adaptations to aquatic habitats, caddisfly larvae use silk and materials from the environment (e.g., stones, sticks, leaf matter) to build composite structures such as fixed retreats and portable cases. Understanding how caddisflies have adapted to aquatic habitats will help explain the evolution and subsequent diversification of the group. We sequenced a retreat-builder caddisfly, Stenopsyche tienmushanensis Hwang, and assembled a high-quality genome from both Illumina and Pacific Biosciences (PacBio) sequencing. In total, 601.2 M Illumina reads (90.2 Gb) and 16.9 M PacBio subreads (89.0 Gb) were generated. The 451.5 Mb assembled genome has a contig N50 of 1.29 Mb, has a longest contig of 4.76 Mb, and covers 97.65% of the 1,658 insect single-copy genes as assessed by Benchmarking Universal Single-Copy Orthologs. The genome comprises 36.76% repetitive elements. A total of 14,672 predicted protein-coding genes were identified. The genome revealed gene expansions in specific groups of the cytochrome P450 family and olfactory binding proteins, suggesting potential genomic features associated with pollutant tolerance and mate finding. In addition, the complete gene complex of the highly repetitive H-fibroin, the major protein component of caddisfly larval silk, was assembled. We report the draft genome of Stenopsyche tienmushanensis, the highest-quality caddisfly genome so far. The genome information will be an important resource for the study of caddisflies and may shed light on the evolution of aquatic insects.
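The contig N50 quoted above is a standard assembly contiguity statistic: the contig length L such that contigs of length at least L together contain at least half of the assembled bases. A minimal Python sketch of that definition follows; the function name and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total = 20, and 6 + 5 = 11 covers at least half.
print(n50([2, 3, 4, 5, 6]))  # 5
```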
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
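The general idea behind a homopolymer compression transformation can be sketched in Python: each run of identical bases is collapsed to a single base, so reads that differ only in homopolymer length become identical before alignment. This is an illustration of the concept only; the function names are hypothetical, and the abstract does not describe how LSC itself implements the transformation.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases into a single base.

    Returns the compressed sequence and the original run lengths,
    so that the transformation can be reversed.
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def homopolymer_expand(compressed, run_lengths):
    """Rebuild the original sequence from its compressed form."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


compressed, runs = homopolymer_compress("AAACCGTTTT")
print(compressed, runs)                      # ACGT [3, 2, 1, 4]
print(homopolymer_expand(compressed, runs))  # AAACCGTTTT

# Two reads that differ only in a homopolymer length compress identically.
print(homopolymer_compress("AAACGT")[0] == homopolymer_compress("AACGT")[0])  # True
```

Keeping the run lengths alongside the compressed sequence is a design choice made here so that positions can be mapped back to the uncompressed read after alignment.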
Transcriptomic studies have demonstrated that the vast majority of the genomes of mammals and other complex organisms is expressed in highly dynamic and cell-specific patterns to produce large numbers of intergenic, antisense and intronic long non-protein-coding RNAs (lncRNAs). Despite well characterized examples, their scaling with developmental complexity, and many demonstrations of their association with cellular processes, development and diseases, lncRNAs have yet to be widely accepted as major players in gene regulation. This may reflect an underappreciation of the extent and precision of the epigenetic control of differentiation and development, where lncRNAs appear to have a central role, likely as organizational and guide molecules: most lncRNAs are nuclear-localized and chromatin-associated, with some involved in the formation of specialized subcellular domains. I suggest that a reassessment of the conceptual framework of genetic information and gene expression in the 4-dimensional ontogeny of spatially organized multicellular organisms is required. Together with this and further studies on their biology, the key challenges now are to determine the structure-function relationships of lncRNAs, which may be aided by emerging evidence of their modular structure, the role of RNA editing and modification in enabling epigenetic plasticity, and the role of RNA signaling in transgenerational inheritance of experience.
IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform.
Here, we sequenced and functionally annotated a long-read (1-2 kb) cDNA library of an infratentorial ependymoma tumor tissue on the PacBio RSII using the Iso-Seq protocol with SMRT technology. A total of 577 MB of data was generated from the brain tissue of an ependymoma tumor patient, producing 119,313 high-quality reads that were assembled into 19,878 contigs using the Celera assembler followed by the Quiver pipeline, which yielded 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared second- and third-level GO terms with alternative splicing data obtained through HTA Array2.0, we identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores, respectively, with alternative splicing indices predicting mainly the major pathways of the hallmarks of cancer. Of these transcript cluster IDs, only those of the genes PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at the gene-level expression (GE) and transcript-level (TE) tracks. Most importantly, the brain-specific genes PNMT, SNN and LAMB1 show involvement in ependymoma.
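The RPKM measure named above has a standard definition: the number of reads mapped to a gene, divided by the exon model length in kilobases and by the total number of mapped reads in millions. A minimal Python sketch of that formula follows; the function name and the example counts are hypothetical and are not values from the study.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads.

    read_count: reads mapped to the gene's exon model.
    exon_length_bp: summed exon length of the gene, in base pairs.
    total_mapped_reads: all mapped reads in the library.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2 kb exon model, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```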
RNA sequencing (RNA-Seq) reveals extremely low levels of reticulocyte-derived globin gene transcripts in peripheral blood from horses (Equus caballus) and cattle (Bos taurus).
RNA-seq has emerged as an important technology for measuring gene expression in peripheral blood samples collected from humans and other vertebrate species. In particular, transcriptomics analyses of whole blood can be used to study immunobiology and develop novel biomarkers of infectious disease. However, an obstacle to these methods in many mammalian species is the presence of reticulocyte-derived globin mRNAs in large quantities, which can complicate RNA-seq library sequencing and impede detection of other mRNA transcripts. A range of supplementary procedures for targeted depletion of globin transcripts have, therefore, been developed to alleviate this problem. Here, we use comparative analyses of RNA-seq data sets generated from human, porcine, equine, and bovine peripheral blood to systematically assess the impact of globin mRNA on routine transcriptome profiling of whole blood in cattle and horses. The results of these analyses demonstrate that total RNA isolated from equine and bovine peripheral blood contains very low levels of globin mRNA transcripts, thereby negating the need for globin depletion and greatly simplifying blood-based transcriptomic studies in these two domestic species.
PacBio RS II is the first commercialized third-generation DNA sequencer able to sequence a single DNA molecule in real time without amplification. PacBio RS II’s sequencing technology is novel and unique, enabling the direct observation of DNA synthesis by DNA polymerase. PacBio RS II confers four major advantages compared to other sequencing technologies: long read lengths, high consensus accuracy, a low degree of bias, and simultaneous capability of epigenetic characterization. These advantages surmount the obstacle of sequencing genomic regions such as high/low G+C, tandem repeat, and interspersed repeat regions. Moreover, PacBio RS II is ideal for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetics characterization. With PacBio RS II, we have sequenced and analyzed the genomes of many species, from viruses to humans. Herein, we summarize and review some of our key genome sequencing projects, including full-length viral sequencing, complete bacterial genome and almost-complete plant genome assemblies, and long amplicon sequencing of a disease-associated gene region. We believe that PacBio RS II is an effective tool not only in the basic biological sciences but also in the medical/clinical setting.