The free-living flatworm, Macrostomum lignano, much like its better-known planarian relative, Schmidtea mediterranea, has an impressive regenerative capacity. Following injury, this species can regenerate an almost entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neoblasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to irreversible degeneration of the animal. This set of unique properties makes a subset of flatworms attractive organisms for studying the evolution of pathways involved in tissue self-renewal, cell fate specification, and regeneration. The use of these organisms as models, however, is hampered by the lack of well-assembled and annotated genome sequences, which are fundamental to modern genetic and molecular studies. Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of M. lignano is remarkably complex, with ~75% of its sequence consisting of simple repeats and transposon sequences. This has made high-quality assembly from Illumina reads alone impossible (N50 = 222 bp). We therefore generated 130X coverage with long sequencing reads from the PacBio platform to create a substantially improved assembly with an N50 of 64 Kbp. We complemented the reference genome with an assembled and annotated transcriptome, and used both of these datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution and phylogeny but also of regeneration and somatic pluripotency.
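The N50 values quoted above (222 bp versus 64 Kbp) follow the standard definition: the length of the contig at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation is given below; the function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not real M. lignano data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, a highly fragmented assembly of a repeat-rich genome yields a small value even when the total number of assembled bases is large.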
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
Genome-wide association studies (GWAS) have identified many genomic loci associated with risk for schizophrenia, but unambiguous identification of the relationship between disease-associated variants and specific genes, and in particular their effect on risk-conferring transcripts, has proven difficult. To better understand the specific molecular mechanism(s) at the schizophrenia locus in 11q25, we undertook cis expression quantitative trait loci (cis-eQTL) mapping for this 2 megabase genomic region using postmortem human brain samples. To comprehensively assess the effects of genetic risk upon local expression, we evaluated multiple transcript features (genes, exons, and exon-exon junctions) in multiple brain regions: dorsolateral prefrontal cortex (DLPFC), hippocampus, and caudate. Genetic risk variants were strongly associated with expression of SNX19 transcript features that tag multiple rare classes of SNX19 transcripts, whereas they only weakly affected expression of an exon-exon junction that tags the majority of abundant transcripts. The most prominent class of SNX19 risk-associated transcripts, defined by an exon-exon splice junction between exons 8 and 10 (junc8.10), is predicted to be overexpressed and to encode proteins that lack the characteristic nexin C-terminal domain. Risk alleles were also associated with either increased or decreased expression of multiple additional classes of transcripts. With RACE, molecular cloning, and long-read sequencing, we found a number of novel SNX19 transcripts that further define the set of potential etiological transcripts. We explored epigenetic regulation of SNX19 expression and found that DNA methylation at CpG sites near the primary transcription start site and within exon 2 partially mediates the effects of risk variants on risk-associated expression.
ATAC sequencing revealed that some of the most strongly risk-associated SNPs are located within a region of open chromatin, suggesting a nearby regulatory element is involved. These findings indicate a potentially complex molecular etiology, in which risk alleles for schizophrenia generate epigenetic alterations and dysregulation of multiple classes of SNX19 transcripts.
Targeted Long-Read RNA Sequencing Demonstrates Transcriptional Diversity Driven by Splice-Site Variation in MYBPC3.
To date, clinical sequencing has focused on genomic DNA using targeted panels and exome sequencing. Sequencing of a large hypertrophic cardiomyopathy (HCM) cohort revealed that positive identification of a disease-associated variant was returned in only 32% of patients, with an additional 15% receiving inconclusive results. When genome sequencing fails to reveal causative variants, the transcriptome may provide additional diagnostic clarity. A recent study examining patients with genetically undiagnosed muscle disorders found that RNA sequencing, when used as a complement to exome and whole genome sequencing, had an overall diagnosis rate of 35%.
Genome assembly and gene expression in the American black bear provides new insights into the renal response to hibernation.
The prevalence of chronic kidney disease (CKD) is rising worldwide, and 10-15% of the global population currently suffers from CKD and its complications. Given the increasing prevalence of CKD, there is an urgent need to find novel treatment options. The American black bear (Ursus americanus) copes with months of lowered kidney function and metabolism during hibernation without the devastating effects on metabolism and other consequences observed in humans. In a biomimetic approach to better understand kidney adaptations and physiology in hibernating black bears, we established a high-quality genome assembly. Subsequent RNA-Seq analysis of kidneys comparing gene expression profiles in black bears entering (late fall) and emerging from (early spring) hibernation identified 169 protein-coding genes that were differentially expressed. Of these, 101 genes were downregulated and 68 genes were upregulated after hibernation. Fold changes ranged from 1.8-fold downregulation (RTN4RL2) to 2.4-fold upregulation (CISH). Most notable was the upregulation of cytokine suppression genes (SOCS2, CISH, and SERPINC1) and the lack of increased expression of cytokines and genes involved in inflammation. The identification of these differences in gene expression in the black bear kidney may provide new insights into the prevention and treatment of CKD. © The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
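Differential expression results such as the fold changes reported above are often expressed on a log2 scale, where downregulation is negative and upregulation is positive. The sketch below shows that conversion and checks that the reported gene counts add up; the helper name `signed_fold_to_log2` and the "up"/"down" labels are assumptions made for illustration, not part of the study's pipeline.

```python
import math


def signed_fold_to_log2(fold, direction):
    """Convert an unsigned fold change plus a direction to a log2 fold change.

    `fold` is the magnitude (e.g. 2.4 for "2.4-fold"); `direction` is
    "up" or "down". Downregulation maps to a negative log2 value.
    """
    if fold <= 0:
        raise ValueError("fold change must be positive")
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    value = math.log2(fold)
    return value if direction == "up" else -value


# Counts reported in the abstract: 101 down + 68 up = 169 genes in total.
downregulated, upregulated = 101, 68
total_genes = downregulated + upregulated

# The two extremes reported in the abstract.
cish_log2 = signed_fold_to_log2(2.4, "up")       # CISH, 2.4-fold up
rtn4rl2_log2 = signed_fold_to_log2(1.8, "down")  # RTN4RL2, 1.8-fold down
print(total_genes, round(cish_log2, 2), round(rtn4rl2_log2, 2))
```

On this scale both extremes lie within roughly 1.3 log2 units of zero, which reflects the modest size of the changes described.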
TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts.
Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants. Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions. All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.
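To illustrate what "noncanonical splice junction" means in practice, the sketch below classifies an intron by the dinucleotides at its two ends in the reference sequence. It is not TranscriptClean's implementation: the function names, the 0-based half-open coordinate convention, and the choice to treat only GT-AG on the forward strand as canonical are all assumptions made for this example.

```python
# Assumption: only the major GT-AG motif counts as canonical here.
# Other motif sets can be passed in through the `motifs` parameter.
CANONICAL_MOTIFS = {("GT", "AG")}


def intron_motif(reference, intron_start, intron_end):
    """Return the (donor, acceptor) dinucleotides of an intron.

    Coordinates are 0-based and half-open on the forward strand:
    the intron covers reference[intron_start:intron_end].
    """
    if intron_end - intron_start < 4:
        raise ValueError("intron too short to hold both dinucleotides")
    donor = reference[intron_start:intron_start + 2].upper()
    acceptor = reference[intron_end - 2:intron_end].upper()
    return donor, acceptor


def is_canonical(reference, intron_start, intron_end, motifs=CANONICAL_MOTIFS):
    """True if the intron's end dinucleotides match a canonical motif."""
    return intron_motif(reference, intron_start, intron_end) in motifs


# Hypothetical toy reference: exon (0-5), intron (6-20), exon (21-25).
toy_reference = "AAACAGGTAAGTTTTTTTCAGGCCTT"
print(is_canonical(toy_reference, 6, 21))  # correctly placed junction
print(is_canonical(toy_reference, 7, 22))  # junction shifted by one base
```

A one-base shift of the kind an alignment error in a long read can introduce is enough to turn a GT-AG intron into a noncanonical one, which is why correction against the reference is useful before calling novel isoforms.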
Reconstruction of the full-length transcriptome atlas using PacBio Iso-Seq provides insight into the alternative splicing in Gossypium australe.
Gossypium australe F. Mueller (2n = 2x = 26, G2 genome) possesses valuable characteristics. For example, the delayed gland morphogenesis trait makes cottonseed protein and oil edible while retaining resistance to biotic stress. However, the gene sequences of G. australe and their alternative splicing (AS) remain poorly characterized, which hinders the exploration of species-specific biological morphogenesis. Here, we report the first sequencing of the full-length transcriptome of the Australian wild cotton species, G. australe, using Pacific Biosciences single-molecule long-read isoform sequencing (Iso-Seq) from the pooled cDNA of ten tissues to identify transcript loci and splice isoforms. We reconstructed the G. australe full-length transcriptome and identified 25,246 genes, 86 pre-miRNAs and 1468 lncRNAs. Most genes (12,832, 50.83%) exhibited two or more isoforms, suggesting a high degree of transcriptome complexity in G. australe. A total of 31,448 AS events in five major types were found among the 9944 gene loci. Among these five major types, intron retention was the most frequent, accounting for 68.85% of AS events. A total of 29,718 polyadenylation sites were detected from 14,536 genes, 7900 of which have alternative polyadenylation sites (APA). In addition, based on our AS event annotations, RNA-Seq short reads from germinating seeds showed that differential expression of these events occurred during seed germination. Ten randomly selected AS events were further confirmed by RT-PCR amplification in leaf and germinating seeds. The reconstructed gene sequences and their AS in G. australe would provide information for exploring beneficial characteristics in G. australe.
Our understanding of the pig transcriptome is limited. RNA transcript diversity among nine tissues was assessed using poly(A) selected single-molecule long-read isoform sequencing (Iso-seq) and Illumina RNA sequencing (RNA-seq) from a single White cross-bred pig. Across tissues, a total of 67,746 unique transcripts were observed, including 60.5% predicted protein-coding, 36.2% long non-coding RNA and 3.3% nonsense-mediated decay transcripts. On average, 90% of the splice junctions were supported by RNA-seq within tissue. A large proportion (80%) represented novel transcripts, mostly produced by known protein-coding genes (70%), while 17% corresponded to novel genes. On average, four transcripts per known gene (tpg) were identified, an increase over current EBI (1.9 tpg) and NCBI (2.9 tpg) annotations and closer to the number reported for the human genome (4.2 tpg). Our new pig genome annotation extended more than 6000 known gene borders (5′ end extension, 3′ end extension, or both) compared to EBI or NCBI annotations. We validated a large proportion of these extensions by independent pig poly(A) selected 3′-RNA-seq data, or human FANTOM5 Cap Analysis of Gene Expression data. Further, we detected 10,465 novel genes (81% non-coding) not reported in current pig genome annotations. More than 80% of these novel genes had transcripts detected in > 1 tissue. In addition, more than 80% of novel intergenic genes with at least one transcript detected in liver tissue had H3K4me3 or H3K36me3 peaks mapping to their promoter and gene body, respectively, in independent liver chromatin immunoprecipitation data. These validated results show significant improvement over current pig genome annotations.