Haplotype phasing of genetic variants is important for interpretation of the maize genome, population genetic analysis, and functional genomic analysis of allelic activity. Accordingly, accurate methods for phasing full-length isoforms are essential for functional genomics study. In this study, we performed an isoform-level phasing study in maize, using two inbred lines and their reciprocal crosses, based on single-molecule full-length cDNA sequencing. To phase and analyze full-length transcripts between hybrids and parents, we developed a tool called IsoPhase. Using this tool, we validated the majority of SNPs called against matching short read data and identified cases of allele-specific, gene-level, and isoform-level expression. Our results revealed that maize parental and hybrid lines exhibit different splicing activities. After phasing 6,847 genes in two reciprocal hybrids using embryo, endosperm and root tissues, we annotated the SNPs and identified large-effect genes. In addition, based on single-molecule sequencing, we identified parent-of-origin isoforms in maize hybrids, different novel isoforms between maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase power and accuracy in studies of allelic expression.
Motivation: Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
Characterization of the whole transcriptome of whelk Rapana venosa by single-molecule mRNA sequencing
The veined rapa whelk (Rapana venosa) is an important commercial gastropod in China but is also an important invasive pest worldwide. Lack of genome information restricts investigation of this species. Here we report the first full-length transcriptome database derived from six different developmental stages of the whelk obtained using the PacBio Iso-Seq. As a result, 88,162 high-quality and full-length unigenes were obtained with average length of 2895?bp, among which 70,496 unigenes were successfully annotated, and 192,951 CDS regions, 1182 transcription factors, 284,387 microsatellite loci, and 27,222 lncRNAs were indentified. This study provides a valuable resource for characterizing the R. venosa transcriptome and for future analyses of gene expression profiles.
TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts.
Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants.Our method corrects nearly all mismatches and indels present in a publically available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions.All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.
Hybrid sequencing-based personal full-length transcriptomic analysis implicates proteostatic stress in metastatic ovarian cancer.
Comprehensive molecular characterization of myriad somatic alterations and aberrant gene expressions at personal level is key to precision cancer therapy, yet limited by current short-read sequencing technology, individualized catalog of complete genomic and transcriptomic features is thus far elusive. Here, we integrated second- and third-generation sequencing platforms to generate a multidimensional dataset on a patient affected by metastatic epithelial ovarian cancer. Whole-genome and hybrid transcriptome dissection captured global genetic and transcriptional variants at previously unparalleled resolution. Particularly, single-molecule mRNA sequencing identified a vast array of unannotated transcripts, novel long noncoding RNAs and gene chimeras, permitting accurate determination of transcription start, splice, polyadenylation and fusion sites. Phylogenetic and enrichment inference of isoform-level measurements implicated early functional divergence and cytosolic proteostatic stress in shaping ovarian tumorigenesis. A complementary imaging-based high-throughput drug screen was performed and subsequently validated, which consistently pinpointed proteasome inhibitors as an effective therapeutic regime by inducing protein aggregates in ovarian cancer cells. Therefore, our study suggests that clinical application of the emerging long-read full-length analysis for improving molecular diagnostics is feasible and informative. An in-depth understanding of the tumor transcriptome complexity allowed by leveraging the hybrid sequencing approach lays the basis to reveal novel and valid therapeutic vulnerabilities in advanced ovarian malignancies.
The Populus shoot undergoes primary growth (longitudinal growth) followed by secondary growth (radial growth), which produces biomass that is an important source of energy worldwide. We adopted joint PacBio Iso-Seq and RNA-seq analysis to identify differentially expressed transcripts along a developmental gradient from the shoot apex to the fifth internode of Populus Nanlin895. We obtained 87 150 full-length transcripts, including 2081 new isoforms and 62 058 new alternatively spliced isoforms, most of which were produced by intron retention, that were used to update the Populus annotation. Among these novel isoforms, there are 1187 long non-coding RNAs and 356 fusion genes. Using this annotation, we found 15 838 differentially expressed transcripts along the shoot developmental gradient, of which 1216 were transcription factors (TFs). Only a few of these genes were reported previously. The differential expression of these TFs suggests that they may play important roles in primary and secondary growth. AP2, ARF, YABBY and GRF TFs are highly expressed in the apex, whereas NAC, bZIP, PLATZ and HSF TFs are likely to be important for secondary growth. Overall, our findings provide evidence that long-read sequencing can complement short-read sequencing for cataloguing and quantifying eukaryotic transcripts and increase our understanding of the vital and dynamic process of shoot development. © 2018 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Scylla paramamosain is an important aquaculture crab, which has great economical and nutritional value. To the best of our knowledge, few full-length crab transcriptomes are available. In this study, a library composed of 12 different tissues including gill, hepatopancreas, muscle, cerebral ganglion, eyestalk, thoracic ganglia, intestine, heart, testis, ovary, sperm reservoir, and hemocyte was constructed and sequenced using Pacific Biosciences single-molecule real-time (SMRT) long-read sequencing technology. A total of 284803 full-length non-chimeric reads were obtained, from which 79005 high-quality unique transcripts were obtained after error correction and sequence clustering and redundant. Additionally, a total of 52544 transcripts were annotated against protein database (NCBI nonredundant, Swiss-Prot, KOG, and KEGG database). A total of 23644 long non-coding RNAs (lncRNAs) and 131561 simple sequence repeats (SSRs) were identified. Meanwhile, the isoforms of many genes were also identified in this study. Our study provides a rich set of full-length cDNA sequences for S. paramamosain, which will greatly facilitate S. paramamosain research.
Alternative splicing performs a central role in expanding genomic coding capacity and proteomic diversity. However, programming of splicing patterns in engineered biological systems remains underused. Synthetic approaches thus far have predominantly focused on controlling expression of a single protein through alternative splicing. Here, we describe a modular and extensible platform for regulating four programmable exons that undergo a mutually exclusive alternative splicing event to generate multiple functionally-distinct proteins. We present an intron framework that enforces the mutual exclusivity of two internal exons and demonstrate a graded series of consensus sequence elements of varying strengths that set the ratio of two mutually exclusive isoforms. We apply this framework to program the DNA-binding domains of modular transcription factors to differentially control downstream gene activation. This splicing platform advances an approach for generating diverse isoforms and can ultimately be applied to program modular proteins and increase coding capacity of synthetic biological systems.
Normalization of cDNA is widely used to improve the coverage of rare transcripts in analysis of transcriptomes employing next-generation sequencing. Recently, long-read technology has been emerging as a powerful tool for sequencing and construction of transcriptomes, especially for complex genomes containing highly similar transcripts and transcript-spliced isoforms. Here, we analyzed the transcriptome of sugarcane, with a highly polyploidy plant genome, by PacBio isoform sequencing (Iso-Seq) of two different cDNA library preparations, with and without a normalization step. The results demonstrated that, while the two libraries included many of the same transcripts, many longer transcripts were removed and many new generally shorter transcripts were detected by normalization. For the same input cDNA and the same data yield, the normalized library recovered more total transcript isoforms, number of predicted gene families and orthologous groups, resulting in a higher representation for the sugarcane transcriptome, compared to the non-normalized library. The non-normalized library, on the other hand, included a wider transcript length range with more longer transcripts above ~1.25 kb, more transcript isoforms per gene family and gene ontology terms per transcript. A large proportion of the unique transcripts comprising ~52% of the normalized library were expressed at a lower level than the unique transcripts from the non-normalized library, across three tissue types tested including leaf, stalk and root. About 83% of the total 5,348 predicted long noncoding transcripts was derived from the normalized library, of which ~80% was derived from the lowly expressed fraction. Functional annotation of the unique transcripts suggested that each library enriched different functional transcript fractions. This demonstrated the complementation of the two approaches in obtaining a complete transcriptome of a complex genome at the sequencing depth used in this study.
The transcriptome of Darwin’s bark spider silk glands predicts proteins contributing to dragline silk toughness.
Darwin’s bark spider (Caerostris darwini) produces giant orb webs from dragline silk that can be twice as tough as other silks, making it the toughest biological material. This extreme toughness comes from increased extensibility relative to other draglines. We show C. darwini dragline-producing major ampullate (MA) glands highly express a novel silk gene transcript (MaSp4) encoding a protein that diverges markedly from closely related proteins and contains abundant proline, known to confer silk extensibility, in a unique GPGPQ amino acid motif. This suggests C. darwini evolved distinct proteins that may have increased its dragline’s toughness, enabling giant webs. Caerostris darwini’s MA spinning ducts also appear unusually long, potentially facilitating alignment of silk proteins into extremely tough fibers. Thus, a suite of novel traits from the level of genes to spinning physiology to silk biomechanics are associated with the unique ecology of Darwin’s bark spider, presenting innovative designs for engineering biomaterials.