We have produced an updated annotation of the Norway spruce genome on the basis of an in siliconormalised set of RNA-Seq data obtained from 1,529 samples and comprising 15.5 billion paired-end Illumina HiSeq reads complemented by 18Mbp of PacBio cDNA data (3.2M sequences). In addition to augmenting and refining the previous protein coding gene annotation, here we focus on the addition of long intergenic non-coding RNA (lincRNA) and micro RNA (miRNA) genes. In addition to non-coding loci, our analyses also identified protein coding genes that had been missed by the initial genome annotation and enabled us to update the annotation of existing gene models. In particular, splice variant information, as supported by PacBio sequencing reads, has been added to the current annotation and previously fragmented gene models have been merged by scaffolding disjoint genomic scaffolds on the basis of transcript evidence. Using this refined annotation, a targeted analysis of the lincRNAs enabled their classification as i) deeply conserved, ii) conserved in seed plants iii) gymnosperm/conifer specific. Concurrently, complementary analyses were performed as part of the aspen genome project and the results of a comparative analysis of the lincRNAs conserved in both Norway spruce and Eurasian aspen enabled us to identify conserved and diverged expression profiles. At present, we are delving further into the expression results with the aim to functionally annotate the lincRNA genes, by developing a co-expression network analyses based GO annotation.
In this webinar, Emily Hatas of PacBio shares information about the applications and benefits of SMRT Sequencing in plant and animal biology, agriculture, and industrial research fields. This session contains…
Long-read sequencing technologies like Iso-Seq analysis present researchers with a powerful tool for probing the transcriptomes of many species. The ability to sequence transcripts from end-to-end has revealed transcription complexity…
In this presentation, Elizabeth Tseng explains how PacBio’s full-length RNA Sequencing using the Iso-Seq method can characterize full-length transcripts without the need for computational transcript assembly. The Iso-Seq method is…
In this ASHG workshop presentation, Elizabeth Tseng of PacBio showed how the Iso-Seq method can be used to discover disease-associated alternative splicing. Because this approach to isoform sequencing yields accurate,…
Long-read RNA sequencing (RNA-seq) is promising to transcriptomics studies, however, the alignment of the reads is still a fundamental but non-trivial task due to the sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons, and use them to generate spliced reference sequence to produce refined alignments. deSALT addresses several difficult issues, such as small exons, serious sequencing errors and consensus spliced alignment. Benchmarks demonstrate that this approach has a better ability to produce high-quality full-length alignments, which has enormous potentials to transcriptomics studies.
Large Scale Profiling of Protein Isoforms Using Label-Free Quantitative Proteomics Revealed the Regulation of Nonsense-Mediated Decay in Moso Bamboo (Phyllostachys edulis).
Moso bamboo is an important forest species with a variety of ecological, economic, and cultural values. However, the gene annotation information of moso bamboo is only based on the transcriptome sequencing, lacking the evidence of proteome. The lignification and fiber in moso bamboo leads to a difficulty in the extraction of protein using conventional methods, which seriously hinders research on the proteomics of moso bamboo. The purpose of this study is to establish efficient methods for extracting the total proteins from moso bamboo for following mass spectrometry-based quantitative proteome identification. Here, we have successfully established a set of efficient methods for extracting total proteins of moso bamboo followed by mass spectrometry-based label-free quantitative proteome identification, which further improved the protein annotation of moso bamboo genes. In this study, 10,376 predicted coding genes were confirmed by quantitative proteomics, accounting for 35.8% of all annotated protein-coding genes. Proteome analysis also revealed the protein-coding potential of 1015 predicted long noncoding RNA (lncRNA), accounting for 51.03% of annotated lncRNAs. Thus, mass spectrometry-based proteomics provides a reliable method for gene annotation. Especially, quantitative proteomics revealed the translation patterns of proteins in moso bamboo. In addition, the 3284 transcript isoforms from 2663 genes identified by Pacific BioSciences (PacBio) single-molecule real-time long-read isoform sequencing (Iso-Seq) was confirmed on the protein level by mass spectrometry. Furthermore, domain analysis of mass spectrometry-identified proteins encoded in the same genomic locus revealed variations in domain composition pointing towards a functional diversification of protein isoform. Finally, we found that part transcripts targeted by nonsense-mediated mRNA decay (NMD) could also be translated into proteins. In summary, proteomic analysis in this study improves the proteomics-assisted genome annotation of moso bamboo and is valuable to the large-scale research of functional genomics in moso bamboo. In summary, this study provided a theoretical basis and technical support for directional gene function analysis at the proteomics level in moso bamboo.
Gammaherpesvirus Readthrough Transcription Generates a Long Non-Coding RNA That Is Regulated by Antisense miRNAs and Correlates with Enhanced Lytic Replication In Vivo.
Gammaherpesviruses, including the human pathogens Epstein?Barr virus (EBV) and Kaposi’s sarcoma-associated herpesvirus (KSHV) are oncogenic viruses that establish lifelong infections in hosts and are associated with the development of lymphoproliferative diseases and lymphomas. Recent studies have shown that the majority of the mammalian genome is transcribed and gives rise to numerous long non-coding RNAs (lncRNAs). Likewise, the large double-stranded DNA virus genomes of herpesviruses undergo pervasive transcription, including the expression of many as yet uncharacterized lncRNAs. Murine gammaperherpesvirus 68 (MHV68, MuHV-4, ?HV68) is a natural pathogen of rodents, and is genetically and pathogenically related to EBV and KSHV, providing a highly tractable model for studies of gammaherpesvirus biology and pathogenesis. Through the integrated use of parallel data sets from multiple sequencing platforms, we previously resolved transcripts throughout the MHV68 genome, including at least 144 novel transcript isoforms. Here, we sought to molecularly validate novel transcripts identified within the M3/M2 locus, which harbors genes that code for the chemokine binding protein M3, the latency B cell signaling protein M2, and 10 microRNAs (miRNAs). Using strand-specific northern blots, we validated the presence of M3-04, a 3.91 kb polyadenylated transcript that initiates at the M3 transcription start site and reads through the M3 open reading frame (ORF), the M3 poly(a) signal sequence, and the M2 ORF. This unexpected transcript was solely localized to the nucleus, strongly suggesting that it is not translated and instead may function as a lncRNA. Use of an MHV68 mutant lacking two M3-04-antisense pre-miRNA stem loops resulted in highly increased expression of M3-04 and increased virus replication in the lungs of infected mice, demonstrating a key role for these RNAs in regulation of lytic infection. Together these findings suggest the possibility of a tripartite regulatory relationship between the lncRNA M3-04, antisense miRNAs, and the latency gene M2.
Parallels between natural selection in the cold-adapted crop-wild relative Tripsacum dactyloides and artificial selection in temperate adapted maize.
Artificial selection has produced varieties of domesticated maize that thrive in temperate climates around the world. However, the direct progenitor of maize, teosinte, is indigenous only to a relatively small range of tropical and subtropical latitudes and grows poorly or not at all outside of this region. Tripsacum, a sister genus to maize and teosinte, is naturally endemic to the majority of areas in the western hemisphere where maize is cultivated. A full-length reference transcriptome for Tripsacum dactyloides generated using long-read Iso-Seq data was used to characterize independent adaptation to temperate climates in this clade. Genes related to phospholipid biosynthesis, a critical component of cold acclimation in other cold-adapted plant lineages, were enriched among those genes experiencing more rapid rates of protein sequence evolution in T. dactyloides. In contrast with previous studies of parallel selection, we find that there is a significant overlap between the genes that were targets of artificial selection during the adaptation of maize to temperate climates and those that were targets of natural selection in temperate-adapted T. dactyloides. Genes related to growth, development, response to stimulus, signaling, and organelles were enriched in the set of genes identified as both targets of natural and artificial selection. © 2019 The Authors The Plant Journal © 2019 John Wiley & Sons Ltd.
The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita.
Whole-genome duplications are an important source of evolutionary novelties that change the mode and tempo at which genetic elements evolve within a genome. The Cucurbita genus experienced a whole-genome duplication around 30 million years ago, although the evolutionary dynamics of the coding and noncoding genes in this genus have not yet been scrutinized. Here, we analyzed the genomes of four Cucurbita species, including a newly assembled genome of Cucurbita argyrosperma, and compared the gene contents of these species with those of five other members of the Cucurbitaceae family to assess the evolutionary dynamics of protein-coding and long intergenic noncoding RNA (lincRNA) genes after the genome duplication. We report that Cucurbita genomes have a higher protein-coding gene birth-death rate compared with the genomes of the other members of the Cucurbitaceae family. C. argyrosperma gene families associated with pollination and transmembrane transport had significantly faster evolutionary rates. lincRNA families showed high levels of gene turnover throughout the phylogeny, and 67.7% of the lincRNA families in Cucurbita showed evidence of birth from the neofunctionalization of previously existing protein-coding genes. Collectively, our results suggest that the whole-genome duplication in Cucurbita resulted in faster rates of gene family evolution through the neofunctionalization of duplicated genes. Copyright © 2019 The Author. Published by Elsevier Inc. All rights reserved.
Potential use of the Pteris vittata arsenic hyperaccumulation-regulation network for phytoremediation.
Arsenic accumulation in soil is a global problem typically addressed using phytoremediation methods. Pteris vittata, a model arsenic hyperaccumulator, has great potential as a genetically engineered plant for phytoremediation. However, the lack of omic information on this species has severely limited the identification and application of its arsenic hyperaccumulation and regulation components. In this study, we used an optimized single-molecular real-time (SMRT) strategy to create a de novo full-length transcriptomic-tonoplast proteomic database for this unsequenced fern and to determine the genetic components underlying its arsenic hyperaccumulation-regulation mechanisms. We established a comprehensive network consisting of six major transporter families, two novel resistance pathways, and a regulatory system by examining alternative splicing (AS) and long non-coding RNA (lncRNA) in different tissues following As(III) and As(V) treatment. The database and network established in this study will deepen our understanding of the unique hyperaccumulation and regulation mechanisms of P. vittata, ultimately providing a valuable resource for futher research on phytoremediation of arsenic-contaminated soil. Copyright © 2019 Elsevier B.V. All rights reserved.
Full-length transcriptome analysis of Litopenaeus vannamei reveals transcript variants involved in the innate immune system.
To better understand the immune system of shrimp, this study combined PacBio isoform sequencing (Iso-Seq) and Illumina paired-end short reads sequencing methods to discover full-length immune-related molecules of the Pacific white shrimp, Litopenaeus vannamei. A total of 72,648 nonredundant full-length transcripts (unigenes) were generated with an average length of 2545 bp from five main tissues, including the hepatopancreas, cardiac stomach, heart, muscle, and pyloric stomach. These unigenes exhibited a high annotation rate (62,164, 85.57%) when compared against NR, NT, Swiss-Prot, Pfam, GO, KEGG and COG databases. A total of 7544 putative long noncoding RNAs (lncRNAs) were detected and 1164 nonredundant full-length transcripts (449 UniTransModels) participated in the alternative splicing (AS) events. Importantly, a total of 5279 nonredundant full-length unigenes were successfully identified, which were involved in the innate immune system, including 9 immune-related processes, 19 immune-related pathways and 10 other immune-related systems. We also found wide transcript variants, which increased the number and function complexity of immune molecules; for example, toll-like receptors (TLRs) and interferon regulatory factors (IRFs). The 480 differentially expressed genes (DEGs) were significantly higher or tissue-specific expression patterns in the hepatopancreas compared with that in other four tested tissues (FDR <0.05). Furthermore, the expression levels of six selected immune-related DEGs and putative IRFs were validated using real-time PCR technology, substantiating the reliability of the PacBio Iso-seq results. In conclusion, our results provide new genetic resources of long-read full-length transcripts data and information for identifying immune-related genes, which are an invaluable transcriptomic resource as genomic reference, especially for further exploration of the innate immune and defense mechanisms of shrimp. Copyright © 2019 Elsevier Ltd. All rights reserved.
Hybrid sequencing-based personal full-length transcriptomic analysis implicates proteostatic stress in metastatic ovarian cancer.
Comprehensive molecular characterization of myriad somatic alterations and aberrant gene expressions at personal level is key to precision cancer therapy, yet limited by current short-read sequencing technology, individualized catalog of complete genomic and transcriptomic features is thus far elusive. Here, we integrated second- and third-generation sequencing platforms to generate a multidimensional dataset on a patient affected by metastatic epithelial ovarian cancer. Whole-genome and hybrid transcriptome dissection captured global genetic and transcriptional variants at previously unparalleled resolution. Particularly, single-molecule mRNA sequencing identified a vast array of unannotated transcripts, novel long noncoding RNAs and gene chimeras, permitting accurate determination of transcription start, splice, polyadenylation and fusion sites. Phylogenetic and enrichment inference of isoform-level measurements implicated early functional divergence and cytosolic proteostatic stress in shaping ovarian tumorigenesis. A complementary imaging-based high-throughput drug screen was performed and subsequently validated, which consistently pinpointed proteasome inhibitors as an effective therapeutic regime by inducing protein aggregates in ovarian cancer cells. Therefore, our study suggests that clinical application of the emerging long-read full-length analysis for improving molecular diagnostics is feasible and informative. An in-depth understanding of the tumor transcriptome complexity allowed by leveraging the hybrid sequencing approach lays the basis to reveal novel and valid therapeutic vulnerabilities in advanced ovarian malignancies.
PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice.
In eukaryotes, alternative splicing (AS) greatly expands the diversity of transcripts. However, it is challenging to accurately determine full-length splicing isoforms. Recently, more studies have taken advantage of Pacific Bioscience (PacBio) long-read sequencing to identify full-length transcripts. Nevertheless, the high error rate of PacBio reads seriously offsets the advantages of long reads, especially for accurately identifying splicing junctions. To best capitalize on the features of long reads, we used Illumina RNA-seq reads to improve PacBio circular consensus sequence (CCS) quality and to validate splicing patterns in the rice transcriptome. We evaluated the impact of CCS accuracy on the number and the validation rate of splicing isoforms, and integrated a comprehensive pipeline of splicing transcripts analysis by Iso-Seq and RNA-seq (STAIR) to identify the full-length multi-exon isoforms in rice seedling transcriptome (Oryza sativa L. ssp. japonica). STAIR discovered 11 733 full-length multi-exon isoforms, 6599 more than the SMRT Portal RS_IsoSeq pipeline did. Of these splicing isoforms identified, 4453 (37.9%) were missed in assembled transcripts from RNA-seq reads, and 5204 (44.4%), including 268 multi-exon long non-coding RNAs (lncRNAs), were not reported in the MSU_osa1r7 annotation. Some randomly selected unreported splicing junctions were verified by polymerase chain reaction (PCR) amplification. In addition, we investigated alternative polyadenylation (APA) events in transcripts and identified 829 major polyadenylation [poly(A)] site clusters (PACs). The analysis of splicing isoforms and APA events will facilitate the annotation of the rice genome and studies on the expression and polyadenylation of AS genes in different developmental stages or growth conditions of rice. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.
Analysis of transcripts and splice isoforms in Medicago sativa L. by single-molecule long-read sequencing.
The full-length transcriptome of alfalfa was analyzed with PacBio single-molecule long-read sequencing technology. The transcriptome data provided full-length sequences and gene isoforms of transcripts in alfalfa, which will improve genome annotation and enhance our understanding of the gene structure of alfalfa. As an important forage, alfalfa (Medicago sativa L.) is world-wide planted. For its complexity of genome and unfinished whole genome sequencing, the sequences and complete structure of mRNA transcripts remain unclear in alfalfa. In this study, single-molecule long-read sequencing was applied to investigate the alfalfa transcriptome using the Pacific Biosciences platform, and a total of 113,321 transcripts were obtained from young, mature and senescent leaves. We identified 72,606 open reading frames including 46,616 full-length ORFs, 1670 transcription factors from 54 TF families and 44,040 simple sequence repeats from 30,797 sequences. A total of 7568 alternative splicing events was identified and the majority of alternative splicing events in alfalfa was intron retention. In addition, we identified 17,740 long non-coding RNAs. Our results show the feasibility of deep sequencing full-length RNA from alfalfa transcriptome on a single-molecule level.