In this ASHG 2017 presentation, Jonas Korlach, the CSO of PacBio, shared updates on three applications featuring SMRT Sequencing on the Sequel System, highlighting structural variant detection, targeted sequencing and…
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
Satellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Complete Genome Sequence of Microbacterium sp. Strain SGAir0570, Isolated from Tropical Air Collected in Singapore.
Microbacterium sp. strain SGAir0570 was isolated from air samples collected in Singapore. Its genome was assembled using single-molecule real-time sequencing and MiSeq short reads. It has one chromosome with a length of 3.38 Mb and one 59.2-kb plasmid. It contains 3,170 protein-coding genes, 48 tRNAs, and 6 rRNAs. Copyright © 2019 Kalsi et al.
Leuconostoc kimchii strain NKJ218 was isolated from homemade kimchi in South Korea. The whole genome was sequenced using the PacBio RS II and Illumina NovaSeq 6000 platforms. Here, we report a genome sequence of strain NKJ218, which consists of a 1.9-Mbp chromosome and three plasmid contigs. A total of 2,005 coding sequences (CDS) were predicted, including 1,881 protein-coding sequences. Copyright © 2019 Jung et al.
The draft genome sequence of Malassezia restricta KCTC 27527, a clinical isolate from a patient with dandruff, was previously reported. Using the PacBio Sequel platform, we completed and reannotated the genome of M. restricta KCTC 27527 for a better understanding of the genome of this fungus. Copyright © 2019 Cho et al.
Supernumerary B chromosomes (Bs) are extra karyotype units in addition to A chromosomes, and are found in some fungi and thousands of animals and plant species. Bs are uniquely characterized due to their non-Mendelian inheritance, and represent one of the best examples of genomic conflict. Over the last decades, their genetic composition, function and evolution have remained an unresolved query, although a few successful attempts have been made to address these phenomena. A classical concept based on cytogenetics and genetics is that Bs are selfish and abundant with DNA repeats and transposons, and in most cases, they do not carry any function. However, recently, the modern quantum development of high scale multi-omics techniques has shifted B research towards a new-born field that we call “B-omics”. We review the recent literature and add novel perspectives to the B research, discussing the role of new technologies to understand the mechanistic perspectives of the molecular evolution and function of Bs. The modern view states that B chromosomes are enriched with genes for many significant biological functions, including but not limited to the interesting set of genes related to cell cycle and chromosome structure. Furthermore, the presence of B chromosomes could favor genomic rearrangements and influence the nuclear environment affecting the function of other chromatin regions. We hypothesize that B chromosomes might play a key function in driving their transmission and maintenance inside the cell, as well as offer an extra genomic compartment for evolution.
Gammaherpesvirus Readthrough Transcription Generates a Long Non-Coding RNA That Is Regulated by Antisense miRNAs and Correlates with Enhanced Lytic Replication In Vivo.
Gammaherpesviruses, including the human pathogens Epstein-Barr virus (EBV) and Kaposi’s sarcoma-associated herpesvirus (KSHV), are oncogenic viruses that establish lifelong infections in hosts and are associated with the development of lymphoproliferative diseases and lymphomas. Recent studies have shown that the majority of the mammalian genome is transcribed and gives rise to numerous long non-coding RNAs (lncRNAs). Likewise, the large double-stranded DNA virus genomes of herpesviruses undergo pervasive transcription, including the expression of many as yet uncharacterized lncRNAs. Murine gammaherpesvirus 68 (MHV68, MuHV-4, γHV68) is a natural pathogen of rodents, and is genetically and pathogenically related to EBV and KSHV, providing a highly tractable model for studies of gammaherpesvirus biology and pathogenesis. Through the integrated use of parallel data sets from multiple sequencing platforms, we previously resolved transcripts throughout the MHV68 genome, including at least 144 novel transcript isoforms. Here, we sought to molecularly validate novel transcripts identified within the M3/M2 locus, which harbors genes that code for the chemokine binding protein M3, the latency B cell signaling protein M2, and 10 microRNAs (miRNAs). Using strand-specific northern blots, we validated the presence of M3-04, a 3.91 kb polyadenylated transcript that initiates at the M3 transcription start site and reads through the M3 open reading frame (ORF), the M3 poly(A) signal sequence, and the M2 ORF. This unexpected transcript was solely localized to the nucleus, strongly suggesting that it is not translated and instead may function as a lncRNA. Use of an MHV68 mutant lacking two M3-04-antisense pre-miRNA stem loops resulted in highly increased expression of M3-04 and increased virus replication in the lungs of infected mice, demonstrating a key role for these RNAs in regulation of lytic infection. Together these findings suggest the possibility of a tripartite regulatory relationship between the lncRNA M3-04, antisense miRNAs, and the latency gene M2.
Comprehensive transcriptome analysis reveals genes potentially involved in isoflavone biosynthesis in Pueraria thomsonii Benth.
Pueraria thomsonii Benth is an important medicinal plant. Transcriptome sequencing, unigene assembly, the annotation of transcripts and the study of gene expression profiles play vital roles in gene function research. However, the full-length transcriptome of P. thomsonii remains unknown. Here, we obtained 44,339 nonredundant transcripts of P. thomsonii by using the PacBio RS II Isoform and Illumina sequencing platforms, of which 43,195 were annotated genes. Compared with the expression levels in the plant roots, those of transcripts with a |fold change| ≥ 4 and FDR < 0.01 in the leaves or stems were assigned as differentially expressed transcripts (DETs). In total, we found 9,225 DETs, 32 of which came from structural genes that were potentially involved in isoflavone biosynthesis. The expression profiles of 8 structural genes from the RNA-Seq data were validated by qRT-PCR. We identified 437 transcription factors (TFs) that were positively or negatively correlated with at least 1 of the structural genes involved in isoflavone biosynthesis using Pearson correlation coefficients (r) (r > 0.8 or r < -0.8). We also identified a total of 32 microRNAs (miRNAs), which targeted 805 transcripts. The targets of these miRNAs were enriched in functions such as 'ATP binding', 'defense response', 'ADP binding', and 'signal transduction'. Interestingly, MIR156a potentially promoted isoflavone biosynthesis by repressing SBP, and MIR319 promoted isoflavone biosynthesis by repressing TCP and HB-HD-ZIP. Finally, we identified 2,690 alternative splicing events, including that of the structural genes of trans-cinnamate 4-monooxygenase and pullulanase, which are potentially involved in the biosynthesis of isoflavone and starch, respectively, and of three TFs potentially involved in isoflavone biosynthesis. Together, these results provide us with comprehensive insight into the gene expression and regulation of P. thomsonii.
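The two selection rules quoted in the abstract above (a fold-change and FDR cutoff for DETs, and a Pearson r cutoff for TF-gene pairs) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the reading of the fold-change cutoff as a minimum on the absolute value, and the use of a signed fold change (negative meaning lower than in roots) are all assumptions made for illustration.

```python
from math import sqrt


def is_det(fold_change, fdr, min_abs_fc=4.0, max_fdr=0.01):
    """Return True if a transcript passes the DET cutoffs.

    Assumption: fold_change is signed and the cutoff is a minimum
    on its absolute value; fdr must be strictly below max_fdr.
    """
    return abs(fold_change) >= min_abs_fc and fdr < max_fdr


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def is_correlated(r, cutoff=0.8):
    """Return True if r > cutoff or r < -cutoff, as stated in the abstract."""
    return r > cutoff or r < -cutoff
```

With hypothetical expression values, a TF whose profile rises with a structural gene across samples gives r close to 1 and passes `is_correlated`, while a transcript with fold change 3.9 fails `is_det` regardless of its FDR.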
Genomic and transcriptomic insights into the survival of the subaerial cyanobacterium Nostoc flagelliforme in arid and exposed habitats.
The cyanobacterium Nostoc flagelliforme is an extremophile that thrives under extraordinary desiccation and ultraviolet (UV) radiation conditions. To investigate its survival strategies, we performed whole-genome sequencing of N. flagelliforme CCNUN1 and transcriptional profiling of its field populations upon rehydration in BG11 medium. The genome of N. flagelliforme is 10.23 Mb in size and contains 10 825 predicted protein-encoding genes, making it one of the largest complete genomes of cyanobacteria reported to date. Comparative genomics analysis among 20 cyanobacterial strains revealed that genes related to DNA replication, recombination and repair had disproportionately high contributions to the genome expansion. The ability of N. flagelliforme to thrive under extreme abiotic stresses is supported by the acquisition of genes involved in the protection of photosynthetic apparatus, the formation of monounsaturated fatty acids, responses to UV radiation, and a peculiar role of ornithine metabolism. Transcriptome analysis revealed a distinct acclimation strategy to rehydration, including the strong constitutive expression of genes encoding photosystem I assembly factors and the involvement of post-transcriptional control mechanisms of photosynthetic resuscitation. Our results provide insights into the adaptive mechanisms of subaerial cyanobacteria in their harsh habitats and have important implications to understand the evolutionary transition of cyanobacteria from aquatic environments to terrestrial ecosystems. © 2019 Society for Applied Microbiology and John Wiley & Sons Ltd.
Sinella curviseta, among the most widespread springtails (Collembola) in the Northern Hemisphere, has often been treated as a model organism in soil ecology and environmental toxicology. However, limited genetic information severely hinders our understanding of its adaptations to the soil habitat. We present the largest genome assembly within Collembola using ~44.86 Gb (118X) of single-molecule real-time Pacific Biosciences Sequel sequencing. The final assembly of 599 scaffolds was ~381.46 Mb with an N50 length of 3.28 Mb, which captured 95.3% complete and 1.5% partial arthropod Benchmarking Universal Single-Copy Orthologs (n = 1066). Transcripts and a circularized mitochondrial genome were also assembled. We predicted 23,943 protein-coding genes, of which 83.88% were supported by transcriptome-based evidence and 82.49% matched protein records in UniProt. In addition, we also identified 222,501 repeats and 881 noncoding RNAs. Phylogenetic reconstructions for Collembola support Tomoceridae sistered to the remaining Entomobryomorpha with the position of Symphypleona not fully resolved. Gene family evolution analyses identified 9,898 gene families, of which 156 experienced significant expansions or contractions. Our high-quality reference genome of S. curviseta provides the genetic basis for future investigations in evolutionary biology, soil ecology, and ecotoxicology. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
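For readers unfamiliar with the assembly statistics quoted above, the following Python sketch shows how an N50 length and a fold-coverage figure are conventionally computed. It is an illustration under the standard definitions, not the authors' code. The function names are hypothetical, and the scaffold lengths in the usage note are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def fold_coverage(sequenced_bases, assembly_size):
    """Average sequencing depth: bases sequenced per assembled base."""
    if assembly_size <= 0:
        raise ValueError("assembly size must be positive")
    return sequenced_bases / assembly_size
```

For the made-up lengths `[10, 8, 5, 3, 2]` (total 28), the two longest scaffolds already cover 18 bases, so the N50 is 8. Dividing the reported ~44.86 Gb of reads by the ~381.46 Mb assembly gives roughly 117.6, consistent with the 118X stated in the abstract.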
The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita.
Whole-genome duplications are an important source of evolutionary novelties that change the mode and tempo at which genetic elements evolve within a genome. The Cucurbita genus experienced a whole-genome duplication around 30 million years ago, although the evolutionary dynamics of the coding and noncoding genes in this genus have not yet been scrutinized. Here, we analyzed the genomes of four Cucurbita species, including a newly assembled genome of Cucurbita argyrosperma, and compared the gene contents of these species with those of five other members of the Cucurbitaceae family to assess the evolutionary dynamics of protein-coding and long intergenic noncoding RNA (lincRNA) genes after the genome duplication. We report that Cucurbita genomes have a higher protein-coding gene birth-death rate compared with the genomes of the other members of the Cucurbitaceae family. C. argyrosperma gene families associated with pollination and transmembrane transport had significantly faster evolutionary rates. lincRNA families showed high levels of gene turnover throughout the phylogeny, and 67.7% of the lincRNA families in Cucurbita showed evidence of birth from the neofunctionalization of previously existing protein-coding genes. Collectively, our results suggest that the whole-genome duplication in Cucurbita resulted in faster rates of gene family evolution through the neofunctionalization of duplicated genes. Copyright © 2019 The Author. Published by Elsevier Inc. All rights reserved.
Full-length transcriptome analysis of Litopenaeus vannamei reveals transcript variants involved in the innate immune system.
To better understand the immune system of shrimp, this study combined PacBio isoform sequencing (Iso-Seq) and Illumina paired-end short-read sequencing methods to discover full-length immune-related molecules of the Pacific white shrimp, Litopenaeus vannamei. A total of 72,648 nonredundant full-length transcripts (unigenes) were generated with an average length of 2545 bp from five main tissues, including the hepatopancreas, cardiac stomach, heart, muscle, and pyloric stomach. These unigenes exhibited a high annotation rate (62,164, 85.57%) when compared against NR, NT, Swiss-Prot, Pfam, GO, KEGG and COG databases. A total of 7544 putative long noncoding RNAs (lncRNAs) were detected and 1164 nonredundant full-length transcripts (449 UniTransModels) participated in the alternative splicing (AS) events. Importantly, a total of 5279 nonredundant full-length unigenes were successfully identified, which were involved in the innate immune system, including 9 immune-related processes, 19 immune-related pathways and 10 other immune-related systems. We also found extensive transcript variants, which increased the number and functional complexity of immune molecules; for example, toll-like receptors (TLRs) and interferon regulatory factors (IRFs). A total of 480 differentially expressed genes (DEGs) showed significantly higher or tissue-specific expression in the hepatopancreas compared with the other four tested tissues (FDR < 0.05). Furthermore, the expression levels of six selected immune-related DEGs and putative IRFs were validated using real-time PCR technology, substantiating the reliability of the PacBio Iso-Seq results. In conclusion, our results provide new genetic resources of long-read full-length transcript data and information for identifying immune-related genes, which are an invaluable transcriptomic resource as genomic reference, especially for further exploration of the innate immune and defense mechanisms of shrimp. Copyright © 2019 Elsevier Ltd. All rights reserved.
Hybrid sequencing-based personal full-length transcriptomic analysis implicates proteostatic stress in metastatic ovarian cancer.
Comprehensive molecular characterization of myriad somatic alterations and aberrant gene expression at the personal level is key to precision cancer therapy. Yet, limited by current short-read sequencing technology, an individualized catalog of complete genomic and transcriptomic features has thus far remained elusive. Here, we integrated second- and third-generation sequencing platforms to generate a multidimensional dataset on a patient affected by metastatic epithelial ovarian cancer. Whole-genome and hybrid transcriptome dissection captured global genetic and transcriptional variants at previously unparalleled resolution. Particularly, single-molecule mRNA sequencing identified a vast array of unannotated transcripts, novel long noncoding RNAs and gene chimeras, permitting accurate determination of transcription start, splice, polyadenylation and fusion sites. Phylogenetic and enrichment inference of isoform-level measurements implicated early functional divergence and cytosolic proteostatic stress in shaping ovarian tumorigenesis. A complementary imaging-based high-throughput drug screen was performed and subsequently validated, which consistently pinpointed proteasome inhibitors as an effective therapeutic regime by inducing protein aggregates in ovarian cancer cells. Therefore, our study suggests that clinical application of the emerging long-read full-length analysis for improving molecular diagnostics is feasible and informative. An in-depth understanding of the tumor transcriptome complexity allowed by leveraging the hybrid sequencing approach lays the basis to reveal novel and valid therapeutic vulnerabilities in advanced ovarian malignancies.
PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice.
In eukaryotes, alternative splicing (AS) greatly expands the diversity of transcripts. However, it is challenging to accurately determine full-length splicing isoforms. Recently, more studies have taken advantage of Pacific Biosciences (PacBio) long-read sequencing to identify full-length transcripts. Nevertheless, the high error rate of PacBio reads seriously offsets the advantages of long reads, especially for accurately identifying splicing junctions. To best capitalize on the features of long reads, we used Illumina RNA-seq reads to improve PacBio circular consensus sequence (CCS) quality and to validate splicing patterns in the rice transcriptome. We evaluated the impact of CCS accuracy on the number and the validation rate of splicing isoforms, and integrated a comprehensive pipeline of splicing transcripts analysis by Iso-Seq and RNA-seq (STAIR) to identify the full-length multi-exon isoforms in the rice seedling transcriptome (Oryza sativa L. ssp. japonica). STAIR discovered 11 733 full-length multi-exon isoforms, 6599 more than the SMRT Portal RS_IsoSeq pipeline did. Of these splicing isoforms identified, 4453 (37.9%) were missed in assembled transcripts from RNA-seq reads, and 5204 (44.4%), including 268 multi-exon long non-coding RNAs (lncRNAs), were not reported in the MSU_osa1r7 annotation. Some randomly selected unreported splicing junctions were verified by polymerase chain reaction (PCR) amplification. In addition, we investigated alternative polyadenylation (APA) events in transcripts and identified 829 major polyadenylation [poly(A)] site clusters (PACs). The analysis of splicing isoforms and APA events will facilitate the annotation of the rice genome and studies on the expression and polyadenylation of AS genes in different developmental stages or growth conditions of rice. © 2018 The Authors. The Plant Journal © 2018 John Wiley & Sons Ltd.