We have produced an updated annotation of the Norway spruce genome on the basis of an in silico normalised set of RNA-Seq data obtained from 1,529 samples and comprising 15.5 billion paired-end Illumina HiSeq reads, complemented by 18 Mbp of PacBio cDNA data (3.2M sequences). In addition to augmenting and refining the previous protein-coding gene annotation, here we focus on the addition of long intergenic non-coding RNA (lincRNA) and micro RNA (miRNA) genes. Beyond non-coding loci, our analyses also identified protein-coding genes that had been missed by the initial genome annotation and enabled us to update the annotation of existing gene models. In particular, splice variant information, as supported by PacBio sequencing reads, has been added to the current annotation, and previously fragmented gene models have been merged by scaffolding disjoint genomic scaffolds on the basis of transcript evidence. Using this refined annotation, a targeted analysis of the lincRNAs enabled their classification as i) deeply conserved, ii) conserved in seed plants, or iii) gymnosperm/conifer specific. Concurrently, complementary analyses were performed as part of the aspen genome project, and a comparative analysis of the lincRNAs conserved in both Norway spruce and Eurasian aspen enabled us to identify conserved and diverged expression profiles. At present, we are delving further into the expression results with the aim of functionally annotating the lincRNA genes by developing a GO annotation based on co-expression network analysis.
Divergent evolution in the genomes of closely related lacertids, Lacerta viridis and L. bilineata, and implications for speciation.
Lacerta viridis and Lacerta bilineata are sister species of European green lizards (eastern and western clades, respectively) that, until recently, were grouped together as the L. viridis complex. Genetic incompatibilities were observed between lacertid populations through crossing experiments, which led to the delineation of two separate species within the L. viridis complex. The population history of these sister species and the processes driving their divergence are unknown. We constructed the first high-quality de novo genome assemblies for both L. viridis and L. bilineata through Illumina and PacBio sequencing, with annotation support provided by transcriptome sequencing of several tissues. To estimate gene flow between the two species and identify factors involved in reproductive isolation, we studied their evolutionary history, identified genomic rearrangements, and detected signatures of selection on non-coding RNA and on protein-coding genes. Here we show that gene flow was primarily unidirectional from L. bilineata to L. viridis after their split at least 1.15 million years ago. We detected positive selection of the non-coding repertoire; mutations in transcription factors; accumulation of divergence through inversions; and selection on genes involved in neural development, reproduction, and behavior, as well as in ultraviolet response, possibly driven by sexual selection, whose contribution to reproductive isolation between these lacertid species needs to be further evaluated. The combination of short and long sequence reads resulted in one of the most complete lizard genome assemblies. The characterization of a diverse array of genomic features provided valuable insights into the demographic history of divergence among European green lizards, as well as key species differences, some of which are candidates that could have played a role in speciation.
In addition, our study generated valuable genomic resources that can be used to address conservation-related issues in lacertids. © The Author(s) 2018. Published by Oxford University Press.
The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita.
Whole-genome duplications are an important source of evolutionary novelties that change the mode and tempo at which genetic elements evolve within a genome. The Cucurbita genus experienced a whole-genome duplication around 30 million years ago, although the evolutionary dynamics of the coding and noncoding genes in this genus have not yet been scrutinized. Here, we analyzed the genomes of four Cucurbita species, including a newly assembled genome of Cucurbita argyrosperma, and compared the gene contents of these species with those of five other members of the Cucurbitaceae family to assess the evolutionary dynamics of protein-coding and long intergenic noncoding RNA (lincRNA) genes after the genome duplication. We report that Cucurbita genomes have a higher protein-coding gene birth-death rate compared with the genomes of the other members of the Cucurbitaceae family. C. argyrosperma gene families associated with pollination and transmembrane transport had significantly faster evolutionary rates. lincRNA families showed high levels of gene turnover throughout the phylogeny, and 67.7% of the lincRNA families in Cucurbita showed evidence of birth from the neofunctionalization of previously existing protein-coding genes. Collectively, our results suggest that the whole-genome duplication in Cucurbita resulted in faster rates of gene family evolution through the neofunctionalization of duplicated genes. Copyright © 2019 The Author. Published by Elsevier Inc. All rights reserved.
A global survey of full-length transcriptome of Ginkgo biloba reveals transcript variants involved in flavonoid biosynthesis
Ginkgo biloba, which contains flavonoids as bioactive components, is widely used in traditional Chinese medicine. Efforts to increase the flavonoid production of medicinal plants through genetic engineering generally focus on the key genes involved in flavonoid biosynthesis. However, the molecular mechanisms underlying such biosynthesis are not yet well understood. To understand these mechanisms, a combination of second-generation sequencing (SGS) and single-molecule real-time (SMRT) sequencing was applied to G. biloba. Eight tissues were sampled for SMRT sequencing to generate a high-quality, full-length transcriptome database. After removing redundant reads from 23.36 Gb of clean reads, 12,954 alternative polyadenylation events, 12,290 alternative splicing (AS) events, 929 fusion transcripts, 2,286 novel transcripts, and 1,270 lncRNAs were predicted. Further analysis identified 7 AS, 5 lncRNA, and 6 fusion gene events associated with flavonoid biosynthesis. Constructing co-expression networks revealed a total of 12 gene modules involving flavonoid metabolism structural genes and transcription factors (TFs). Weighted gene co-expression network analysis (WGCNA) reveals that some hub genes, identified among the TFs and structural genes, operate during the biosynthesis. Seven key hub genes were also identified by analyzing the correlation between gene expression level and flavonoid content. By providing a comprehensive set of reference transcripts, the results highlight the importance of SMRT sequencing of the full-length transcriptome in improving genome annotation and elucidating the gene regulation of flavonoid biosynthesis in G. biloba.
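The last analysis step, correlating gene expression level with flavonoid content, can be illustrated with a minimal sketch. This is not the authors' pipeline: the gene names, the example values, and the `min_abs_r` cutoff of 0.8 are hypothetical, and a plain Pearson correlation stands in for whatever statistic the study actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)

def rank_genes(expression, content, min_abs_r=0.8):
    """Return (gene, r) pairs with |r| >= min_abs_r, strongest first.

    expression: dict mapping gene name -> expression values, one per tissue
    content:    metabolite content per tissue, in the same tissue order
    min_abs_r:  illustrative cutoff (an assumption, not taken from the study)
    """
    scored = [(gene, pearson(values, content)) for gene, values in expression.items()]
    kept = [(gene, r) for gene, r in scored if abs(r) >= min_abs_r]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical expression profiles across four tissues.
expression = {
    "geneA": [1, 2, 3, 4],
    "geneB": [4, 3, 2, 1],
    "geneC": [1, 3, 2, 1],
}
flavonoid_content = [2, 4, 6, 8]
candidates = rank_genes(expression, flavonoid_content)
```

In this toy input, `geneA` and `geneB` track flavonoid content (positively and negatively, respectively) and pass the cutoff, while `geneC` does not.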
Systematic identification of intergenic long-noncoding RNAs in mouse retinas using full-length isoform sequencing.
A large number of long noncoding RNAs (lncRNAs) have been identified in the mouse genome, and increasing evidence over the last decades has revealed their crucial roles in diverse biological processes. Nevertheless, the biological roles of lncRNAs in the mouse retina remain largely unknown due to the lack of a comprehensive annotation of lncRNAs expressed in the retina. In this study, we applied a long-read sequencing strategy to unravel the transcriptomes of developing mouse retinas and identified a total of 940 intergenic lncRNAs (lincRNAs) in embryonic and neonatal retinas, about 13% of which were transcribed from unannotated gene loci. Subsequent analysis revealed that the functions of lincRNAs expressed in mouse retinas were closely related to the physiological roles of this tissue; these included 90 lincRNAs that were differentially expressed after the functional loss of key regulators of retinal ganglion cell (RGC) differentiation. In situ hybridization results demonstrated the enrichment of three lincRNAs adjacent to class IV POU-homeobox genes (linc-3a, linc-3b and linc-3c) in the ganglion cell layer and indicated that they were potentially RGC-specific. In summary, this study systematically annotated the lincRNAs expressed in embryonic and neonatal mouse retinas and implied their crucial regulatory roles in retinal development, such as RGC differentiation.
Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight.
The human genome contains “dark” gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions. Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer’s Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in disease cases but not in controls. While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer’s disease due to insufficient sample size, we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
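The distinction between dark-by-depth and camouflaged positions, and the "percent dark" figure reported per gene body, can be sketched as follows. This is an illustrative sketch only: the thresholds `max_dark_depth` and `low_mapq_fraction`, the input layout, and the function names are assumptions, and the code does not reproduce the algorithm presented in the study.

```python
def classify_positions(total_depth, low_mapq_depth,
                       max_dark_depth=5, low_mapq_fraction=0.9):
    """Label each position of a gene body as 'dark_by_depth', 'camouflaged', or 'ok'.

    total_depth:       reads covering each position
    low_mapq_depth:    the subset of those reads with ambiguous (low-MAPQ) alignment
    max_dark_depth:    positions with this many reads or fewer count as dark by depth
                       (illustrative threshold, an assumption)
    low_mapq_fraction: minimum share of ambiguous reads for a position to count
                       as camouflaged (illustrative threshold, an assumption)
    """
    labels = []
    for depth, low in zip(total_depth, low_mapq_depth):
        if depth <= max_dark_depth:
            labels.append("dark_by_depth")   # too few mappable reads
        elif low / depth >= low_mapq_fraction:
            labels.append("camouflaged")     # covered, but alignments are ambiguous
        else:
            labels.append("ok")
    return labels

def percent_dark(labels):
    """Percentage of positions that are dark by either definition."""
    if not labels:
        return 0.0
    dark = sum(1 for label in labels if label != "ok")
    return 100.0 * dark / len(labels)

# Hypothetical four-position gene body.
labels = classify_positions(total_depth=[30, 2, 40, 20],
                            low_mapq_depth=[1, 0, 38, 5])
fraction = percent_dark(labels)
```

Under these assumed thresholds, a gene body would be "completely dark" when `percent_dark` returns 100.0 and "≥5% dark" when it returns at least 5.0.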