The free-living flatworm, Macrostomum lignano, much like its better known planarian relative, Schmidtea mediterranea, has an impressive regenerative capacity. Following injury, this species has the ability to regenerate an almost entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neoblasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to irreversible degeneration of the animal. This set of unique properties makes a subset of flatworms attractive organisms for studying the evolution of pathways involved in tissue self-renewal, cell fate specification, and regeneration. The use of these organisms as models, however, is hampered by the lack of well-assembled and annotated genome sequences, which are fundamental to modern genetic and molecular studies. Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of M. lignano is remarkably complex, with ~75% of its sequence consisting of simple repeats and transposon sequences. This has made high-quality assembly from Illumina reads alone impossible (N50=222 bp). We therefore generated 130X coverage by long sequencing reads from the PacBio platform to create a substantially improved assembly with an N50 of 64 Kbp. We complemented the reference genome with an assembled and annotated transcriptome, and used both of these datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution and phylogeny but also of regeneration and somatic pluripotency.
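The abstract above reports assembly contiguity as N50 (222 bp from Illumina reads alone versus 64 Kbp with PacBio reads). As a reminder of what that metric measures, here is a minimal Python sketch of the standard N50 definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the toy contig lengths are illustrative assumptions, not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches at least half of the summed assembly length; the
    length of the contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, total = 32, half = 16):
# the two longest contigs (8 + 8) already reach 16, so N50 is 8.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

A higher N50 means that more of the assembly sits in long contigs, which is why the long-read assembly is described as substantially improved.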
We have produced an updated annotation of the Norway spruce genome on the basis of an in silico normalised set of RNA-Seq data obtained from 1,529 samples and comprising 15.5 billion paired-end Illumina HiSeq reads complemented by 18 Mbp of PacBio cDNA data (3.2M sequences). In addition to augmenting and refining the previous protein coding gene annotation, here we focus on the addition of long intergenic non-coding RNA (lincRNA) and micro RNA (miRNA) genes. In addition to non-coding loci, our analyses also identified protein coding genes that had been missed by the initial genome annotation and enabled us to update the annotation of existing gene models. In particular, splice variant information, as supported by PacBio sequencing reads, has been added to the current annotation, and previously fragmented gene models have been merged by scaffolding disjoint genomic scaffolds on the basis of transcript evidence. Using this refined annotation, a targeted analysis of the lincRNAs enabled their classification as i) deeply conserved, ii) conserved in seed plants, or iii) gymnosperm/conifer specific. Concurrently, complementary analyses were performed as part of the aspen genome project, and the results of a comparative analysis of the lincRNAs conserved in both Norway spruce and Eurasian aspen enabled us to identify conserved and diverged expression profiles. At present, we are delving further into the expression results with the aim of functionally annotating the lincRNA genes by developing a GO annotation based on co-expression network analysis.
A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual risk variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430). Long-read sequencing together with hybrid-capture targeting technologies provides a powerful combination to target candidate genes/transcripts of interest. Shearing the genomic DNA to ~5 kb fragments and then capturing with probes that span the whole gene(s) of interest can provide uniform coverage across the entire region, identifying variants and allowing for phasing into two haplotypes. Furthermore, capturing full-length cDNA from the same sample using the same capture probes can also provide an understanding of the isoforms that are generated and allow them to be assigned to their corresponding haplotype. Here we present a method for capturing genomic DNA and cDNA from an AD sample using a panel of probes targeting approximately 20 late-onset AD candidate genes, including CLU, ABCA7, CD33, TREM2, TOMM40, PSEN2, APH1 and BIN1. By combining xGen® Lockdown® probes with SMRT Sequencing, we provide completely sequenced candidate genes as well as their corresponding transcripts. In addition, we are also able to evaluate structural variants that, due to their size, repetitive nature, or low sequence complexity, have been unsequenceable using short-read technologies.
ASHG PacBio Workshop: Identification and characterization of informative genetic structural variants for neurodegenerative diseases
Michael Lutz, from the Duke University Medical Center, discussed a recently published software tool that can now be used in a pipeline with SMRT Sequencing data to find structural variant…
User Group Meeting: From long reads to transcript function: Bioinformatics tools for Iso-transcriptomics analysis
In this PacBio User Group Meeting presentation, Ana Conesa Cegarra from the University of Florida spoke about Iso-Seq analysis tools developed by her group, which created the popular SQANTI tools…
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics. © 2019 The Royal Entomological Society.
Full-length mRNA sequencing and gene expression profiling reveal broad involvement of natural antisense transcript gene pairs in pepper development and response to stresses.
Pepper is an important vegetable with great economic value and unique biological features. In the past few years, significant progress has been made towards understanding the huge, complex pepper genome; however, pepper functional genomics has not been well studied. To better understand the pepper gene structure and pepper gene regulation, we conducted full-length mRNA sequencing by PacBio sequencing and obtained 57862 high-quality full-length mRNA sequences derived from 18362 previously annotated and 5769 newly detected genes. New gene models were built that combined the full-length mRNA sequences and corrected approximately 500 fragmented gene models from previous annotations. Based on the full-length mRNA, we identified 4114 and 5880 pepper genes forming natural antisense transcript (NAT) genes in-cis and in-trans, respectively. Most of these genes accumulate small RNAs in their overlapping regions. By analyzing these NAT gene expression patterns in our transcriptome data, we identified many NAT pairs responsive to a variety of biological processes in pepper. Pepper formate dehydrogenase 1 (FDH1), which is required for R-gene-mediated disease resistance, may be regulated by nat-siRNAs and participate in a positive feedback loop in salicylic acid biosynthesis during resistance responses. Several cis-NAT pairs and subgroups of trans-NAT genes were responsive to pepper pericarp and placenta development, which may play roles in capsanthin and capsaicin biosynthesis. Using a comparative genomics approach, the evolutionary mechanisms of cis-NATs were investigated, and we found that an increase in intergenic sequences accounted for the loss of most cis-NATs, while transposon insertion contributed to the formation of most new cis-NATs. This article is protected by copyright. All rights reserved.
Chromosomal-level assembly of the blood clam, Scapharca (Anadara) broughtonii, using long sequence reads and Hi-C.
The blood clam, Scapharca (Anadara) broughtonii, is an economically and ecologically important marine bivalve of the family Arcidae. Efforts to study their population genetics, breeding, cultivation, and stock enrichment have been somewhat hindered by the lack of a reference genome. Herein, we report the complete genome sequence of S. broughtonii, a first reference genome of the family Arcidae. A total of 75.79 Gb clean data were generated with the Pacific Biosciences and Oxford Nanopore platforms, which represented approximately 86× coverage of the S. broughtonii genome. De novo assembly of these long reads resulted in an 884.5-Mb genome, with a contig N50 of 1.80 Mb and scaffold N50 of 45.00 Mb. Genome Hi-C scaffolding resulted in 19 chromosomes containing 99.35% of bases in the assembled genome. Genome annotation revealed that nearly half of the genome (46.1%) is composed of repeated sequences, while 24,045 protein-coding genes were predicted and 84.7% of them were annotated. We report here a chromosomal-level assembly of the S. broughtonii genome based on long-read sequencing and Hi-C scaffolding. The genomic data can serve as a reference for the family Arcidae and will provide a valuable resource for the scientific community and aquaculture sector. © The Author(s) 2019. Published by Oxford University Press.
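The "approximately 86× coverage" quoted above can be checked with simple arithmetic: sequencing depth is the total number of sequenced bases divided by the genome size. The sketch below uses the 884.5-Mb assembly size as the denominator, which is an assumption made here for illustration; the authors may have used a separate genome-size estimate. The function name is hypothetical.

```python
def fold_coverage(total_bases, genome_size_bases):
    """Average sequencing depth: total sequenced bases / genome size."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bases


# Figures quoted in the abstract: 75.79 Gb of clean long-read data
# and an assembled genome of 884.5 Mb (assumed denominator).
clean_data_bases = 75.79e9
assembly_size_bases = 884.5e6

depth = fold_coverage(clean_data_bases, assembly_size_bases)
print(f"Estimated depth: {depth:.1f}x")
```

The result rounds to the 86× reported in the abstract, so the two figures are consistent under this assumption.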
Divergent evolution in the genomes of closely related lacertids, Lacerta viridis and L. bilineata, and implications for speciation.
Lacerta viridis and Lacerta bilineata are sister species of European green lizards (eastern and western clades, respectively) that, until recently, were grouped together as the L. viridis complex. Genetic incompatibilities were observed between lacertid populations through crossing experiments, which led to the delineation of two separate species within the L. viridis complex. The population history of these sister species and processes driving divergence are unknown. We constructed the first high-quality de novo genome assemblies for both L. viridis and L. bilineata through Illumina and PacBio sequencing, with annotation support provided from transcriptome sequencing of several tissues. To estimate gene flow between the two species and identify factors involved in reproductive isolation, we studied their evolutionary history, identified genomic rearrangements, detected signatures of selection on non-coding RNA, and on protein-coding genes. Here we show that gene flow was primarily unidirectional from L. bilineata to L. viridis after their split at least 1.15 million years ago. We detected positive selection of the non-coding repertoire; mutations in transcription factors; accumulation of divergence through inversions; selection on genes involved in neural development, reproduction, and behavior, as well as in ultraviolet-response, possibly driven by sexual selection, whose contribution to reproductive isolation between these lacertid species needs to be further evaluated. The combination of short and long sequence reads resulted in one of the most complete lizard genome assemblies. The characterization of a diverse array of genomic features provided valuable insights into the demographic history of divergence among European green lizards, as well as key species differences, some of which are candidates that could have played a role in speciation.
In addition, our study generated valuable genomic resources that can be used to address conservation-related issues in lacertids. © The Author(s) 2018. Published by Oxford University Press.
Gammaherpesvirus Readthrough Transcription Generates a Long Non-Coding RNA That Is Regulated by Antisense miRNAs and Correlates with Enhanced Lytic Replication In Vivo.
Gammaherpesviruses, including the human pathogens Epstein-Barr virus (EBV) and Kaposi’s sarcoma-associated herpesvirus (KSHV), are oncogenic viruses that establish lifelong infections in hosts and are associated with the development of lymphoproliferative diseases and lymphomas. Recent studies have shown that the majority of the mammalian genome is transcribed and gives rise to numerous long non-coding RNAs (lncRNAs). Likewise, the large double-stranded DNA virus genomes of herpesviruses undergo pervasive transcription, including the expression of many as yet uncharacterized lncRNAs. Murine gammaherpesvirus 68 (MHV68, MuHV-4, γHV68) is a natural pathogen of rodents, and is genetically and pathogenically related to EBV and KSHV, providing a highly tractable model for studies of gammaherpesvirus biology and pathogenesis. Through the integrated use of parallel data sets from multiple sequencing platforms, we previously resolved transcripts throughout the MHV68 genome, including at least 144 novel transcript isoforms. Here, we sought to molecularly validate novel transcripts identified within the M3/M2 locus, which harbors genes that code for the chemokine binding protein M3, the latency B cell signaling protein M2, and 10 microRNAs (miRNAs). Using strand-specific northern blots, we validated the presence of M3-04, a 3.91 kb polyadenylated transcript that initiates at the M3 transcription start site and reads through the M3 open reading frame (ORF), the M3 poly(A) signal sequence, and the M2 ORF. This unexpected transcript was solely localized to the nucleus, strongly suggesting that it is not translated and instead may function as a lncRNA. Use of an MHV68 mutant lacking two M3-04-antisense pre-miRNA stem loops resulted in highly increased expression of M3-04 and increased virus replication in the lungs of infected mice, demonstrating a key role for these RNAs in regulation of lytic infection.
Together these findings suggest the possibility of a tripartite regulatory relationship between the lncRNA M3-04, antisense miRNAs, and the latency gene M2.
Identification, expression, alternative splicing and functional analysis of pepper WRKY gene family in response to biotic and abiotic stresses.
WRKY proteins are a large group of plant transcription factors that are involved in various biological processes, including biotic and abiotic stress responses, hormone response, plant development, and metabolism. WRKY proteins have been identified in several plants, but only a few have been identified in Capsicum annuum. Here, we identified a total of 62 WRKY genes in the latest pepper genome. These genes were classified into three groups (Groups 1-3) based on the structural features of their proteins. The structures of the encoded proteins, evolution, and expression under normal growth conditions were analyzed, and 35 putative miRNA target sites were predicted in 20 CaWRKY genes. Moreover, the responses of selected WRKY genes to cold or CMV treatment were examined to validate their roles under stress, and alternative splicing (AS) events of some CaWRKYs were also identified under CMV infection. Promoter analysis confirmed that CaWRKY genes are involved in growth, development, and biotic or abiotic stress responses in hot pepper. The comprehensive analysis provides fundamental information for better understanding of the signaling pathways involved in the WRKY-mediated regulation of developmental processes, as well as biotic and abiotic stress responses.
Comprehensive transcriptome analysis reveals genes potentially involved in isoflavone biosynthesis in Pueraria thomsonii Benth.
Pueraria thomsonii Benth is an important medicinal plant. Transcriptome sequencing, unigene assembly, the annotation of transcripts and the study of gene expression profiles play vital roles in gene function research. However, the full-length transcriptome of P. thomsonii remains unknown. Here, we obtained 44,339 nonredundant transcripts of P. thomsonii by using the PacBio RS II Isoform and Illumina sequencing platforms, of which 43,195 were annotated genes. Compared with the expression levels in the plant roots, those of transcripts with a |fold change| ≥ 4 and FDR < 0.01 in the leaves or stems were assigned as differentially expressed transcripts (DETs). In total, we found 9,225 DETs, 32 of which came from structural genes that were potentially involved in isoflavone biosynthesis. The expression profiles of 8 structural genes from the RNA-Seq data were validated by qRT-PCR. We identified 437 transcription factors (TFs) that were positively or negatively correlated with at least 1 of the structural genes involved in isoflavone biosynthesis using Pearson correlation coefficients (r) (r > 0.8 or r < -0.8). We also identified a total of 32 microRNAs (miRNAs), which targeted 805 transcripts. These miRNAs caused enriched function in 'ATP binding', 'defense response', 'ADP binding', and 'signal transduction'. Interestingly, MIR156a potentially promoted isoflavone biosynthesis by repressing SBP, and MIR319 promoted isoflavone biosynthesis by repressing TCP and HB-HD-ZIP. Finally, we identified 2,690 alternative splicing events, including that of the structural genes of trans-cinnamate 4-monooxygenase and pullulanase, which are potentially involved in the biosynthesis of isoflavone and starch, respectively, and of three TFs potentially involved in isoflavone biosynthesis. Together, these results provide us with comprehensive insight into the gene expression and regulation of P. thomsonii.
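The abstract above selects transcription factors whose expression is correlated with isoflavone structural genes using Pearson correlation coefficients (r > 0.8 or r < -0.8). The following is a minimal sketch of that kind of filter in plain Python, not the authors' pipeline: the function names, the TF and gene labels, and the expression values are all hypothetical, and a real analysis would run on normalised expression values across many more samples.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(tf_expr, gene_expr, threshold=0.8):
    """Return (tf, gene, r) for every pair with r > threshold or r < -threshold."""
    hits = []
    for tf_name, tf_values in tf_expr.items():
        for gene_name, gene_values in gene_expr.items():
            r = pearson_r(tf_values, gene_values)
            if r > threshold or r < -threshold:
                hits.append((tf_name, gene_name, r))
    return hits


# Hypothetical expression profiles across four samples.
tf_expr = {
    "TF_a": [1.0, 2.0, 3.0, 4.0],   # rises with the structural gene
    "TF_b": [4.0, 3.0, 2.0, 1.0],   # falls as the structural gene rises
    "TF_c": [1.0, 3.0, 2.0, 3.0],   # only weakly related
}
gene_expr = {"structural_gene_1": [1.0, 2.0, 3.0, 4.0]}

hits = correlated_pairs(tf_expr, gene_expr)
```

In this toy data, TF_a and TF_b pass the filter as positively and negatively correlated candidates, while TF_c does not reach the threshold.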
A High-Quality Grapevine Downy Mildew Genome Assembly Reveals Rapidly Evolving and Lineage-Specific Putative Host Adaptation Genes.
Downy mildews are obligate biotrophic oomycete pathogens that cause devastating plant diseases on economically important crops. Plasmopara viticola is the causal agent of grapevine downy mildew, a major disease in vineyards worldwide. We sequenced the genome of Pl. viticola with PacBio long reads and obtained a new 92.94 Mb assembly with high contiguity (359 scaffolds for an N50 of 706.5 kb) due to a better resolution of repeat regions. This assembly presented a high level of gene completeness, recovering 1,592 genes encoding secreted proteins involved in plant-pathogen interactions. Plasmopara viticola had a two-speed genome architecture, with secreted protein-encoding genes preferentially located in gene-sparse, repeat-rich regions and evolving rapidly, as indicated by pairwise dN/dS values. We also used short reads to assemble the genome of Plasmopara muralis, a closely related species infecting grape ivy (Parthenocissus tricuspidata). The lineage-specific proteins identified by comparative genomics analysis included a large proportion of RxLR cytoplasmic effectors and, more generally, genes with high dN/dS values. We identified 270 candidate genes under positive selection, including several genes encoding transporters and components of the RNA machinery potentially involved in host specialization. Finally, the Pl. viticola genome assembly generated here will allow the development of robust population genomics approaches for investigating the mechanisms involved in adaptation to biotic and abiotic selective pressures in this species. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.