Haplotype phasing of genetic variants is important for interpretation of the maize genome, population genetic analysis, and functional genomic analysis of allelic activity. Accordingly, accurate methods for phasing full-length isoforms are essential for functional genomics study. In this study, we performed an isoform-level phasing study in maize, using two inbred lines and their reciprocal crosses, based on single-molecule full-length cDNA sequencing. To phase and analyze full-length transcripts between hybrids and parents, we developed a tool called IsoPhase. Using this tool, we validated the majority of SNPs called against matching short read data and identified cases of allele-specific, gene-level, and isoform-level expression. Our results revealed that maize parental and hybrid lines exhibit different splicing activities. After phasing 6,847 genes in two reciprocal hybrids using embryo, endosperm and root tissues, we annotated the SNPs and identified large-effect genes. In addition, based on single-molecule sequencing, we identified parent-of-origin isoforms in maize hybrids, different novel isoforms between maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase power and accuracy in studies of allelic expression.
Transcriptome Profiling Provides Insight into the Genes in Carotenoid Biosynthesis during the Mesocarp and Seed Developmental Stages of Avocado (Persea americana).
Avocado (Persea americana Mill.) is an economically important crop because of its high nutritional value. However, the absence of a sequenced avocado reference genome has hindered investigations of secondary metabolism. Using next-generation high-throughput transcriptome sequencing, we obtained 365,615,152 and 348,623,402 clean reads as well as 109.13 and 104.10 Gb of sequencing data for avocado mesocarp and seed, respectively, during five developmental stages. High-quality reads were assembled into 100,837 unigenes with an average length of 847.40 bp (N50 = 1725 bp). Additionally, 16,903 differentially expressed genes (DEGs) were detected, 17 of which were related to carotenoid biosynthesis. The expression levels of most of these 17 DEGs were higher in the mesocarp than in the seed during five developmental stages. In this study, the avocado mesocarp and seed transcriptomes were also sequenced using single-molecule long-read sequencing to acquire 25.79 and 17.67 Gb of clean data, respectively. We identified 233,014 and 238,219 consensus isoforms in avocado mesocarp and seed, respectively. Furthermore, 104 and 59 isoforms were found to correspond to the 11 putative carotenoid biosynthesis-related genes in the avocado mesocarp and seed, respectively. The isoform numbers of 10 of the 11 putative genes involved in the carotenoid biosynthetic pathway were higher in the mesocarp than in the seed. In addition, alpha- and beta-carotene contents in the avocado mesocarp and seed during five developmental stages were measured, and they were higher in the mesocarp than in the seed, which validated the results of transcriptome profiling. Gene expression changes and the associated variations in gene dosage could influence carotenoid biosynthesis. These results will help to further elucidate carotenoid biosynthesis in avocado.
Dynamic Changes in Metabolite Accumulation and the Transcriptome during Leaf Growth and Development in Eucommia ulmoides.
Eucommia ulmoides Oliver is widely distributed in China. This species has been used mainly in medicine due to the high concentration of chlorogenic acid (CGA), flavonoids, lignans, and other compounds in the leaves and barks. However, the categories of metabolites, dynamic changes in metabolite accumulation and overall molecular mechanisms involved in metabolite biosynthesis during E. ulmoides leaf growth and development remain unknown. Here, a total of 515 analytes, including 127 flavonoids, 46 organic acids, 44 amino acid derivatives, 9 phenolamides, and 16 vitamins, were identified from four E. ulmoides samples using ultraperformance liquid chromatography-mass spectrometry (UPLC-MS) (for widely targeted metabolites). The accumulation of most flavonoids peaked in growing leaves, followed by old leaves. UPLC-MS analysis indicated that CGA accumulation increased steadily to a high concentration during leaf growth and development, and rutin showed a high accumulation level in leaf buds and growing leaves. Based on single-molecule long-read sequencing technology, 69,020 transcripts and 2880 novel loci were identified in E. ulmoides. Expression analysis indicated that isoforms in the flavonoid biosynthetic pathway and flavonoid metabolic pathway were highly expressed in growing leaves and old leaves. Co-expression network analysis suggested a potential direct link between the flavonoid and phenylpropanoid biosynthetic pathways via the regulation of transcription factors, including MYB (v-myb avian myeloblastosis viral oncogene homolog) and bHLH (basic/helix-loop-helix). Our study predicts dynamic metabolic models during leaf growth and development and will support further molecular biological studies of metabolite biosynthesis in E. ulmoides. In addition, our results significantly improve the annotation of the E. ulmoides genome.
The damage caused by Bradysia odoriphaga is the main factor threatening the production of vegetables in the Liliaceae family. However, few genetic studies of B. odoriphaga have been conducted because of a lack of genomic resources. Many long-read sequencing technologies have been developed in the last decade; therefore, in this study, the transcriptome covering all developmental stages of B. odoriphaga was sequenced for the first time by Pacific Biosciences (PacBio) single-molecule long-read sequencing. Here, 39,129 isoforms were generated, and 35,645 were annotated when checked against sequences available in different databases. Overall, 18,473 isoforms were distributed among 25 Clusters of Orthologous Groups categories, and 11,880 isoforms were categorized into 60 functional groups that belonged to the three main Gene Ontology classifications. Moreover, 30,610 isoforms were assigned to 44 functional categories belonging to six main Kyoto Encyclopedia of Genes and Genomes functional categories. Coding DNA sequence (CDS) prediction showed that 36,419 out of 39,129 isoforms were predicted to have a CDS, and 4319 simple sequence repeats were detected in total. Finally, 266 insecticide resistance and metabolism-related isoforms were identified as candidate genes for further investigation of insecticide resistance and metabolism in B. odoriphaga.
Genome assembly and annotation of the Trichoplusia ni Tni-FNL insect cell line enabled by long-read technologies.
Trichoplusia ni-derived cell lines are commonly used to enable recombinant protein expression via baculovirus infection to generate materials approved for clinical use and in clinical trials. In order to develop systems biology and genome engineering tools to improve protein expression in this host, we performed de novo genome assembly of the Trichoplusia ni-derived cell line Tni-FNL. By integration of PacBio single-molecule sequencing, Bionano optical mapping, and 10X Genomics linked-reads data, we have produced a draft genome assembly of Tni-FNL. Our assembly contains 280 scaffolds, with an N50 scaffold size of 2.3 Mb and a total length of 359 Mb. Annotation of the Tni-FNL genome resulted in 14,101 predicted genes, and 93.2% of the predicted proteome contained recognizable protein domains. Ortholog searches within the superorder Holometabola provided further evidence of high accuracy and completeness of the Tni-FNL genome assembly. This first draft Tni-FNL genome assembly was enabled by complementary long-read technologies and represents a high-quality, well-annotated genome that provides novel insight into the complexity of this insect cell line and can serve as a reference for future large-scale genome engineering work in this and other similar recombinant protein production hosts.
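As an illustration of the N50 statistic quoted for the assembly above, the following is a minimal sketch of how N50 is conventionally computed from a list of scaffold (or transcript) lengths. The function name `n50` and the toy lengths are hypothetical; nothing here is taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths, not data from any study above).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Because the statistic depends only on the length distribution, the same function applies to scaffold N50 values and to transcript N50 values.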
Rapid antigen diversification through mitotic recombination in the human malaria parasite Plasmodium falciparum.
Malaria parasites possess the remarkable ability to maintain chronic infections that fail to elicit a protective immune response, characteristics that have stymied vaccine development and cause people living in endemic regions to remain at risk of malaria despite previous exposure to the disease. These traits stem from the tremendous antigenic diversity displayed by parasites circulating in the field. For Plasmodium falciparum, the most virulent of the human malaria parasites, this diversity is exemplified by the variant gene family called var, which encodes the major surface antigen displayed on infected red blood cells (RBCs). This gene family exhibits virtually limitless diversity when var gene repertoires from different parasite isolates are compared. Previous studies indicated that this remarkable genome plasticity results from extensive ectopic recombination between var genes during mitotic replication; however, the molecular mechanisms that direct this process to antigen-encoding loci while the rest of the genome remains relatively stable were not determined. Using targeted DNA double-strand breaks (DSBs) and long-read whole-genome sequencing, we show that a single break within an antigen-encoding region of the genome can result in a cascade of recombination events leading to the generation of multiple chimeric var genes, a process that can greatly accelerate the generation of diversity within this family. We also found that recombinations did not occur randomly, but rather high-probability, specific recombination products were observed repeatedly. These results provide a molecular basis for previously described structured rearrangements that drive diversification of this highly polymorphic gene family.
Parallels between natural selection in the cold-adapted crop-wild relative Tripsacum dactyloides and artificial selection in temperate adapted maize.
Artificial selection has produced varieties of domesticated maize that thrive in temperate climates around the world. However, the direct progenitor of maize, teosinte, is indigenous only to a relatively small range of tropical and subtropical latitudes and grows poorly or not at all outside of this region. Tripsacum, a sister genus to maize and teosinte, is naturally endemic to the majority of areas in the western hemisphere where maize is cultivated. A full-length reference transcriptome for Tripsacum dactyloides generated using long-read Iso-Seq data was used to characterize independent adaptation to temperate climates in this clade. Genes related to phospholipid biosynthesis, a critical component of cold acclimation in other cold-adapted plant lineages, were enriched among those genes experiencing more rapid rates of protein sequence evolution in T. dactyloides. In contrast with previous studies of parallel selection, we find that there is a significant overlap between the genes that were targets of artificial selection during the adaptation of maize to temperate climates and those that were targets of natural selection in temperate-adapted T. dactyloides. Genes related to growth, development, response to stimulus, signaling, and organelles were enriched in the set of genes identified as both targets of natural and artificial selection. © 2019 The Authors The Plant Journal © 2019 John Wiley & Sons Ltd.
Characterization and analysis of the transcriptome in Gymnocypris selincuoensis on the Qinghai-Tibetan Plateau using single-molecule long-read sequencing and RNA-seq.
The lakes on the Qinghai-Tibet Plateau (QTP) are the largest and highest lake group in the world. Gymnocypris selincuoensis is the only cyprinid fish living in Lake Selincuo, the largest lake on the QTP. However, genetic resources for this species are still lacking, which limits molecular and genetic studies. In this study, the transcriptome of G. selincuoensis was generated for the first time by using PacBio Iso-Seq and Illumina RNA-seq. A full-length (FL) transcriptome with 75,435 transcripts was obtained by Iso-Seq with an N50 length of 3,870 bp. Among all transcripts, 75,016 were annotated to public databases, 64,710 contained complete open reading frames and 2,811 were long non-coding RNAs. Based on all-vs.-all BLAST, 2,069 alternative splicing events were detected, and 80% of them were validated by reverse transcription polymerase chain reaction (RT-PCR). A tissue gene expression atlas showed that the number of detected expressed transcripts ranged from 37,397 in brain to 19,914 in muscle, with 10,488 transcripts detected in all seven tissues. Comparative genomic analysis with other cyprinid fishes identified 77 orthologous genes with potential positive selection (Ka/Ks > 0.3). A total of 56,696 perfect simple sequence repeats were identified from FL transcripts. Our results provide valuable genetic resources for further studies on adaptive evolution, gene expression and population genetics in G. selincuoensis and other congeneric fishes. © The Author(s) 2019. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
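The abstract above reports perfect simple sequence repeats (SSRs) but does not state the detection criteria. The sketch below shows one common way such repeats can be located with a regular expression. The function name `find_perfect_ssrs`, the motif size limit of 1 to 6 bp, and the minimum of 5 repeat units are all assumptions chosen for illustration, not the thresholds used in the study.

```python
import re


def find_perfect_ssrs(seq, min_repeats=5, max_motif=6):
    """Find perfect tandem repeats of short motifs in a DNA sequence.

    Returns a list of (start, motif, repeat_count) tuples. The thresholds
    are illustrative assumptions, not values from any study.
    """
    # Lazy motif capture so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further exact copies.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, count))
    return hits


# Toy sequence (made up): an (AT)5 repeat flanked by other bases.
print(find_perfect_ssrs("GGATATATATATCC"))
```

Dedicated SSR tools usually apply a different minimum repeat count per motif length; a single threshold is used here only to keep the sketch short.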
Full-length transcriptome analysis of Litopenaeus vannamei reveals transcript variants involved in the innate immune system.
To better understand the immune system of shrimp, this study combined PacBio isoform sequencing (Iso-Seq) and Illumina paired-end short-read sequencing to discover full-length immune-related molecules of the Pacific white shrimp, Litopenaeus vannamei. A total of 72,648 nonredundant full-length transcripts (unigenes) were generated with an average length of 2545 bp from five main tissues, including the hepatopancreas, cardiac stomach, heart, muscle, and pyloric stomach. These unigenes exhibited a high annotation rate (62,164, 85.57%) when compared against the NR, NT, Swiss-Prot, Pfam, GO, KEGG and COG databases. A total of 7544 putative long noncoding RNAs (lncRNAs) were detected, and 1164 nonredundant full-length transcripts (449 UniTransModels) participated in alternative splicing (AS) events. Importantly, a total of 5279 nonredundant full-length unigenes involved in the innate immune system were successfully identified, covering 9 immune-related processes, 19 immune-related pathways and 10 other immune-related systems. We also found extensive transcript variants, which increased the number and functional complexity of immune molecules, for example toll-like receptors (TLRs) and interferon regulatory factors (IRFs). A total of 480 differentially expressed genes (DEGs) showed significantly higher or tissue-specific expression in the hepatopancreas compared with the other four tested tissues (FDR <0.05). Furthermore, the expression levels of six selected immune-related DEGs and putative IRFs were validated using real-time PCR, substantiating the reliability of the PacBio Iso-Seq results. In conclusion, our results provide new genetic resources of long-read full-length transcript data and information for identifying immune-related genes, which are an invaluable transcriptomic resource as a genomic reference, especially for further exploration of the innate immune and defense mechanisms of shrimp. Copyright © 2019 Elsevier Ltd. All rights reserved.
Heterochromatin-enriched assemblies reveal the sequence and organization of the Drosophila melanogaster Y chromosome.
Heterochromatic regions of the genome are repeat-rich and poor in protein-coding genes, and are therefore underrepresented in even the best genome assemblies. Sex-limited chromosomes are among the most difficult regions of the genome to assemble. The Drosophila melanogaster Y chromosome is entirely heterochromatic, yet has wide-ranging effects on male fertility, fitness, and genome-wide gene expression. The genetic basis of this phenotypic variation is difficult to study, in part because we do not know the detailed organization of the Y chromosome. To study Y chromosome organization in D. melanogaster, we developed an assembly strategy involving the in silico enrichment of heterochromatic long single-molecule reads and used these reads to create targeted de novo assemblies of heterochromatic sequences. We assigned contigs to the Y chromosome using Illumina reads to identify male-specific sequences. Our pipeline extends the D. melanogaster reference genome by 11.9 Mb, closes 43.8% of the gaps, and improves overall contiguity. The addition of 10.6 Mb of Y-linked sequence permitted us to study the organization of repeats and genes along the Y chromosome. We detected a high rate of duplication to the pericentric regions of the Y chromosome from other regions in the genome. Most of these duplicated genes exist in multiple copies. We detail the evolutionary history of one sex-linked gene family, crystal-Stellate. While the Y chromosome does not undergo crossing over, we observed high gene conversion rates within and between members of the crystal-Stellate gene family, Su(Ste), and PCKR, compared to genome-wide estimates. Our results suggest that gene conversion and gene duplication play an important role in the evolution of Y-linked genes. Copyright © 2019 Chang and Larracuente.
Hybrid sequencing-based personal full-length transcriptomic analysis implicates proteostatic stress in metastatic ovarian cancer.
Comprehensive molecular characterization of myriad somatic alterations and aberrant gene expression at the personal level is key to precision cancer therapy. However, because of the limits of current short-read sequencing technology, an individualized catalog of complete genomic and transcriptomic features has thus far been elusive. Here, we integrated second- and third-generation sequencing platforms to generate a multidimensional dataset on a patient affected by metastatic epithelial ovarian cancer. Whole-genome and hybrid transcriptome dissection captured global genetic and transcriptional variants at previously unparalleled resolution. In particular, single-molecule mRNA sequencing identified a vast array of unannotated transcripts, novel long noncoding RNAs and gene chimeras, permitting accurate determination of transcription start, splice, polyadenylation and fusion sites. Phylogenetic and enrichment inference of isoform-level measurements implicated early functional divergence and cytosolic proteostatic stress in shaping ovarian tumorigenesis. A complementary imaging-based high-throughput drug screen was performed and subsequently validated, which consistently pinpointed proteasome inhibitors as an effective therapeutic regimen that acts by inducing protein aggregates in ovarian cancer cells. Therefore, our study suggests that clinical application of the emerging long-read full-length analysis for improving molecular diagnostics is feasible and informative. An in-depth understanding of tumor transcriptome complexity, made possible by the hybrid sequencing approach, lays the basis for revealing novel and valid therapeutic vulnerabilities in advanced ovarian malignancies.
PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice.
In eukaryotes, alternative splicing (AS) greatly expands the diversity of transcripts. However, it is challenging to accurately determine full-length splicing isoforms. Recently, more studies have taken advantage of Pacific Biosciences (PacBio) long-read sequencing to identify full-length transcripts. Nevertheless, the high error rate of PacBio reads seriously offsets the advantages of long reads, especially for accurately identifying splicing junctions. To best capitalize on the features of long reads, we used Illumina RNA-seq reads to improve PacBio circular consensus sequence (CCS) quality and to validate splicing patterns in the rice transcriptome. We evaluated the impact of CCS accuracy on the number and the validation rate of splicing isoforms, and integrated a comprehensive pipeline of splicing transcripts analysis by Iso-Seq and RNA-seq (STAIR) to identify the full-length multi-exon isoforms in the rice seedling transcriptome (Oryza sativa L. ssp. japonica). STAIR discovered 11 733 full-length multi-exon isoforms, 6599 more than the SMRT Portal RS_IsoSeq pipeline did. Of the splicing isoforms identified, 4453 (37.9%) were missed in assembled transcripts from RNA-seq reads, and 5204 (44.4%), including 268 multi-exon long non-coding RNAs (lncRNAs), were not reported in the MSU_osa1r7 annotation. Some randomly selected unreported splicing junctions were verified by polymerase chain reaction (PCR) amplification. In addition, we investigated alternative polyadenylation (APA) events in transcripts and identified 829 major polyadenylation [poly(A)] site clusters (PACs). The analysis of splicing isoforms and APA events will facilitate the annotation of the rice genome and studies on the expression and polyadenylation of AS genes in different developmental stages or growth conditions of rice. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.
Analysis of transcripts and splice isoforms in Medicago sativa L. by single-molecule long-read sequencing.
The full-length transcriptome of alfalfa was analyzed with PacBio single-molecule long-read sequencing technology. The transcriptome data provided full-length sequences and gene isoforms of transcripts in alfalfa, which will improve genome annotation and enhance our understanding of the gene structure of alfalfa. Alfalfa (Medicago sativa L.) is an important forage crop that is planted worldwide. Because of its genome complexity and the lack of a finished whole-genome sequence, the sequences and complete structure of mRNA transcripts remain unclear in alfalfa. In this study, single-molecule long-read sequencing was applied to investigate the alfalfa transcriptome using the Pacific Biosciences platform, and a total of 113,321 transcripts were obtained from young, mature and senescent leaves. We identified 72,606 open reading frames (ORFs), including 46,616 full-length ORFs; 1670 transcription factors (TFs) from 54 TF families; and 44,040 simple sequence repeats from 30,797 sequences. A total of 7568 alternative splicing events were identified, and the majority of alternative splicing events in alfalfa were intron retention. In addition, we identified 17,740 long non-coding RNAs. Our results show the feasibility of deep sequencing of full-length RNA from the alfalfa transcriptome at the single-molecule level.
The Populus shoot undergoes primary growth (longitudinal growth) followed by secondary growth (radial growth), which produces biomass that is an important source of energy worldwide. We adopted joint PacBio Iso-Seq and RNA-seq analysis to identify differentially expressed transcripts along a developmental gradient from the shoot apex to the fifth internode of Populus Nanlin895. We obtained 87 150 full-length transcripts, including 2081 new isoforms and 62 058 new alternatively spliced isoforms, most of which were produced by intron retention, that were used to update the Populus annotation. Among these novel isoforms, there are 1187 long non-coding RNAs and 356 fusion genes. Using this annotation, we found 15 838 differentially expressed transcripts along the shoot developmental gradient, of which 1216 were transcription factors (TFs). Only a few of these genes were reported previously. The differential expression of these TFs suggests that they may play important roles in primary and secondary growth. AP2, ARF, YABBY and GRF TFs are highly expressed in the apex, whereas NAC, bZIP, PLATZ and HSF TFs are likely to be important for secondary growth. Overall, our findings provide evidence that long-read sequencing can complement short-read sequencing for cataloguing and quantifying eukaryotic transcripts and increase our understanding of the vital and dynamic process of shoot development. © 2018 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Mutation and recombination are key evolutionary processes governing phenotypic variation and reproductive isolation. We here demonstrate that biodiversity within all globally known strains of Schizosaccharomyces pombe arose through admixture between two divergent ancestral lineages. Initial hybridization was inferred to have occurred ~20-60 sexual outcrossing generations ago consistent with recent, human-induced migration at the onset of intensified transcontinental trade. Species-wide heritable phenotypic variation was explained near-exclusively by strain-specific arrangements of alternating ancestry components with evidence for transgressive segregation. Reproductive compatibility between strains was likewise predicted by the degree of shared ancestry. To assess the genetic determinants of ancestry block distribution across the genome, we characterized the type, frequency, and position of structural genomic variation using nanopore and single-molecule real-time sequencing. Despite being associated with double-strand break initiation points, over 800 segregating structural variants exerted overall little influence on the introgression landscape or on reproductive compatibility between strains. In contrast, we found strong ancestry disequilibrium consistent with negative epistatic selection shaping genomic ancestry combinations during the course of hybridization. This study provides a detailed, experimentally tractable example that genomes of natural populations are mosaics reflecting different evolutionary histories. Exploiting genome-wide heterogeneity in the history of ancestral recombination and lineage-specific mutations sheds new light on the population history of S. pombe and highlights the importance of hybridization as a creative force in generating biodiversity. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.