While advances in RNA sequencing methods have accelerated our understanding of the human transcriptome, isoform discovery remains a challenge because short read lengths require complicated assembly algorithms to infer the contiguity of full-length transcripts. With PacBio’s long reads, one can now sequence full-length transcript isoforms up to 10 kb. The PacBio Iso-Seq protocol produces reads that originate from independent observations of single molecules, meaning no assembly is needed. Here, we sequenced the transcriptome of the human MCF-7 breast cancer cell line using the Clontech SMARTer® cDNA preparation kit and the PacBio RS II. Using PacBio Iso-Seq bioinformatics software, we obtained 55,770 unique, full-length, high-quality transcript sequences that were subsequently mapped back to the human genome with ≥99% accuracy. In addition, we identified both known and novel fusion transcripts. To assess our results, we compared the predicted ORFs from the PacBio data against a published mass spectrometry dataset from the same cell line. 84% of the proteins identified with the UniProt protein database were recovered by the PacBio predictions. Notably, 251 peptides solely matched to the PacBio generated ORFs and were entirely novel, including abundant cases of single amino acid polymorphisms, cassette exon splicing and potential alternative protein coding frames.
Making the most of long reads: towards efficient assemblers for reference quality, de novo reconstructions
2015 SMRT Informatics Developers Conference Presentation Slides: Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute delivered the keynote presentation. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.
Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we sequenced a diversity panel of nine human genomes to high depth using long-read, single-molecule, real-time sequencing data. Systematically identifying and merging SVs ≥50 bp in length for these nine and one public genome yielded 83,909 sequence-resolved insertions, deletions, and inversions. Among these, 2,839 (2.0 Mbp) are shared among all discovery genomes with an additional 13,349 (6.9 Mbp) present in the majority of humans, indicating minor alleles or errors in the reference, which is partially explained by an enrichment for GC-content and repetitive DNA. Genotyping 83% of these in 290 additional genomes confirms that at least one allele of the most common SVs in unique euchromatin are now sequence-resolved. We observe a 9-fold increase within 5 Mbp of chromosome telomeric ends and correlation with de novo single-nucleotide variant mutations showing that such variation is nonrandomly distributed defining potential hotspots of mutation. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. To illustrate the utility of sequence-resolved SVs in resequencing experiments, we mapped 30 diverse high-coverage Illumina-sequenced samples to GRCh38 with and without contigs containing SV insertions as alternate sequences, and we found these additional sequences recover 6.4% of unmapped reads. For reads mapped within the SV insertion, 25.7% have a better mapping quality, and 18.7% improved by 1,000-fold or more.
We reveal 72,964 occurrences of 15,814 unique variants that were not discoverable with the reference sequence alone, and we note that 7% of the insertions contain an SV in at least one sample indicating that there are additional alleles in the population that remain to be discovered. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. We present a summary of our findings and discuss ideas for revealing variation that was once difficult to ascertain.
This systems biology animation depicts the type of connectivity that exists at multiple scales in a living system. Starting at the molecular level, interactions between DNA (red cubes), RNA (blue…
This documentary film features the wave of cutting-edge technologies that now provide the opportunity to create predictive models of living systems, and gain wisdom about the fundamental nature of life…
Part I of The New Biology documentary. This documentary film features the wave of cutting-edge technologies that now provide the opportunity to create predictive models of living systems, and gain…
In this PacBio User Group Meeting presentation, PacBio scientist Kristin Mars speaks about recent updates, such as the single-day library prep that’s now possible with the Iso-Seq Express workflow. She…
Domestication of clonally propagated crops such as pineapple from South America was hypothesized to be a ‘one-step operation’. We sequenced the genome of Ananas comosus var. bracteatus CB5 and assembled 513 Mb into 25 chromosomes with 29,412 genes. Comparison of the genomes of CB5, F153 and MD2 elucidated the genomic basis of fiber production, color formation, sugar accumulation and fruit maturation. We also resequenced 89 Ananas genomes. Cultivars ‘Smooth Cayenne’ and ‘Queen’ exhibited ancient and recent admixture, while ‘Singapore Spanish’ supported a one-step operation of domestication. We identified 25 selective sweeps, including a strong sweep containing a pair of tandemly duplicated bromelain inhibitors. Four candidate genes for self-incompatibility were linked in F153, but were not functional in self-compatible CB5. Our findings support the coexistence of sexual recombination and a one-step operation in the domestication of clonally propagated crops. This work guides the exploration of sexual and asexual domestication trajectories in other clonally propagated crops.
Chlorella vulgaris genome assembly and annotation reveals the molecular basis for metabolic acclimation to high light conditions.
Chlorella vulgaris is a fast-growing fresh-water microalga cultivated at the industrial scale for applications ranging from food to biofuel production. To advance our understanding of its biology and to establish genetics tools for biotechnological manipulation, we sequenced the nuclear and organelle genomes of Chlorella vulgaris 211/11P by combining next generation sequencing and optical mapping of isolated DNA molecules. This hybrid approach allowed us to assemble the nuclear genome into 14 pseudo-molecules with an N50 of 2.8 Mb and 98.9% of scaffolded genome. The integration of RNA-seq data obtained at two different irradiances of growth (high light, HL, versus low light, LL) enabled the identification of 10,724 nuclear genes, coding for 11,082 transcripts. Moreover, 121 and 48 genes were found in the chloroplast and mitochondrial genomes, respectively. Functional annotation and expression analysis of nuclear, chloroplast and mitochondrial genome sequences revealed peculiar features of Chlorella vulgaris. Evidence of horizontal gene transfers from chloroplast to mitochondrial genome was observed. Furthermore, comparative transcriptomic analyses of LL vs. HL provide insights into the molecular basis for metabolic rearrangement in HL vs. LL conditions leading to enhanced de novo fatty acid biosynthesis and triacylglycerol accumulation. The occurrence of a cytosolic fatty acid biosynthetic pathway can be predicted and its upregulation upon HL exposure is observed, consistent with increased lipid amount under HL. These data provide a rich genetic resource for future genome editing studies, and potential targets for biotechnological manipulation of Chlorella vulgaris or other microalgae species to improve biomass and lipid productivity.
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
Genome data of Fusarium oxysporum f. sp. cubense race 1 and tropical race 4 isolates using long-read sequencing.
Fusarium wilt of banana is caused by the soil-borne fungal pathogen Fusarium oxysporum f. sp. cubense (Foc). We generated two chromosome-level assemblies of Foc race 1 and tropical race 4 strains using single-molecule real-time sequencing. The Foc1 and FocTR4 assemblies had 35 and 29 contigs with contig N50 lengths of 2.08 Mb and 4.28 Mb, respectively. These two new reference genomes represent a greater than 100-fold improvement over the contig N50 statistics of the previous short read-based Foc assemblies. The two high-quality assemblies reported here will be a valuable resource for the comparative analysis of Foc races at the pathogenic levels.
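For readers unfamiliar with the contig N50 statistic quoted in several of these abstracts, the following is a minimal sketch of the usual definition: the length of the contig at which the cumulative length, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions, not values or code from any of the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches at least
    half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point halves.
        if 2 * running >= total:
            return length


# Hypothetical toy lengths in bp, not taken from the Foc assemblies.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

Under this definition, a smaller number of longer contigs raises the N50, which is why the statistic is commonly used to compare the contiguity of assemblies of the same genome.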
Full-length mRNA sequencing and gene expression profiling reveal broad involvement of natural antisense transcript gene pairs in pepper development and response to stresses.
Pepper is an important vegetable with great economic value and unique biological features. In the past few years, significant development has been made towards understanding the huge complex pepper genome; however, pepper functional genomics has not been well studied. To better understand the pepper gene structure and pepper gene regulation, we conducted full-length mRNA sequencing by PacBio sequencing and obtained 57862 high-quality full-length mRNA sequences derived from 18362 previously annotated and 5769 newly detected genes. New gene models were built that combined the full-length mRNA sequences and corrected approximately 500 fragmented gene models from previous annotations. Based on the full-length mRNA, we identified 4114 and 5880 pepper genes forming natural antisense transcript (NAT) genes in-cis and in-trans, respectively. Most of these genes accumulate small RNAs in their overlapping regions. By analyzing these NAT gene expression patterns in our transcriptome data, we identified many NAT pairs responsive to a variety of biological processes in pepper. Pepper formate dehydrogenase 1 (FDH1), which is required for R-gene-mediated disease resistance, may be regulated by nat-siRNAs and participate in a positive feedback loop in salicylic acid biosynthesis during resistance responses. Several cis-NAT pairs and subgroups of trans-NAT genes were responsive to pepper pericarp and placenta development, which may play roles in capsanthin and capsaicin biosynthesis. Using a comparative genomics approach, the evolutionary mechanisms of cis-NATs were investigated, and we found that an increase in intergenic sequences accounted for the loss of most cis-NATs, while transposon insertion contributed to the formation of most new cis-NATs.
Supernumerary B chromosomes (Bs) are extra karyotype units in addition to A chromosomes, and are found in some fungi and thousands of animal and plant species. Bs are uniquely characterized by their non-Mendelian inheritance, and represent one of the best examples of genomic conflict. Over the last decades, their genetic composition, function and evolution have remained unresolved, although a few successful attempts have been made to address these questions. A classical concept based on cytogenetics and genetics is that Bs are selfish and abundant with DNA repeats and transposons, and in most cases, they do not carry any function. However, the recent rapid development of large-scale multi-omics techniques has shifted B research towards a newborn field that we call “B-omics”. We review the recent literature and add novel perspectives to B research, discussing the role of new technologies in understanding the mechanisms of the molecular evolution and function of Bs. The modern view states that B chromosomes are enriched with genes for many significant biological functions, including but not limited to an interesting set of genes related to cell cycle and chromosome structure. Furthermore, the presence of B chromosomes could favor genomic rearrangements and influence the nuclear environment affecting the function of other chromatin regions. We hypothesize that B chromosomes might play a key function in driving their transmission and maintenance inside the cell, as well as offer an extra genomic compartment for evolution.
Genome assembly and annotation of the Trichoplusia ni Tni-FNL insect cell line enabled by long-read technologies.
Trichoplusia ni-derived cell lines are commonly used to enable recombinant protein expression via baculovirus infection to generate materials approved for clinical use and in clinical trials. In order to develop systems biology and genome engineering tools to improve protein expression in this host, we performed de novo genome assembly of the Trichoplusia ni-derived cell line Tni-FNL. By integration of PacBio single-molecule sequencing, Bionano optical mapping, and 10X Genomics linked-reads data, we have produced a draft genome assembly of Tni-FNL. Our assembly contains 280 scaffolds, with an N50 scaffold size of 2.3 Mb and a total length of 359 Mb. Annotation of the Tni-FNL genome resulted in 14,101 predicted genes and 93.2% of the predicted proteome contained recognizable protein domains. Ortholog searches within the superorder Holometabola provided further evidence of high accuracy and completeness of the Tni-FNL genome assembly. This first draft Tni-FNL genome assembly was enabled by complementary long-read technologies and represents a high-quality, well-annotated genome that provides novel insight into the complexity of this insect cell line and can serve as a reference for future large-scale genome engineering work in this and other similar recombinant protein production hosts.
Epstein-Barr virus (EBV) is a ubiquitous human pathogen associated with Burkitt’s lymphoma and nasopharyngeal carcinoma. Although the EBV genome harbors more than a hundred genes, a full transcription map with EBV polyadenylation profiles remains unknown. To elucidate the 3′ ends of all EBV transcripts genome-wide, we performed the first comprehensive analysis of viral polyadenylation sites (pA sites) using our previously reported polyadenylation sequencing (PA-seq) technology. We identified that EBV utilizes a total of 62 pA sites in JSC-1, 60 in Raji, and 53 in Akata cells for the expression of EBV genes from both plus and minus DNA strands; 42 of these pA sites are commonly used in all three cell lines. The majority of identified pA sites were mapped to the intergenic regions downstream of previously annotated EBV open reading frames (ORFs) and viral promoters. pA sites lacking an association with any known EBV genes were also identified, mostly for the minus DNA strand within the EBNA locus, a major locus responsible for maintenance of viral latency and cell transformation. The expression of these novel antisense transcripts to EBNA was verified by 3′ rapid amplification of cDNA ends (RACE) and Northern blot analyses in several EBV-positive (EBV+) cell lines. In contrast to EBNA RNA expressed during latency, expression of EBNA-antisense transcripts, which is restricted in latent cells, can be significantly induced by viral lytic infection, suggesting potential regulation of viral gene expression by EBNA-antisense transcription during lytic EBV infection. Our data provide the first evidence that EBV has an unrecognized mechanism that regulates EBV reactivation from latency. IMPORTANCE: Epstein-Barr virus represents an important human pathogen with an etiological role in the development of several cancers.
By elucidation of a genome-wide polyadenylation landscape of EBV in JSC-1, Raji, and Akata cells, we have redefined the EBV transcriptome and mapped individual polymerase II (Pol II) transcripts of viral genes to each one of the mapped pA sites at single-nucleotide resolution as well as the depth of expression. By unveiling a new class of viral lytic RNA transcripts antisense to latent EBNAs, we provide a novel mechanism of how EBV might control the expression of viral latent genes and lytic infection. Thus, this report takes another step closer to understanding EBV gene structure and expression and paves a new path for antiviral approaches.