April 21, 2020  |  

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.


April 21, 2020  |  

Updated assembly resource of Phytophthora ramorum Pr102 isolate incorporating long reads from PacBio sequencing.

The NA1 clonal lineage of Phytophthora ramorum is responsible for Sudden Oak Death, an epidemic that has devastated California’s coastal forest ecosystems. An NA1 isolate, Pr102, derived from coast live oak in California was previously sequenced and reported as a 65 Mb assembly containing 12 Mb of gaps in 2576 scaffolds. Here we report an improved 70 Mb genome in 1512 scaffolds with 6752 bp of gaps after incorporating PacBio P5-C3 long reads. This assembly contains 19494 gene models (average gene length 2515 bp) compared to 16134 genes (average gene length of 1673 bp) in the previous version. We predicted 29 new RXLRs and 76 new paralogs out of a total of 392 RXLRs from this assembly. We predicted 35 CRNs compared to 19 in the earlier version with six paralogs. Our lncRNA prediction identified 255 candidates. This new resource will be invaluable for future evolution studies on the invasive plant pathogen.


April 21, 2020  |  

eIF5B gates the transition from translation initiation to elongation.

Translation initiation determines both the quantity and identity of the protein that is encoded in an mRNA by establishing the reading frame for protein synthesis. In eukaryotic cells, numerous translation initiation factors prepare ribosomes for polypeptide synthesis; however, the underlying dynamics of this process remain unclear [1,2]. A central question is how eukaryotic ribosomes transition from translation initiation to elongation. Here we use in vitro single-molecule fluorescence microscopy approaches in a purified yeast Saccharomyces cerevisiae translation system to monitor directly, in real time, the pathways of late translation initiation and the transition to elongation. This transition was slower in our eukaryotic system than that reported for Escherichia coli [3-5]. The slow entry to elongation was defined by a long residence time of eukaryotic initiation factor 5B (eIF5B) on the 80S ribosome after the joining of individual ribosomal subunits, a process that is catalysed by this universally conserved initiation factor. Inhibition of the GTPase activity of eIF5B after the joining of ribosomal subunits prevented the dissociation of eIF5B from the 80S complex, thereby preventing elongation. Our findings illustrate how the dissociation of eIF5B serves as a kinetic checkpoint for the transition from initiation to elongation, and how its release may be governed by a change in the conformation of the ribosome complex that triggers GTP hydrolysis.


April 21, 2020  |  

Development of CRISPR-Cas systems for genome editing and beyond

The development of clustered regularly interspaced short-palindromic repeat (CRISPR)-Cas systems for genome editing has transformed the way life science research is conducted and holds enormous potential for the treatment of disease as well as for many aspects of biotechnology. Here, I provide a personal perspective on the development of CRISPR-Cas9 for genome editing within the broader context of the field and discuss our work to discover novel Cas effectors and develop them into additional molecular tools. The initial demonstration of Cas9-mediated genome editing launched the development of many other technologies, enabled new lines of biological inquiry, and motivated a deeper examination of natural CRISPR-Cas systems, including the discovery of new types of CRISPR-Cas systems. These new discoveries in turn spurred further technological developments. I review these exciting discoveries and technologies as well as provide an overview of the broad array of applications of these technologies in basic research and in the improvement of human health. It is clear that we are only just beginning to unravel the potential within microbial diversity, and it is quite likely that we will continue to discover other exciting phenomena, some of which it may be possible to repurpose as molecular technologies. The transformation of mysterious natural phenomena to powerful tools, however, takes a collective effort to discover, characterize, and engineer them, and it has been a privilege to join the numerous researchers who have contributed to this transformation of CRISPR-Cas systems.


April 21, 2020  |  

Circular consensus sequencing with long reads.

Long-read sequencing technologies have advantages in genome assembly, structural variant detection and haplotype phasing, but are less suited for single-nucleotide variant (SNV) and insertion/deletion (indel) calling due to the high error rate in comparison with short-read sequencing. Wenger et al., from Pacific Biosciences, optimized the circular consensus sequencing (CCS) protocol to achieve long, high-fidelity reads, in which they selected the SMRTbell library with fractions tightly distributed at 15 kb for high-coverage sequencing.


April 21, 2020  |  

Comparative Genomic Analyses Reveal Core-Genome-Wide Genes Under Positive Selection and Major Regulatory Hubs in Outlier Strains of Pseudomonas aeruginosa.

Genomic information for outlier strains of Pseudomonas aeruginosa is scarce when compared with classical strains. We sequenced and constructed the complete genome of an environmental strain, CR1, of P. aeruginosa and performed a comparative genomic analysis. It clustered with the outlier group, hence we scaled up the analyses to understand the differences between environmental and clinical outlier strains. We identified eight new regions of genomic plasticity and a plasmid, pCR1, with a VirB/D4 complex followed by a trimeric auto-transporter that can induce a virulence phenotype in the genome of strain CR1. Virulence genotype analysis revealed that strain CR1 lacked hemolytic phospholipase C and D and three genes for LPS biosynthesis, and had fewer antibiotic resistance genes than clinical strains. Genes belonging to proteases, bacterial exporters and DNA stabilization were found to be under strong positive selection, thus facilitating pathogenicity and survival of the outliers. The outliers had the complete operon for the production of vibrioferrin, a siderophore present in plant growth promoting bacteria. The competence to acquire multidrug resistance and new virulence factors makes these strains a potential threat. However, we identified major regulatory hubs that can be used as drug targets against both the classical and outlier groups.


October 23, 2019  |  

Real-time observation of flexible domain movements in CRISPR-Cas9.

The CRISPR-associated protein Cas9 is widely used for genome editing because it cleaves target DNA through the assistance of a single-guide RNA (sgRNA). Structural studies have revealed the multi-domain architecture of Cas9 and suggested sequential domain movements of Cas9 upon binding to the sgRNA and the target DNA. These studies also hinted at the flexibility between domains; however, it remains unclear whether these flexible movements occur in solution. Here, we directly observed dynamic fluctuations of multiple Cas9 domains, using single-molecule FRET. We found that the flexible domain movements allow Cas9 to adopt transient conformations beyond those captured in the crystal structures. Importantly, the HNH nuclease domain only accessed the DNA cleavage position during such flexible movements, suggesting the importance of this flexibility in the DNA cleavage process. Our FRET data also revealed the conformational flexibility of apo-Cas9, which may play a role in the assembly with the sgRNA. Collectively, our results highlight the potential role of domain fluctuations in driving Cas9-catalyzed DNA cleavage. © 2018 The Authors. Published under the terms of the CC BY NC ND 4.0 license.


October 23, 2019  |  

Identification and expression analysis of chemosensory genes in the citrus fruit fly Bactrocera (Tetradacus) minax

The citrus fruit fly Bactrocera (Tetradacus) minax is a major and devastating agricultural pest in Asian subtropical countries. Previous studies have shown that B. minax interacts with hosts via an efficient chemosensory system. However, the molecular components of the B. minax chemosensory system have not yet been well established. Herein, based on our newly generated whole-genome dataset for B. minax and by comparison with the characterized genomes of 6 other fruit fly species, we identified, for the first time, a total of 25 putative odorant-binding proteins (OBPs), 4 single-copy chemosensory proteins (CSPs) and 53 candidate odorant receptors (ORs). To further survey the expression of these candidate genes, the transcriptomes from three developmental stages (larvae, pupae and adults) of B. minax and Bactrocera dorsalis were analyzed. We found that 1) at the adult developmental stage, there were 14 highly expressed OBPs (FPKM>100) in B. dorsalis and 7 highly expressed OBPs in B. minax; 2) the expression of CSP3 and CSP4 in adult B. dorsalis was higher than that in B. minax; and 3) most of the OR genes exhibited low expression at the three developmental stages in both species. This study on the identification of the chemosensory system of B. minax not only enriches the existing research on insect olfactory receptors but also provides new targets for preventative control and ecological regulation of B. minax in the future.


September 22, 2019  |  

A workflow for studying specialized metabolism in nonmodel eukaryotic organisms

Eukaryotes contain a diverse tapestry of specialized metabolites, many of which are of significant pharmaceutical and industrial importance to humans. Nevertheless, exploration of specialized metabolic pathways underlying specific chemical traits in nonmodel eukaryotic organisms has been technically challenging and has historically lagged behind that of bacterial systems. Recent advances in genomics, metabolomics, phylogenomics, and synthetic biology now enable a new workflow for interrogating unknown specialized metabolic systems in nonmodel eukaryotic hosts with greater efficiency and mechanistic depth. This chapter delineates such a workflow by providing a collection of state-of-the-art approaches and tools, ranging from multiomics-guided candidate gene identification to in vitro and in vivo functional and structural characterization of specialized metabolic enzymes. As already demonstrated by several recent studies, this new workflow opens up a gateway into the largely untapped world of natural product biochemistry in eukaryotes. © 2016 Elsevier Inc. All rights reserved.


September 22, 2019  |  

PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.

High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), in a single-threading running manner. PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.


September 22, 2019  |  

Recent developments in using advanced sequencing technologies for the genomic studies of lignin and cellulose degrading microorganisms.

Lignin is a complex polyphenyl aromatic compound which exists in tight association with cellulose and hemicellulose to form plant primary and secondary cell walls. Lignocellulose is an abundant renewable biomaterial present on the earth. It has gained much attention in the scientific community in recent years because of its potential applications in bio-based industries. Microbial degradation of lignocellulose polymers has been well studied in wood decaying fungi. Based on the plant materials they degrade, these fungi were classified as white rot, brown rot and soft rot. However, some groups of bacteria belonging to the actinomycetes, α-proteobacteria and β-proteobacteria were also found to be efficient in degrading lignocellulosic biomass, but they are less well understood than the fungi. In this review we focus on recent advancements deployed for finding and understanding lignocellulose degradation by microorganisms. Conventional molecular methods like sequencing 16S rRNA and Internal Transcribed Spacer (ITS) regions were used for identification and classification of microbes. Recent progress in genomics, mainly next generation sequencing technologies, has made whole genome sequencing of microbes possible with great ease. Whole genome sequence studies reveal high quality information about genes and canonical pathways involved in the degradation of lignin and other cell wall components.


September 22, 2019  |  

Long-read sequencing of human cytomegalovirus transcriptome reveals RNA isoforms carrying distinct coding potentials.

The human cytomegalovirus (HCMV) is a ubiquitous, human pathogenic herpesvirus. The complete viral genome is transcriptionally active during infection; however, a large part of its transcriptome has yet to be annotated. In this work, we applied the amplified isoform sequencing technique from Pacific Biosciences to characterize the lytic transcriptome of HCMV strain Towne varS. We developed a pipeline for transcript annotation using long-read sequencing data. We identified 248 transcriptional start sites, 116 transcriptional termination sites and 80 splicing events. Using this information, we have annotated 291 previously undescribed or only partially annotated transcript isoforms, including eight novel antisense transcripts and their isoforms, as well as a novel transcript (RS2) in the short repeat region, partially antisense to RS1. Similarly to other organisms, we discovered a high transcriptional diversity in HCMV, with many transcripts only slightly differing from one another. Comparing our transcriptome profiling results to an earlier ribosome footprint analysis, we have concluded that the majority of the transcripts contain multiple translationally active ORFs, and also that most isoforms contain unique combinations of ORFs. Based on these results, we propose that one important function of this transcriptional diversity may be to provide a regulatory mechanism at the level of translation.


September 22, 2019  |  

Single molecule, full-length transcript sequencing provides insight into the extreme metabolism of ruby-throated hummingbird Archilochus colubris

Hummingbirds oxidize ingested nectar sugars directly to fuel foraging but cannot sustain this fuel use during fasting periods, such as during the night or during long-distance migratory flights. Instead, fasting hummingbirds switch to oxidizing stored lipids, derived from ingested sugars. The hummingbird liver plays a key role in moderating energy homeostasis and this remarkable capacity for fuel switching. Additionally, liver is the principal location of de novo lipogenesis, which can occur at exceptionally high rates, such as during premigratory fattening. Yet understanding how this tissue and whole organism moderates energy turnover is hampered by a lack of information regarding how relevant enzymes differ in sequence, expression, and regulation. We generated a de novo transcriptome of the hummingbird liver using PacBio full-length cDNA sequencing (Iso-Seq), yielding a total of 8.6 Gb of sequencing data, or 2.6M reads from 4 different size fractions. We analyzed data using the SMRTAnalysis v3.1 Iso-Seq pipeline, then clustered isoforms into gene families to generate de novo gene contigs using Cogent. We performed orthology analysis to identify closely related sequences between our transcriptome and other avian and human gene sets. Finally, we closely examined homology of critical lipid metabolism genes between our transcriptome data and avian and human genomes. We confirmed high levels of sequence divergence within hummingbird lipogenic enzymes, suggesting a high probability of adaptive divergent function in the hepatic lipogenic pathways. Our results leverage cutting-edge technology and a novel bioinformatics pipeline to provide a first direct look at the transcriptome of this incredible organism.


September 22, 2019  |  

Fluorescently-tagged human eIF3 for single-molecule spectroscopy.

Human translation initiation relies on the combined activities of numerous ribosome-associated eukaryotic initiation factors (eIFs). The largest factor, eIF3, is an ~800 kDa multiprotein complex that orchestrates a network of interactions with the small 40S ribosomal subunit, other eIFs, and mRNA, while participating in nearly every step of initiation. How these interactions take place during the time course of translation initiation remains unclear. Here, we describe a method for the expression and affinity purification of a fluorescently-tagged eIF3 from human cells. The tagged eIF3 dodecamer is structurally intact, functions in cell-based assays, and interacts with the HCV IRES mRNA and the 40S-IRES complex in vitro. By tracking the binding of single eIF3 molecules to the HCV IRES RNA with a zero-mode waveguides-based instrument, we show that eIF3 samples both wild-type IRES and an IRES that lacks the eIF3-binding region, and that the high-affinity eIF3-IRES interaction is largely determined by slow dissociation kinetics. The application of single-molecule methods to more complex systems involving eIF3 may unveil dynamics underlying mRNA selection and ribosome loading during human translation initiation. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.


September 22, 2019  |  

The genomes of Crithidia bombi and C. expoeki, common parasites of bumblebees.

Trypanosomatids (Trypanosomatidae, Kinetoplastida) are flagellated protozoa containing many parasites of medical or agricultural importance. Among those, Crithidia bombi and C. expoeki are common parasites in bumble bees around the world, and phylogenetically close to Leishmania and Leptomonas. They have a simple and direct life cycle with one host, and partially castrate the founding queens, greatly reducing their fitness. Here, we report the nuclear genome sequences of one clone of each species, extracted from a field-collected infection. Using a combination of Roche 454 FLX Titanium, Pacific Biosciences PacBio RS, and Illumina GA2 instruments for C. bombi, and PacBio for C. expoeki, we could produce high-quality and well resolved sequences. We find that these genomes are around 32 and 34 Mb, with 7,808 and 7,851 annotated genes for C. bombi and C. expoeki, respectively, which is somewhat less than reported from other trypanosomatids, with few introns, and organized in polycistronic units. A large fraction of genes received plausible functional support in comparison primarily with Leishmania and Trypanosoma. Comparing the annotated genes of the two species with those of six other trypanosomatids (C. fasciculata, L. pyrrhocoris, L. seymouri, B. ayalai, L. major, and T. brucei) shows similar gene repertoires and many orthologs. Similar to other trypanosomatids, we also find signs of concerted evolution in genes putatively involved in the interaction with the host, a high degree of synteny between C. bombi and C. expoeki, and considerable overlap with several other species in the set. A total of 86 orthologous gene groups show signatures of positive selection in the branch leading to the two Crithidia under study, mostly of unknown function. As an example, we examined the initiating glycosylation pathway of surface components in C. bombi, finding it deviates from most other eukaryotes and also from other kinetoplastids, which may indicate rapid evolution in the extracellular matrix that is involved in interactions with the host. Bumble bees are important pollinators and Crithidia-infections are suspected to cause substantial selection pressure on their host populations. These newly sequenced genomes provide tools that should help better understand host-parasite interactions in these pollinator pathogens.

