April 21, 2020

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.


April 21, 2020  |  

The Genome of the Zebra Mussel, Dreissena polymorpha: A Resource for Invasive Species Research

The zebra mussel, Dreissena polymorpha, continues to spread from its native range in Eurasia to Europe and North America, causing billions of dollars in damage and dramatically altering invaded aquatic ecosystems. Despite these impacts, there are few genomic resources for Dreissena or related bivalves, with nearly 450 million years of divergence between zebra mussels and their closest sequenced relative. Although the D. polymorpha genome is highly repetitive, we have used a combination of long-read sequencing and Hi-C-based scaffolding to generate the highest quality molluscan assembly to date. Through comparative analysis and transcriptomics experiments we have gained insights into processes that likely control the invasive success of zebra mussels, including shell formation, synthesis of byssal threads, and thermal tolerance. We identified multiple intact Steamer-Like Elements, a retrotransposon that has been linked to transmissible cancer in marine clams. We also found that D. polymorpha has an unusual 67 kb mitochondrial genome containing numerous tandem repeats, making it the largest observed in Eumetazoa. Together these findings create a rich resource for invasive species research and control efforts.


April 21, 2020  |  

Whole-genome sequence of Arthrinium phaeospermum, a globally distributed pathogenic fungus.

Arthrinium phaeospermum (Corda) M.B. Ellis is a globally distributed pathogenic fungus with a wide host range; its hosts include not only plants, but also humans and animals. This study aimed to develop genomic resources for A. phaeospermum to provide solid data and a theoretical basis for further studies of its pathogenesis, transcriptomics, proteomics, metabolomics and RNA genomics. The genome was obtained from the mycelia of the strain AP-Z13 using a combination of analyses with the high-throughput Illumina HiSeq 4000 system and PacBio RSII LongRead sequencing platform. Functional annotation was performed by BLASTing protein sequences against those in different publicly available databases to obtain their corresponding annotations. The genome is 48.45 Mb in size, with an N90 scaffold size of 1,931,147 bp, and encodes 19,836 putative predicted genes. This is the first report of the genome-scale assembly and annotation for A. phaeospermum, the first species in the genus Arthrinium to be subjected to whole genome sequencing. Copyright © 2019 Elsevier Inc. All rights reserved.


April 21, 2020  |  

RNA sequencing: the teenage years.

Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.


April 21, 2020  |  

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Long-read RNA sequencing (RNA-seq) is promising for transcriptomics studies; however, the alignment of the reads is still a fundamental but non-trivial task due to sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult issues, such as small exons, serious sequencing errors and consensus spliced alignment. Benchmarks demonstrate that this approach has a better ability to produce high-quality full-length alignments, which has enormous potential for transcriptomics studies.


April 21, 2020  |  

A comprehensive evaluation of long read error correction methods

Motivation: Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.


April 21, 2020  |  

The Complete Genome of the Atypical Enteropathogenic Escherichia coli Archetype Isolate E110019 Highlights a Role for Plasmids in Dissemination of the Type III Secreted Effector EspT.

Enteropathogenic Escherichia coli (EPEC) is a leading cause of moderate to severe diarrhea among young children in developing countries, and EPEC isolates can be subdivided into two groups. Typical EPEC (tEPEC) bacteria are characterized by the presence of both the locus of enterocyte effacement (LEE) and the plasmid-encoded bundle-forming pilus (BFP), which are involved in adherence and translocation of type III effectors into the host cells. Atypical EPEC (aEPEC) bacteria also contain the LEE but lack the BFP. In the current report, we describe the complete genome of outbreak-associated aEPEC isolate E110019, which carries four plasmids. Comparative genomic analysis demonstrated that the type III secreted effector EspT gene, an autotransporter gene, a hemolysin gene, and putative fimbrial genes are all carried on plasmids. Further investigation of 65 espT-containing E. coli genomes demonstrated that different espT alleles are associated with multiple plasmids that differ in their overall gene content from the E110019 espT-containing plasmid. EspT has been previously described with respect to its role in the ability of E110019 to invade host cells. While other type III secreted effectors of E. coli have been identified on insertion elements and prophages of the chromosome, we demonstrated in the current study that the espT gene is located on multiple unique plasmids. These findings highlight a role of plasmids in dissemination of a unique E. coli type III secreted effector that is involved in host invasion and severe diarrheal illness. Copyright © 2019 American Society for Microbiology.

