Third-generation single-molecule sequencing technologies from Pacific Biosciences, Moleculo, Oxford Nanopore, and other companies are revolutionizing genomics by enabling the sequencing of long, individual molecules of DNA and RNA. One major advantage of these technologies over current short-read sequencing is the ability to sequence much longer molecules: thousands or tens of thousands of nucleotides instead of mere hundreds. This capacity gives researchers substantially greater power to probe microbial, plant, and animal genomes, but it remains unclear how best to use these data. To address this, we systematically evaluated the human genome and 25 other important genomes across the tree of life, ranging in size from 1 Mbp to 3 Gbp, to determine how long the reads need to be and how much coverage is necessary to completely assemble their chromosomes with single-molecule sequencing. We also present a novel error correction and assembly algorithm that uses a combination of PacBio and pre-assembled Illumina sequencing. This new algorithm greatly outperforms other published hybrid algorithms.
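The relationship between read length, read count, and coverage discussed above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes the idealized Lander-Waterman model (reads placed uniformly at random along the genome); it is not the evaluation method used in the study, and the function names and example numbers are hypothetical.

```python
import math


def expected_coverage(num_reads, mean_read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


def expected_uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read under the idealized
    Lander-Waterman model (uniform random read placement)."""
    return math.exp(-coverage)


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


if __name__ == "__main__":
    # Hypothetical example: 10 kbp mean read length, 3 Gbp genome, 30x target.
    n = reads_needed(30, 10_000, 3_000_000_000)
    print(n, expected_coverage(n, 10_000, 3_000_000_000))
```

Real data deviate from this model because of coverage bias and repeats longer than the reads, which is why the read length needed for complete assembly has to be evaluated per genome.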
SFAF 2014 Presentation Slides: James Gurtowski of Cold Spring Harbor Laboratory (CSHL) shared assembly results for a variety of eukaryotic genomes, including yeast, Arabidopsis, and rice.
Comparative genome analysis of Clavibacter michiganensis subsp. michiganensis strains provides insights into genetic diversity and virulence.
Clavibacter michiganensis subsp. michiganensis (Cmm) is a Gram-positive actinomycete that causes bacterial canker of tomato (Solanum lycopersicum), a disease that can cause significant losses in tomato production. In this study, we determined the complete genome sequences of 13 California Cmm strains and one saprophytic Clavibacter strain using a combination of Illumina and PacBio sequencing. The California Cmm strains have genome sizes (3.2-3.3 Mb) similar to that of the reference strain NCPPB382 (3.3 Mb), with ≥98% sequence identity. Cmm strains from California share ≥92% of their genes with the reference Cmm strain NCPPB382 (8-10% are novel genes). Despite this similarity, we detected significant alterations in California strains with respect to plasmid number, plasmid composition, and genomic island presence, indicating acquisition of unique mechanisms controlling virulence. Plasmids pCM1 and pCM2, which were previously demonstrated to be required for NCPPB382 virulence, also differ in their presence and gene content across Cmm strains. pCM2 is absent in some Cmm strains that still retain virulence in tomato. The saprophytic Clavibacter strain possesses a novel plasmid, pSCM, and lacks the majority of characterized virulence factors. Genome sequence information was also used to design specific and sensitive primer pairs for Cmm detection. A mechanistic understanding of how genomic changes have impacted Cmm virulence and survival across diverse strains will be necessary for developing robust disease control strategies for bacterial canker of tomato.
PacBio bioinformatician Elizabeth Tseng reviews bioinformatics strategies that use PacBio long-read isoform sequencing data to obtain full-length transcripts without assembly.
2015 SMRT Informatics Developers Conference Presentation Slides: Shinichi Morishita of the University of Tokyo presented on how his team has been using SMRT Sequencing to better understand methylomes, metagenomes and structural variation of various eukaryotic genomes.
A comprehensive study of the sugar pine (Pinus lambertiana) transcriptome implemented through diverse next-generation sequencing approaches
The assembly, annotation, and characterization of the sugar pine (Pinus lambertiana Dougl.) transcriptome represents an opportunity to study the genetic mechanisms underlying resistance to the invasive white pine blister rust (WPBR; Cronartium ribicola) as well as responses to abiotic stresses. The assembled transcripts also provide a resource to improve the genome assembly. We selected a diverse set of tissues, allowing the first comprehensive evaluation of the sugar pine gene space. We combined short-read sequencing technologies (Illumina MiSeq and HiSeq) with the relatively new Pacific Biosciences Iso-Seq approach. From 2.5 billion Illumina reads and 1.6 million PacBio reads (46 SMRT cells), 33,720 unigenes were de novo assembled. Comparison of sequencing technologies revealed improved coverage with Illumina HiSeq reads and better splice variant detection with PacBio Iso-Seq reads. The genes identified as unique to each library range from 199 transcripts (basket seedling) to 3,482 transcripts (female cones). In total, 10,026 transcripts were shared by all libraries. Genes differentially expressed in response to these provided insight into abiotic and biotic stress responses. To analyze orthologous sequences, we compared the translated sequences against 19 plant species, identifying 7,229 transcripts that clustered uniquely among the conifers. We have generated here a high-quality transcriptome from one WPBR-susceptible and one WPBR-resistant sugar pine individual. Through the comprehensive tissue sampling and the depth of sequencing achieved, detailed information on disease resistance can be further examined.
From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome
Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the assembly surpasses reference assemblies, de novo, with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better-quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and with minimal manual effort.
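The general idea behind MinHash-based overlap detection can be sketched in a few lines. The example below is a minimal illustration of estimating k-mer set similarity (Jaccard index) between two reads from fixed-size sketches; it is not MHAP's actual implementation, and the function names, the hash function (BLAKE2b), the k-mer size, the sketch size, and the example reads are all assumptions chosen for readability.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (one 'hash function' per seed)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


if __name__ == "__main__":
    # Hypothetical reads: read_b overlaps the last 40 bases of read_a.
    read_a = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATTCAGCTAGGATCCGATACGTTAGCCATGA"
    read_b = read_a[20:] + "GGTCAATGCCTTAGAACTGG"
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Because only the fixed-size sketches are compared, candidate overlaps can be screened without aligning every pair of reads, which is the property that makes this family of methods attractive for large genomes.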
This tutorial provides an overview of the Hierarchical Genome Assembly Process (HGAP4) de novo assembly analysis application. HGAP4 generates accurate de novo assemblies using only PacBio data. HGAP4 is suitable…
Webinar: Complete genomes within reach – Closing bacterial genomes from the lakes of Minnesota to NYC hospitals
In this webinar, Ben Auch, Research Scientist, Innovation Lab, University of Minnesota Genomics Center, Cody Sheik, Assistant Professor of Biology, University of Minnesota Duluth, and Harm van Bakel, Assistant Professor…
To start Day 1 of the PacBio User Group Meeting, Jonas Korlach, PacBio CSO, provides an update on the latest releases and performance metrics for the Sequel II System. The…
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Evolution of a 72-kb cointegrant, conjugative multiresistance plasmid from early community-associated methicillin-resistant Staphylococcus aureus isolates.
Horizontal transfer of plasmids encoding antimicrobial-resistance and virulence determinants has been instrumental in Staphylococcus aureus evolution, including the emergence of community-associated methicillin-resistant S. aureus (CA-MRSA). In the early 1990s the first CA-MRSA isolated in Western Australia (WA), WA-5, encoded cadmium, tetracycline and penicillin-resistance genes on plasmid pWBG753 (~30 kb). WA-5 and pWBG753 appeared only briefly in WA; however, fusidic-acid-resistance plasmids related to pWBG753 were also present in the first European CA-MRSA at the time. Here we characterized a 72-kb conjugative plasmid, pWBG731, present in multiresistant WA-5-like clones from the same period. pWBG731 was a cointegrant formed from pWBG753 and a pWBG749-family conjugative plasmid. pWBG731 carried mupirocin, trimethoprim, cadmium and penicillin-resistance genes. The stepwise evolution of pWBG731 likely occurred through the combined actions of IS257, IS257-dependent miniature inverted-repeat transposable elements (MITEs) and the BinL resolution system of the β-lactamase transposon Tn552. An evolutionary intermediate, the ~42-kb non-conjugative plasmid pWBG715, possessed the same resistance genes as pWBG731 but retained an integrated copy of the small tetracycline-resistance plasmid pT181. IS257 likely facilitated replacement of pT181 with conjugation genes on pWBG731, thus enabling autonomous transfer. Like conjugative plasmid pWBG749, pWBG731 also mobilized non-conjugative plasmids carrying oriT mimics. It seems likely that pWBG731 represents the product of multiple recombination events between the WA-5 pWBG753 plasmid and other mobile genetic elements present in indigenous CA-MSSA. The molecular evolution of pWBG731 saliently illustrates how diverse mobile genetic elements can together facilitate rapid accrual and horizontal dissemination of multiresistance in S. aureus CA-MRSA. Copyright © 2019 American Society for Microbiology.
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
Forest tree species are increasingly subject to severe mortalities from exotic pests, diseases, and invasive organisms, accelerated by climate change. Forest health issues are threatening multiple species and ecosystem sustainability globally. While sources of resistance may be available in related species, or among surviving trees, introgression of resistance genes into threatened tree species in reasonable time frames requires genome-wide breeding tools. Asian species of chestnut (Castanea spp.) are being employed as donors of disease resistance genes to restore native chestnut species in North America and Europe. To aid in the restoration of threatened chestnut species, we present the assembly of a reference genome with chromosome-scale sequences for Chinese chestnut (C. mollissima), the disease-resistance donor for American chestnut restoration. We also demonstrate the value of the genome as a platform for research and species restoration, including new insights into the evolution of blight resistance in Asian chestnut species, the locations in the genome of ecologically important signatures of selection differentiating American chestnut from Chinese chestnut, the identification of candidate genes for disease resistance, and preliminary comparisons of genome organization with related species.
Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline
Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
List of abbreviations: TE, Transposable Elements; LTR, Long Terminal Repeat; LINE, Long Interspersed Nuclear Element; SINE, Short Interspersed Nuclear Element; MITE, Miniature Inverted Transposable Element; TIR, Terminal Inverted Repeat; TSD, Target Site Duplication; TP, True Positives; FP, False Positives; TN, True Negative; FN, False Negatives; GRF, Generic Repeat Finder; EDTA, Extensive de-novo TE Annotator.
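The TP, FP, TN, and FN counts named in the abbreviation list are the usual inputs to benchmark metrics. The sketch below shows how standard metrics can be derived from such counts; the abstract does not state which metrics the study reports, so the selection here (sensitivity, specificity, precision, accuracy, F1) is an assumption, and the class and function names and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives, e.g. TE bases correctly annotated as TE
    fp: int  # false positives, e.g. non-TE bases annotated as TE
    tn: int  # true negatives, e.g. non-TE bases correctly left unannotated
    fn: int  # false negatives, e.g. TE bases missed by the annotation


def _ratio(numerator, denominator):
    """Safe division: NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def benchmark_metrics(c):
    """Standard metrics derived from confusion-matrix counts."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "accuracy": _ratio(c.tp + c.tn, total),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in benchmark_metrics(ConfusionCounts(80, 10, 90, 20)).items():
        print(f"{name}: {value:.3f}")
```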