Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods that provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA Sequencing with short reads (SMRT CCS (circular consensus) or second-generation reads), wherein the short reads are used to error-correct the long reads, which are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which SMRT sequencing reads from a single long insert library are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT Sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. With relatively short sequencing run times and automated analysis pipelines, it is possible to go from an unknown DNA sample to its complete de novo genome and epigenome in about a day.
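The quoted equivalence between QV50 and >99.999% accuracy follows from the standard Phred scale, QV = -10 * log10(error probability). A minimal sketch of that conversion is given here for orientation; the function names and the 5 Mb genome size in the usage note are illustrative assumptions and are not taken from the abstract.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Phred-scaled quality value implied by a per-base accuracy."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(qv: float, genome_length: int) -> float:
    """Expected number of erroneous bases in an assembly of the given length."""
    return genome_length * 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    # Hypothetical 5 Mb bacterial genome, used only as an illustration.
    print(f"Expected errors at QV50: {expected_errors(50, 5_000_000):.1f}")
```

Under these assumptions, a hypothetical 5 Mb genome finished at QV50 would be expected to carry on the order of 50 erroneous bases.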
An improved circular consensus algorithm with an application to detection of HIV-1 Drug-Resistance Associated Mutations (DRAMs)
Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger Sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm, dubbed CCS2, that allows the generation of circular consensus sequences (CCS) from individual SMRT Sequencing reads. With CCS2, it is possible to reach arbitrarily high quality across longer insert lengths at a lower cost and higher throughput than Sanger Sequencing. We apply CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation, which is both clinically important and challenging due to its regional sequence context. A mutation was introduced into the third nucleotide position of codon 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1,300 bp were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Sequencing data were analyzed using the CCS2 algorithm, which uses a fully generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to arbitrarily high accuracy. This result, previously demonstrated for multi-molecule consensus sequences with the Quiver algorithm, is made possible by incorporating per-Zero Mode Waveguide (ZMW) characteristics, thus accounting for the intrinsic changes in the sequencing process that are unique to each ZMW. With CCS2, we are able to achieve a per-read empirical quality of QV30 with 19X coverage. This yields ~5,000 1.3 kb consensus sequences with a collective empirical quality of ~QV40. Additionally, we demonstrate a 0% miscall rate in both unmixed samples, and estimate a 48:52% frequency for the K103N mutation in the mixed sample, consistent with data produced by orthogonal platforms.
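To illustrate how a mixture frequency such as the reported 48:52 split could be estimated once high-quality consensus reads are in hand, the sketch below counts the base observed at one position across reads and attaches a Wilson score interval. It is not the CCS2 algorithm or the authors' pipeline; the function names, the toy three-base reads, and the assumption that reads are already aligned to a common coordinate are all hypothetical.

```python
import math
from collections import Counter


def base_frequencies(reads, position):
    """Fraction of each base observed at a 0-based position across reads.

    Assumes reads are already aligned so that `position` refers to the
    same site in every read (an illustrative simplification).
    """
    counts = Counter(read[position] for read in reads if position < len(read))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read covers the requested position")
    return {base: n / total for base, n in counts.items()}


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Toy codon-103 reads: AAA (wild type, K) versus AAC (mutant, N).
    reads = ["AAA"] * 52 + ["AAC"] * 48
    print(base_frequencies(reads, 2))
    # Interval for 2,400 mutant calls out of a hypothetical 5,000 reads.
    print(wilson_interval(2400, 5000))
```

With several thousand consensus reads, the interval around a roughly even split is only a few percentage points wide under this simple binomial model, which ignores residual sequencing error and PCR bias.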
A brief animated introduction to Pacific Biosciences’ Single Molecule, Real-Time (SMRT) Sequencing, including the SMRT Cell and ZMW (zero mode waveguide).
Translation initiation determines both the quantity and identity of the protein that is encoded in an mRNA by establishing the reading frame for protein synthesis. In eukaryotic cells, numerous translation initiation factors prepare ribosomes for polypeptide synthesis; however, the underlying dynamics of this process remain unclear [1,2]. A central question is how eukaryotic ribosomes transition from translation initiation to elongation. Here we use in vitro single-molecule fluorescence microscopy approaches in a purified yeast Saccharomyces cerevisiae translation system to monitor directly, in real time, the pathways of late translation initiation and the transition to elongation. This transition was slower in our eukaryotic system than that reported for Escherichia coli [3-5]. The slow entry to elongation was defined by a long residence time of eukaryotic initiation factor 5B (eIF5B) on the 80S ribosome after the joining of individual ribosomal subunits, a process that is catalysed by this universally conserved initiation factor. Inhibition of the GTPase activity of eIF5B after the joining of ribosomal subunits prevented the dissociation of eIF5B from the 80S complex, thereby preventing elongation. Our findings illustrate how the dissociation of eIF5B serves as a kinetic checkpoint for the transition from initiation to elongation, and how its release may be governed by a change in the conformation of the ribosome complex that triggers GTP hydrolysis.
Prokaryotic DNA contains three types of methylation: N6-methyladenine, N4-methylcytosine and 5-methylcytosine. The lack of tools to analyse the frequency and distribution of methylated residues in bacterial genomes has prevented a full understanding of their functions. Now, advances in DNA sequencing technology, including single-molecule, real-time sequencing and nanopore-based sequencing, have provided new opportunities for systematic detection of all three forms of methylated DNA at a genome-wide scale and offer unprecedented opportunities for achieving a more complete understanding of bacterial epigenomes. Indeed, as the number of mapped bacterial methylomes approaches 2,000, increasing evidence supports roles for methylation in regulation of gene expression, virulence and pathogen-host interactions.
Assessment of the microbial diversity of Chinese Tianshan tibicos by single molecule, real-time sequencing technology.
Chinese Tianshan tibico grains were collected from the rural area of Tianshan in Xinjiang province, China. Typical tibico grains are known to consist of a polysaccharide matrix that embeds a variety of bacteria and yeasts. These grains are widely used in some rural regions to produce a beneficial sugary beverage that is slightly acidic and contains a low level of alcohol. This work aimed to characterize the microbiota composition of Chinese Tianshan tibicos using single molecule, real-time sequencing technology, which is advantageous in generating long reads. Our results revealed that the microbiota mainly comprised the bacterial species Lactobacillus hilgardii, Lactococcus raffinolactis, Leuconostoc mesenteroides, and Zymomonas mobilis, together with a fungal community dominated by Guehomyces pullulans. The data generated in this work help identify beneficial microbes in Chinese Tianshan tibico grains.
Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. elegans available today. To provide a more accurate C. elegans genome, we performed long-read assembly of VC2010, a modern strain derived from N2. Our VC2010 assembly has 99.98% identity to N2 but with an additional 1.8 Mb including tandem repeat expansions and genome duplications. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2. Over 98% of N2 genes encoded unchanged products in VC2010; moreover, we predicted ≥53 new genes in VC2010. The recompleted genome of C. elegans should be a valuable resource for genetics, genomics, and systems biology. © 2019 Yoshimura et al.; Published by Cold Spring Harbor Laboratory Press.
Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The dataset was generated using the Pacific Biosciences Sequel System. The son and each parent were sequenced to an average coverage of 60-fold and 30-fold, respectively, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both GRCh37 and GRCh38 are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/). The GRCh38 aligned read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.
A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for the standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over one-third of the genome. Because the sequenced and assembled material came from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size.
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
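The contiguity figure quoted above (contig N50 3.5 Mb) uses the conventional N50 statistic: the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch of that definition follows; the function name and the toy contig lengths are illustrative assumptions and do not come from the assembly described here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths in bp, chosen only to show the calculation.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, the same function applies equally to subread lengths and to contig lengths.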
Adeno-associated virus genome population sequencing achieves full vector genome resolution and reveals human-vector chimeras
Recombinant adeno-associated virus (rAAV)-based gene therapy has entered a phase of clinical translation and commercialization. Despite this progress, vector integrity following production is often overlooked. Compromised vectors may negatively impact therapeutic efficacy and safety. Using single molecule, real-time (SMRT) sequencing, we can comprehensively profile packaged genomes as single intact molecules and directly assess vector integrity without extensive preparation. We have exploited this methodology to profile all heterogeneous populations of self-complementary AAV genomes via bioinformatics pipelines and have coined this approach AAV-genome population sequencing (AAV-GPseq). The approach can reveal the relative distribution of truncated genomes versus full-length genomes in vector preparations. Preparations that seemingly show high genome homogeneity by gel electrophoresis are revealed to consist of less than 50% full-length species. With AAV-GPseq, we can also detect many reverse-packaged genomes that encompass sequences originating from plasmid backbone, as well as sequences from packaging and helper plasmids. Finally, we detect host-cell genomic sequences that are chimeric with inverted terminal repeat (ITR)-containing vector sequences. We show that vector populations can contain between 1.3% and 2.3% of this type of undesirable genome. These discoveries redefine quality control standards for viral vector preparations and highlight the degree of foreign products in rAAV-based therapeutic vectors.
High-throughput sequencing of 16S rRNA gene amplicons has revolutionized the capacity and depth of microbial community profiling. Several sequencing platforms are available, but most phylogenetic studies are performed on the 454-pyrosequencing platform because its longer reads can give finer phylogenetic resolution. The Pacific Biosciences (PacBio) sequencing platform is significantly less expensive per run, does not rely on amplification for library generation, and generates reads that are, on average, four times longer than those from 454 (C2 chemistry), but the resulting high error rates appear to preclude its use in phylogenetic profiling. Recently, however, the PacBio platform was used to characterize four electrosynthetic microbiomes to the genus-level for less than USD 1,000 through the use of PacBio’s circular consensus sequence technology. Here, we describe in greater detail: 1) the output from successful 16S rRNA gene amplicon profiling with PacBio, 2) how the analysis was contingent upon several alterations to standard bioinformatic quality control workflows, and 3) the advantages and disadvantages of using the PacBio platform for community profiling.
Recent advances in sequencing technologies have transformed the field of virus discovery and virome analysis. Individual virus discovery based on traditional Sanger sequencing, once the norm, has now been entirely replaced by high-throughput sequencing (HTS)-based virus metagenomics, which can be used to characterize the nature and composition of entire viromes. To better harness the potential of HTS for the study of viromes, sample preparation methodologies use different approaches to exclude amplification of non-viral components that can overshadow low-titer viruses. These virus-sequence enrichment approaches mostly focus on sample preparation methods, such as enzymatic digestion of non-viral nucleic acids and size exclusion of non-viral constituents by column filtration, ultrafiltration or density gradient centrifugation. Recently, however, a new virus-sequence enrichment approach called virome-capture sequencing, which focuses on the amplification or HTS library preparation stage, was developed to improve virome characterization. This new approach has the potential to further transform the field of virus discovery and virome analysis, but its technical complexity and sequence-dependence warrant further improvements. In this review we discuss the different methods for selective sequencing-based virome analysis, their applications and their evolution, and propose refinements needed to harness the full potential of HTS for virome analysis. Copyright © 2017 Elsevier B.V. All rights reserved.
Forty years ago the advent of Sanger sequencing was revolutionary as it allowed complete genome sequences to be deciphered for the first time. A second revolution came when next-generation sequencing (NGS) technologies appeared, which made genome sequencing much cheaper and faster. However, NGS methods have several drawbacks and pitfalls, most notably their short reads. Recently, third-generation/long-read methods appeared, which can produce genome assemblies of unprecedented quality. Moreover, these technologies can directly detect epigenetic modifications on native DNA and allow whole-transcript sequencing without the need for assembly. This marks the third revolution in sequencing technology. Here we review and compare the various long-read methods. We discuss their applications and their respective strengths and weaknesses and provide future perspectives. Copyright © 2018 Elsevier Ltd. All rights reserved.
Analysis of RNA base modification and structural rearrangement by single-molecule real-time detection of reverse transcription.
Zero-mode waveguides (ZMWs) are photonic nanostructures that create highly confined optical observation volumes, thereby allowing single-molecule-resolved biophysical studies at relatively high concentrations of fluorescent molecules. This principle has been successfully applied in single-molecule, real-time (SMRT®) DNA sequencing for the detection of DNA sequences and DNA base modifications. In contrast, RNA sequencing methods cannot provide sequence and RNA base modifications concurrently as they rely on complementary DNA (cDNA) synthesis by reverse transcription followed by sequencing of cDNA. Thus, information on RNA modifications is lost during the process of cDNA synthesis. Here we describe an application of SMRT technology to follow the activity of reverse transcriptase enzymes synthesizing cDNA on thousands of single RNA templates simultaneously in real time with single nucleotide turnover resolution using arrays of ZMWs. This method thereby obtains information from the RNA template directly. The analysis of the kinetics of the reverse transcriptase can be used to identify RNA base modifications, shown by example for N6-methyladenine (m6A) in oligonucleotides and in a specific mRNA extracted from total cellular mRNA. Furthermore, the real-time reverse transcriptase dynamics informs about RNA secondary structure and its rearrangements, as demonstrated on a ribosomal RNA and an mRNA template. Our results highlight the feasibility of studying RNA modifications and RNA structural rearrangements in ZMWs in real time. In addition, they suggest that technology can be developed for direct RNA sequencing provided that the reverse transcriptase is optimized to resolve homonucleotide stretches in RNA.
Single Molecule Sequencing: new outlooks for solving genome assembly and transcripts identification challenges
In this review, we introduce a novel sequencing technology named Single Molecule Real Time sequencing. Also called Single Molecule Sequencing because it does not require any amplification, this new technology is able to produce much longer reads than previous NGS technologies such as Illumina. This improvement in read size, which can reach 150-fold, will solve many challenges posed by current NGS technologies. Short NGS reads, reaching a maximum size of 300 bp, make it hard to reconstruct a whole genome and tend to produce fragmented genome assemblies. It is also difficult to correctly infer transcript identification and quantification when isoform diversity is high. Despite their higher error rate, long reads have shown very promising results on these issues. We show that longer reads can produce less fragmented assemblies of better quality, and can also sequence mRNAs from start to end, making it much easier to infer correct transcript quantification and even allowing the discovery of new intron structures and thus new isoforms.