First- and second-generation sequencing technologies have led the way in revolutionizing the field of genomics and beyond, motivating an astonishing number of scientific advances, including enabling a more complete understanding of whole genome sequences and the information encoded therein, a more complete characterization of the methylome and transcriptome and a better understanding of interactions between proteins and DNA. Nevertheless, there are sequencing applications and aspects of genome biology that are presently beyond the reach of current sequencing technologies, leaving fertile ground for additional innovation in this space. In this review, we describe a new generation of single-molecule sequencing technologies (third-generation sequencing) that is emerging to fill this space, with the potential for dramatically longer read lengths, shorter time to result and lower overall cost.
Preparation of next-generation DNA sequencing libraries from ultra-low amounts of input DNA: Application to single-molecule, real-time (SMRT) sequencing on the Pacific Biosciences RS II.
We have developed and validated an amplification-free method for generating DNA sequencing libraries from very low amounts of input DNA (500 picograms – 20 nanograms) for single-molecule sequencing on the Pacific Biosciences (PacBio) RS II sequencer. The common challenge of high input requirements for single-molecule sequencing is overcome by using a carrier DNA in conjunction with optimized sequencing preparation conditions and re-use of the MagBead-bound complex. Here we describe how this method can be used to produce sequencing yields comparable to those generated from standard input amounts, but by using 1000-fold less starting material.
Pacific Biosciences has developed a method for real-time sequencing of single DNA molecules (Eid et al., 2009), with intrinsic sequencing rates of several bases per second and read lengths into the kilobase range. Conceptually, this sequencing approach is based on eavesdropping on the activity of DNA polymerase carrying out template-directed DNA polymerization. Performed in a highly parallel operational mode, sequential base additions catalyzed by each polymerase are detected with terminal phosphate-linked, fluorescence-labeled nucleotides. This chapter will first outline the principle of this single-molecule, real-time (SMRT) DNA sequencing method, followed by descriptions of its underlying components and typical sequencing run conditions. Two examples are provided which illustrate that, in addition to the DNA sequence, the dynamics of DNA polymerization by each enzyme molecule are directly accessible: the determination of base-specific kinetic parameters from single-molecule sequencing reads, and the characterization of DNA synthesis rate heterogeneities. Copyright 2010 Elsevier Inc. All rights reserved.
Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic.
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy and reduce the required coverage of control data. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
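The idea of borrowing strength from historical data can be illustrated with a minimal Python sketch. This is not the seqPatch implementation: the function names, the simple normal-normal shrinkage formula, the use of log-transformed inter-pulse durations (IPDs) and the fixed `noise_var` parameter are all assumptions made here for illustration only.

```python
from statistics import mean, pvariance


def shrunken_expected_ipd(historical_means, control_ipds, noise_var):
    """Estimate the expected log-IPD for one sequence context (hypothetical helper).

    historical_means: mean log-IPD of this context in earlier datasets (the prior).
    control_ipds:     log-IPD values from the current control sample (may be empty).
    noise_var:        assumed per-read variance of log-IPD (an illustrative constant).
    """
    prior_mean = mean(historical_means)
    prior_var = pvariance(historical_means)
    n = len(control_ipds)
    # With no control reads (or a degenerate prior), rely on historical data alone.
    if n == 0 or prior_var == 0:
        return prior_mean
    # Normal-normal shrinkage: more control reads shift weight toward the control mean.
    weight = prior_var / (prior_var + noise_var / n)
    return weight * mean(control_ipds) + (1 - weight) * prior_mean


def kinetic_shift(native_ipds, expected_ipd):
    """Difference between the observed native mean log-IPD and the expected value."""
    return mean(native_ipds) - expected_ipd
```

In this sketch a large positive `kinetic_shift` at a position would flag a candidate modification; a real analysis would also need a proper significance test and context-specific variance estimates.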
Identification of restriction-modification systems of Bifidobacterium animalis subsp. lactis CNCM I-2494 by SMRT Sequencing and associated methylome analysis.
Bifidobacterium animalis subsp. lactis CNCM I-2494 is a component of a commercialized fermented dairy product for which beneficial effects on health have been studied by clinical and preclinical trials. To date, little is known about the molecular mechanisms that could explain the beneficial effects that bifidobacteria impart to the host. Restriction-modification (R-M) systems have been identified as key obstacles in the genetic accessibility of bifidobacteria, and circumventing these is a prerequisite to attaining a fundamental understanding of bifidobacterial attributes, including the genes that are responsible for health-promoting properties of this clinically and industrially important group of bacteria. The complete genome sequence of B. animalis subsp. lactis CNCM I-2494 is predicted to harbour the genetic determinants for two type II R-M systems, designated BanLI and BanLII. In order to investigate the functionality and specificity of these two putative R-M systems in B. animalis subsp. lactis CNCM I-2494, we employed PacBio SMRT sequencing with associated methylome analysis. In addition, the contribution of the identified R-M systems to the genetic accessibility of this strain was assessed.
We performed whole-genome analyses of DNA methylation in Shewanella oneidensis MR-1 to examine its possible role in regulating gene expression and other cellular processes. Single-molecule real-time (SMRT) sequencing revealed extensive methylation of adenine (N6mA) throughout the genome. These methylated bases were located in five sequence motifs, including three novel targets for type I restriction/modification enzymes. The sequence motifs targeted by putative methyltransferases were determined via SMRT sequencing of gene knockout mutants. In addition, we found that S. oneidensis MR-1 cultures grown under various culture conditions displayed different DNA methylation patterns. However, the small number of differentially methylated sites could not be directly linked to the much larger number of differentially expressed genes under these conditions, suggesting that DNA methylation is not a major regulator of gene expression in S. oneidensis MR-1. The enrichment of methylated GATC motifs in the origin of replication indicates that DNA methylation may regulate genome replication in a manner similar to that seen in Escherichia coli. Furthermore, comparative analyses suggest that many Gammaproteobacteria, including all members of the Shewanellaceae family, may also utilize DNA methylation to regulate genome replication.
DNA modifications, such as methylation, guide numerous critical biological processes, yet epigenetic information has not routinely been collected as part of DNA sequence analyses. Recently, the development of single molecule real time (SMRT) DNA sequencing has enabled detection of modified nucleotides (e.g. 6mA, 4mC, 5mC) in parallel with acquisition of primary sequence data, based on analysis of the kinetics of DNA synthesis reactions. In bacteria, genome-wide mapping of methylated and unmethylated loci is now feasible. This technological advance sets the stage for comprehensive, mechanistic assessment of the effects of bacterial DNA methyltransferases (MTases), which are ubiquitous, extremely diverse, and largely uncharacterized, on gene expression, chromosome structure, chromosome replication, and other fundamental biological processes. SMRT sequencing also enables detection of damaged DNA and has the potential to uncover novel DNA modifications. Copyright © 2013 Elsevier Ltd. All rights reserved.
Targeted genome editing with engineered nucleases has transformed the ability to introduce precise sequence modifications at almost any site within the genome. A major obstacle to probing the efficiency and consequences of genome editing is that no existing method enables the frequency of different editing events to be simultaneously measured across a cell population at any endogenous genomic locus. We have developed a novel method for quantifying individual genome editing outcomes at any site of interest using single molecule real time (SMRT) DNA sequencing. We show that this approach can be applied at various loci, using multiple engineered nuclease platforms including TALENs, RNA guided endonucleases (CRISPR/Cas9), and ZFNs, and in different cell lines to identify conditions and strategies in which the desired engineering outcome has occurred. This approach facilitates the evaluation of new gene editing technologies and permits sensitive quantification of editing outcomes in almost every experimental system used.
Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the successful utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.
qDNAmod: a statistical model-based tool to reveal intercellular heterogeneity of DNA modification from SMRT sequencing data.
In an isogenic cell population, phenotypic heterogeneity among individual cells is common and critical for survival of the population under different environmental conditions. DNA modification is an important epigenetic factor that can regulate phenotypic heterogeneity. The single molecule real-time (SMRT) sequencing technology provides a unique platform for detecting a wide range of DNA modifications, including N6-methyladenine (6-mA), N4-methylcytosine (4-mC) and 5-methylcytosine (5-mC). Here we present qDNAmod, a novel bioinformatic tool for genome-wide quantitative profiling of intercellular heterogeneity of DNA modification from SMRT sequencing data. It is capable of estimating the proportion of isogenic haploid cells in which the same loci of the genome are differentially modified. We tested the reliability of qDNAmod with the SMRT sequencing data of Streptococcus pneumoniae strain ST556. qDNAmod detected extensive intercellular heterogeneity of DNA methylation (6-mA) in a clonal population of ST556. Subsequent biochemical analyses revealed that the recognition sequences of two type I restriction–modification (R-M) systems are responsible for the intercellular heterogeneity of DNA methylation initially identified by qDNAmod. qDNAmod thus represents a valuable tool for studying intercellular phenotypic heterogeneity from genome-wide DNA modification.
DNA sequencing has provided a wealth of information about biological systems, but thus far has focused on the four canonical bases, and 5-methylcytosine through comparison of the genomic DNA sequence to a transformed four-base sequence obtained after treatment with bisulfite. However, numerous other chemical modifications to the nucleotides are known to control fundamental life functions, influence virulence of pathogens, and are associated with many diseases. These modifications cannot be accessed with traditional sequencing methods. In this opinion, we highlight several emerging single-molecule sequencing techniques that have the potential to directly detect many types of DNA modifications as an integral part of the sequencing protocol. Copyright © 2012 Elsevier Ltd. All rights reserved.
Determining the genomic sequences of microorganisms is the basis and prerequisite for understanding their biology and functional characterization. While the advent of low-cost, extremely high-throughput second-generation sequencing technologies and the parallel development of assembly algorithms have generated rapid and cost-effective genome assemblies, such assemblies are often unfinished, fragmented draft genomes as a result of short read lengths and long repeats present in multiple copies. Third-generation, PacBio sequencing technologies circumvented this problem by greatly increasing read length. Hybrid approaches including ALLPATHS-LG, PacBio corrected reads pipeline, SPAdes, and SSPACE-LongRead, and non-hybrid approaches, namely the hierarchical genome-assembly process (HGAP) and the PacBio corrected reads pipeline via self-correction, have therefore been proposed to utilize the PacBio long reads that can span many thousands of bases to facilitate the assembly of complete microbial genomes. However, standardized procedures that aim at evaluating and comparing these approaches are currently insufficient. To address the issue, we herein provide a comprehensive comparison by collecting datasets for the comparative assessment on the above-mentioned five assemblers. In addition to offering explicit and beneficial recommendations to practitioners, this study aims to aid in the design of a paradigm positioned to complete bacterial genome assembly.
Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS).
DNA methylation has essential roles in transcriptional regulation, imprinting, X chromosome inactivation and other cellular processes, and aberrant CpG methylation is directly involved in the pathogenesis of human imprinting disorders and many cancers. To address the need for a quantitative and highly multiplexed bisulfite sequencing method with long read lengths for targeted CpG methylation analysis, we developed single-molecule real-time bisulfite sequencing (SMRT-BS). Optimized bisulfite conversion and PCR conditions enabled the amplification of DNA fragments up to ~1.5 kb, and subjecting overlapping 625-1491 bp amplicons to SMRT-BS indicated high reproducibility across all amplicon lengths (r = 0.972) and low standard deviations (=0.10) between individual CpG sites sequenced in triplicate. Higher variability in CpG methylation quantitation was correlated with reduced sequencing depth, particularly for intermediately methylated regions. SMRT-BS was validated by orthogonal bisulfite-based microarray (r = 0.906; 42 CpG sites) and second generation sequencing (r = 0.933; 174 CpG sites); however, longer SMRT-BS amplicons (>1.0 kb) had reduced, but very acceptable, correlation with both orthogonal methods (r = 0.836-0.897 and r = 0.892-0.927, respectively) compared to amplicons less than ~1.0 kb (r = 0.940-0.951 and r = 0.948-0.963, respectively). Multiplexing utility was assessed by simultaneously subjecting four distinct CpG island amplicons (702-866 bp; 325 CpGs) and 30 hematological malignancy cell lines to SMRT-BS (average depth of 110X), which identified a spectrum of highly quantitative methylation levels across all interrogated CpG sites and cell lines. SMRT-BS is a novel, accurate and cost-effective targeted CpG methylation method that is amenable to a high degree of multiplexing with minimal clonal PCR artifacts. Increased sequencing depth is necessary when interrogating longer amplicons (>1.0 kb) and the previously reported bisulfite sequencing PCR bias towards unmethylated DNA should be considered when measuring intermediately methylated regions. Coupled with an optimized bisulfite PCR protocol, SMRT-BS is capable of interrogating ~1.5 kb amplicons, which theoretically can cover ~91% of CpG islands in the human genome.
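How a per-site CpG methylation level is derived from bisulfite-converted reads can be sketched in a few lines of Python. The function below is a hypothetical illustration, not the SMRT-BS pipeline: it assumes reads are already aligned to the forward strand of the amplicon without gaps, and it uses the standard bisulfite logic that a retained C at a CpG cytosine indicates methylation while a T indicates an unmethylated, converted base.

```python
def cpg_methylation_levels(reference, aligned_reads):
    """Return {position: fraction methylated} for each CpG cytosine.

    reference:     unconverted amplicon sequence (forward strand).
    aligned_reads: bisulfite-converted reads, gap-free and the same length
                   as the reference (a simplifying assumption).
    """
    levels = {}
    for pos in range(len(reference) - 1):
        if reference[pos:pos + 2] != "CG":
            continue
        # Keep only informative calls: C (methylated) or T (unmethylated).
        calls = [read[pos] for read in aligned_reads if read[pos] in ("C", "T")]
        if calls:
            levels[pos] = calls.count("C") / len(calls)
    return levels


reference = "ACGTCGA"
reads = ["ACGTTGA", "ATGTTGA", "ACGTCGA", "ACGTNGA"]
levels = cpg_methylation_levels(reference, reads)
```

Because each level is a ratio of read counts, its precision depends directly on the number of informative reads at that site, which is consistent with the higher variability reported above at reduced sequencing depth.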
Genome modification in Enterococcus faecalis OG1RF assessed by bisulfite sequencing and Single-Molecule Real-Time Sequencing.
Enterococcus faecalis is a Gram-positive bacterium that natively colonizes the human gastrointestinal tract and opportunistically causes life-threatening infections. Multidrug-resistant (MDR) E. faecalis strains have emerged, reducing treatment options for these infections. MDR E. faecalis strains have large genomes containing mobile genetic elements (MGEs) that harbor genes for antibiotic resistance and virulence determinants. Bacteria commonly possess genome defense mechanisms to block MGE acquisition, and we hypothesize that these mechanisms have been compromised in MDR E. faecalis. In restriction-modification (R-M) defense, the bacterial genome is methylated at cytosine (C) or adenine (A) residues by a methyltransferase (MTase), such that nonself DNA can be distinguished from self DNA. A cognate restriction endonuclease digests improperly modified nonself DNA. Little is known about R-M in E. faecalis. Here, we use genome resequencing to identify DNA modifications occurring in the oral isolate OG1RF. OG1RF has one of the smallest E. faecalis genomes sequenced to date and possesses few MGEs. Single-molecule real-time (SMRT) and bisulfite sequencing revealed that OG1RF has global 5-methylcytosine (m5C) methylation at 5′-GCWGC-3′ motifs. A type II R-M system confers the m5C modification, and disruption of this system impacts OG1RF electrotransformability and conjugative transfer of an antibiotic resistance plasmid. A second DNA MTase was poorly expressed under laboratory conditions but conferred global N(4)-methylcytosine (m4C) methylation at 5′-CCGG-3′ motifs when expressed in Escherichia coli. Based on our results, we conclude that R-M can act as a barrier to MGE acquisition and likely influences antibiotic resistance gene dissemination in the E. faecalis species.
The horizontal transfer of antibiotic resistance genes among bacteria is a critical public health concern. Enterococcus faecalis is an opportunistic pathogen that causes life-threatening infections in humans. Multidrug resistance acquired by horizontal gene transfer limits treatment options for these infections. In this study, we used innovative DNA sequencing methodologies to investigate how a model strain of E. faecalis discriminates its own DNA from foreign DNA, i.e., self versus nonself discrimination. We also assess the role of an E. faecalis genome modification system in modulating conjugative transfer of an antibiotic resistance plasmid. These results are significant because they demonstrate that differential genome modification impacts horizontal gene transfer frequencies in E. faecalis. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
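Locating candidate sites for degenerate recognition motifs such as 5′-GCWGC-3′ (W = A or T) and 5′-CCGG-3′ in a genome sequence can be sketched as follows. The helper name and the toy sequence are illustrative assumptions, not part of the study; the sketch expands IUPAC ambiguity codes into a regular expression and uses a lookahead so that overlapping sites are reported.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif_sites(sequence, motif):
    """Return 0-based start positions of every (possibly overlapping) motif match."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead lets the scan advance one base at a time.
    pattern = re.compile("(?=" + body + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]


toy_sequence = "GCAGCTGCCGG"  # made-up sequence for illustration
gcwgc_sites = find_motif_sites(toy_sequence, "GCWGC")
ccgg_sites = find_motif_sites(toy_sequence, "CCGG")
```

Both motifs are their own reverse complements, so scanning the forward strand alone finds the sites on both strands; a non-palindromic motif would also require a scan of the reverse complement.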