Next-generation sequencing has become the most widely used sequencing technology in genomics research, but it has inherent drawbacks when dealing with high-GC content genomes. Recently, single-molecule real-time sequencing technology (SMRT) was introduced as a third-generation sequencing strategy to compensate for this drawback. Here, we report that the unbiased and longer read length of SMRT sequencing markedly improved genome assembly with high GC content via gap filling and repeat resolution.
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by first- and second-generation sequencing technologies. Because the platform is new, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning the Circular Consensus Sequence (CCS) reads derived from this experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are already error-corrected, it is still necessary to perform appropriate QC on them in order to produce successful downstream bioinformatics results.
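The median error rate quoted above can be illustrated with a minimal Python sketch. It assumes each read has already been aligned to its reference and is given as a pair of gapped strings of equal length, and it counts every mismatch, insertion, or deletion column as one error. The abstract does not state how the study defined the error rate, so the definition, the function names, and the toy alignments below are all assumptions made for illustration.

```python
from statistics import median


def error_rate(aligned_read: str, aligned_ref: str) -> float:
    """Fraction of alignment columns where read and reference differ.

    Both strings are gapped ('-' marks an insertion or deletion), so a
    differing column is a mismatch, an insertion, or a deletion.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    if not aligned_read:
        raise ValueError("empty alignment")
    errors = sum(1 for r, t in zip(aligned_read, aligned_ref) if r != t)
    return errors / len(aligned_read)


def median_error_rate(alignments) -> float:
    """Median of the per-read error rates over (read, reference) pairs."""
    return median(error_rate(read, ref) for read, ref in alignments)


# Hypothetical toy alignments, not data from the study.
alignments = [
    ("ACGTACGTAC", "ACGTACGTAC"),  # identical: 0 errors in 10 columns
    ("ACGTAC-TAC", "ACGTACGTAC"),  # one deletion: 1 error in 10 columns
    ("ACGAACGTTC", "ACGTACGTAC"),  # two mismatches: 2 errors in 10 columns
]

print(median_error_rate(alignments))
```

Using the median rather than the mean keeps a few very poor reads from dominating the summary, which is one plausible reason to report it for a QC comparison.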
First- and second-generation sequencing technologies have led the way in revolutionizing the field of genomics and beyond, motivating an astonishing number of scientific advances, including enabling a more complete understanding of whole genome sequences and the information encoded therein, a more complete characterization of the methylome and transcriptome and a better understanding of interactions between proteins and DNA. Nevertheless, there are sequencing applications and aspects of genome biology that are presently beyond the reach of current sequencing technologies, leaving fertile ground for additional innovation in this space. In this review, we describe a new generation of single-molecule sequencing technologies (third-generation sequencing) that is emerging to fill this space, with the potential for dramatically longer read lengths, shorter time to result and lower overall cost.
Preparation of next-generation DNA sequencing libraries from ultra-low amounts of input DNA: Application to single-molecule, real-time (SMRT) sequencing on the Pacific Biosciences RS II.
We have developed and validated an amplification-free method for generating DNA sequencing libraries from very low amounts of input DNA (500 picograms – 20 nanograms) for single-molecule sequencing on the Pacific Biosciences (PacBio) RS II sequencer. The common challenge of high input requirements for single-molecule sequencing is overcome by using a carrier DNA in conjunction with optimized sequencing preparation conditions and re-use of the MagBead-bound complex. Here we describe how this method can be used to produce sequencing yields comparable to those generated from standard input amounts, but by using 1000-fold less starting material.
Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic.
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model can greatly increase DNA modification detection accuracy and reduce the required coverage of control data. For some DNA modifications that have a strong signal, a control sample is not even needed, because historical data can serve as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
Identification of restriction-modification systems of Bifidobacterium animalis subsp. lactis CNCM I-2494 by SMRT Sequencing and associated methylome analysis.
Bifidobacterium animalis subsp. lactis CNCM I-2494 is a component of a commercialized fermented dairy product for which beneficial effects on health have been studied in clinical and preclinical trials. To date, little is known about the molecular mechanisms that could explain the beneficial effects that bifidobacteria impart to the host. Restriction-modification (R-M) systems have been identified as key obstacles in the genetic accessibility of bifidobacteria, and circumventing these is a prerequisite to attaining a fundamental understanding of bifidobacterial attributes, including the genes that are responsible for health-promoting properties of this clinically and industrially important group of bacteria. The complete genome sequence of B. animalis subsp. lactis CNCM I-2494 is predicted to harbour the genetic determinants for two type II R-M systems, designated BanLI and BanLII. In order to investigate the functionality and specificity of these two putative R-M systems in B. animalis subsp. lactis CNCM I-2494, we employed PacBio SMRT sequencing with associated methylome analysis. In addition, the contribution of the identified R-M systems to the genetic accessibility of this strain was assessed.
As resequencing projects become more prevalent across a larger number of species, accurate variant identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies provide high-resolution access to structural variants often inaccessible to shorter reads. We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long reads (>10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two genomic features in a strain of Escherichia coli with PBHoney and validate them via de novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/. Implementing two variant-identification approaches that exploit the high mappability of long reads, PBHoney is shown to be effective at detecting larger structural variants using whole-genome Pacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the existence of Rac-Phage in the isolate and evidence of E. coli’s circular genome.
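The idea of using soft-clipped tails as evidence can be sketched in Python. The sketch reads a SAM CIGAR string and reports soft-clipped segments (operation `S`) at either end of an alignment that reach a minimum length. This is not PBHoney's implementation: the function name, the `min_tail` threshold of 200 bases, and the example CIGAR strings are hypothetical, and only the CIGAR operation letters come from the SAM format.

```python
import re

# One CIGAR element: a length followed by a SAM operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_tails(cigar: str, min_tail: int = 200) -> dict:
    """Return soft-clipped tail lengths at the ends of an alignment.

    Keys are 'left' and/or 'right'; a tail is reported only when it is
    at least `min_tail` bases long (an illustrative threshold).
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can flank soft clips, so ignore them when looking
    # for the outermost soft-clipped segments.
    core = [(length, op) for length, op in ops if op != "H"]
    tails = {}
    if core and core[0][1] == "S" and core[0][0] >= min_tail:
        tails["left"] = core[0][0]
    if len(core) > 1 and core[-1][1] == "S" and core[-1][0] >= min_tail:
        tails["right"] = core[-1][0]
    return tails


# Hypothetical CIGAR strings for long reads.
for cigar in ("350S4000M20I1500M", "50S4000M600S", "4000M"):
    print(cigar, soft_clipped_tails(cigar))
```

A long unaligned tail suggests that the read crosses a breakpoint. Tails from many reads that clip at nearby reference positions would then be candidates for realignment elsewhere in the genome.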
Pacific Biosciences technology offers a fundamentally new data type with the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects. We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis. Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.
Deletion of tumor-suppressor genes as well as other genomic rearrangements pervade cancer genomes across numerous types of solid tumor and hematologic malignancies. However, even for a specific rearrangement, the breakpoints may vary between individuals, such as the recurrent CDKN2A deletion. Characterizing the exact breakpoints for structural variants (SVs) is useful for designating patient-specific tumor biomarkers. We propose AmBre (Amplification of Breakpoints), a method to target SV breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, AmBre validates SVs called by whole-exome/genome sequencing and hybridization arrays. AmBre involves a PCR-based approach to amplify the DNA segment containing an SV’s breakpoint and then confirms breakpoints using sequencing by Pacific Biosciences RS. To amplify breakpoints with PCR, primers tiling specified target regions are carefully selected with a simulated annealing algorithm to minimize off-target amplification and maximize efficiency at capturing all possible breakpoints within the target regions. To confirm correct amplification and obtain breakpoints, PCR amplicons are combined without barcoding and simultaneously long-read sequenced using a single SMRT cell. Our algorithm efficiently separates reads based on breakpoints. Each read group supporting the same breakpoint corresponds to an amplicon, and a consensus amplicon sequence is called. AmBre was used to discover CDKN2A deletion breakpoints in cancer cell lines: A549, CEM, Detroit562, MOLT4, MCF7, and T98G. Also, we successfully assayed RUNX1-RUNX1T1 reciprocal translocations by finding both breakpoints in the Kasumi-1 cell line. AmBre successfully targets SVs where DNA harboring the breakpoints is present in 1:1000 mixtures.
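Primer selection by simulated annealing can be illustrated with a generic Python sketch. It is not AmBre's algorithm or cost function. It assumes a toy model in which candidate primer sites sit every 25 bp in a 1,000 bp target region, each site carries a made-up off-target score, and the cost combines the largest uncovered gap with the summed off-target scores. All names, weights, and parameters below are hypothetical.

```python
import math
import random

REGION = 1000  # toy target region length in bp (assumption)
K = 8          # number of primer sites to select (assumption)

# Hypothetical candidate sites with made-up off-target scores in [0, 1).
_score_rng = random.Random(1)
CANDIDATES = {pos: _score_rng.random() for pos in range(0, REGION, 25)}


def cost(selection):
    """Lower is better: penalize large untiled gaps and off-target risk."""
    points = [0, *sorted(selection), REGION]
    largest_gap = max(b - a for a, b in zip(points, points[1:]))
    off_target = sum(CANDIDATES[p] for p in selection)
    return largest_gap / REGION + 0.1 * off_target


def neighbor(selection, rng):
    """Swap one selected site for a currently unused candidate site."""
    chosen = list(selection)
    unused = [p for p in CANDIDATES if p not in chosen]
    chosen[rng.randrange(len(chosen))] = rng.choice(unused)
    return tuple(sorted(chosen))


def anneal(initial, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing with geometric cooling; returns the best state seen."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


initial = tuple(sorted(CANDIDATES)[:K])  # deliberately poor start: all sites at one end
best, best_cost = anneal(initial)
print("initial cost:", round(cost(initial), 3))
print("best cost:   ", round(best_cost, 3), "sites:", best)
```

Accepting some worse moves early lets the search escape poor local arrangements of primers. Because the best state seen is tracked separately, the returned cost is never higher than the starting cost.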
Next-generation sequencing (NGS) technologies enable new insights into the diversity of virus populations within their hosts. Diversity estimation is currently restricted to single-nucleotide variants or to local fragments of no more than a few hundred nucleotides defined by the length of sequence reads. To study complex heterogeneous virus populations comprehensively, novel methods are required that allow for complete reconstruction of the individual viral haplotypes. Here, we show that assembly of whole viral genomes of ~8600 nucleotides length is feasible from mixtures of heterogeneous HIV-1 strains derived from defined combinations of cloned virus strains and from clinical samples of an HIV-1 superinfected individual. Haplotype reconstruction was achieved using optimized experimental protocols and computational methods for amplification, sequencing and assembly. We comparatively assessed the performance of the three NGS platforms 454 Life Sciences/Roche, Illumina and Pacific Biosciences for this task. Our results prove and delineate the feasibility of NGS-based full-length viral haplotype reconstruction and provide new tools for studying evolution and pathogenesis of viruses. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the successful utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.
DNA sequencing has provided a wealth of information about biological systems, but thus far has focused on the four canonical bases, and 5-methylcytosine through comparison of the genomic DNA sequence to a transformed four-base sequence obtained after treatment with bisulfite. However, numerous other chemical modifications to the nucleotides are known to control fundamental life functions, influence virulence of pathogens, and are associated with many diseases. These modifications cannot be accessed with traditional sequencing methods. In this opinion, we highlight several emerging single-molecule sequencing techniques that have the potential to directly detect many types of DNA modifications as an integral part of the sequencing protocol. Copyright © 2012 Elsevier Ltd. All rights reserved.
Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.
Staphylococcus aureus has evolved as a pathogen that causes a range of diseases in humans. There are two dominant modes of evolution thought to explain most of the virulence differences between strains. First, virulence genes may be acquired from other organisms. Second, mutations may cause changes in the regulation and expression of genes. Here we describe an evolutionary event in which transposition of an IS element has a direct impact on virulence gene regulation resulting in hypervirulence. Whole-genome analysis of a methicillin-resistant S. aureus (MRSA) strain USA500 revealed acquisition of a transposable element (IS256) that is absent from close relatives of this strain. Of the multiple copies of IS256 found in the USA500 genome, one was inserted in the promoter sequence of repressor of toxins (Rot), a master transcriptional regulator responsible for the expression of virulence factors in S. aureus. We show that insertion into the rot promoter by IS256 results in the derepression of cytotoxin expression and increased virulence. Taken together, this work provides new insight into evolutionary strategies by which S. aureus is able to modify its virulence properties and demonstrates a novel mechanism by which horizontal gene transfer directly impacts virulence through altering toxin regulation. © 2014 John Wiley & Sons Ltd.