Our understanding of microbiology has evolved enormously over the last 150 years. Few institutions have witnessed our collective progress more closely than the National Collection of Type Cultures (NCTC). In fact, the collection itself is a record of the many milestones microbiologists have crossed, building on the discoveries of those who came before. To date, 60% of NCTC’s historic collection now has a closed, finished reference genome, thanks to PacBio Single Molecule, Real- Time (SMRT) Sequencing. We are excited to be their partner in crossing this latest milestone on their quest to improve human and animal health by understanding the microscopic world.
The UK’s National Collection of Type Cultures (NCTC) is a unique collection of more than 5,000 expertly preserved and authenticated bacterial cultures, many of historical significance. Founded in 1920, NCTC is the longest established collection of its type anywhere in the world, with a history of its own that has reflected — and contributed to — the evolution of microbiology for more than 100 years.
Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing.
Single-molecule real-time (SMRT) DNA sequencing allows the systematic detection of chemical modifications such as methylation but has not previously been applied on a genome-wide scale. We used this approach to detect 49,311 putative 6-methyladenine (m6A) residues and 1,407 putative 5-methylcytosine (m5C) residues in the genome of a pathogenic Escherichia coli strain. We obtained strand-specific information for methylation sites and a quantitative assessment of the frequency of methylation at each modified position. We deduced the sequence motifs recognized by the methyltransferase enzymes present in this strain without prior knowledge of their specificity. Furthermore, we found that deletion of a phage-encoded methyltransferase-endonuclease (restriction-modification; RM) system induced global transcriptional changes and led to gene amplification, suggesting that the role of RM systems extends beyond protecting host genomes from foreign DNA.
Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing.
DNA methylation is the most common form of DNA modification in prokaryotic and eukaryotic genomes. We have applied the method of single-molecule, real-time (SMRT) DNA sequencing that is capable of direct detection of modified bases at single-nucleotide resolution to characterize the specificity of several bacterial DNA methyltransferases (MTases). In addition to previously described SMRT sequencing of N6-methyladenine and 5-methylcytosine, we show that N4-methylcytosine also has a specific kinetic signature and is therefore identifiable using this approach. We demonstrate for all three prokaryotic methylation types that SMRT sequencing confirms the identity and position of the methylated base in cases where the MTase specificity was previously established by other methods. We then applied the method to determine the sequence context and methylated base identity for three MTases with unknown specificities. In addition, we also find evidence of unanticipated MTase promiscuity with some enzymes apparently also modifying sequences that are related, but not identical, to the cognate site.
With the price of next generation sequencing steadily decreasing, bacterial genome assembly is now accessible to a wide range of researchers. It is therefore necessary to understand the best methods for generating a genome assembly, specifically, which combination of sequencing and bioinformatics strategies result in the most accurate assemblies. Here, we sequence three E. coli strains on the Illumina MiSeq, Life Technologies Ion Torrent PGM, and Pacific Biosciences RS. We then perform genome assemblies on all three datasets alone or in combination to determine the best methods for the assembly of bacterial genomes.Three E. coli strains – BL21(DE3), Bal225, and DH5a – were sequenced to a depth of 100× on the MiSeq and Ion Torrent machines and to at least 125× on the PacBio RS. Four assembly methods were examined and compared. The previously published BL21(DE3) genome [GenBank:AM946981.2], allowed us to evaluate the accuracy of each of the BL21(DE3) assemblies. BL21(DE3) PacBio-only assemblies resulted in a 90% reduction in contigs versus short read only assemblies, while N50 numbers increased by over 7-fold. Strikingly, the number of SNPs in PacBio-only assemblies were less than half that seen with short read assemblies (~20 SNPs vs. ~50 SNPs) and indels also saw dramatic reductions (~2 indel >5 bp in PacBio-only assemblies vs. ~12 for short-read only assemblies). Assemblies that used a mixture of PacBio and short read data generally fell in between these two extremes. Use of PacBio sequencing reads also allowed us to call covalent base modifications for the three strains. Each of the strains used here had a known covalent base modification genotype, which was confirmed by PacBio sequencing.Using data generated solely from the Pacific Biosciences RS, we were able to generate the most complete and accurate de novo assemblies of E. coli strains. We found that the addition of other sequencing technology data offered no improvements over use of PacBio data alone. In addition, the sequencing data from the PacBio RS allowed for sensitive and specific calling of covalent base modifications.
Heterogeneity is a ubiquitous feature of biological systems. A complete understanding of such systems requires a method for uniquely identifying and tracking individual components and their interactions with each other. We have developed a novel method of uniquely tagging individual cells in vivo with a genetic ‘barcode’ that can be recovered by DNA sequencing. Our method is a two-component system comprised of a genetic barcode cassette whose fragments are shuffled by Rci, a site-specific DNA invertase. The system is highly scalable, with the potential to generate theoretical diversities in the billions. We demonstrate the feasibility of this technique in Escherichia coli. Currently, this method could be employed to track the dynamics of populations of microbes through various bottlenecks. Advances of this method should prove useful in tracking interactions of cells within a network, and/or heterogeneity within complex biological samples.© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases.
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
Third generation single molecule sequencing technology is poised to revolutionize genomics by en- abling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
A large outbreak of diarrhea and the hemolytic-uremic syndrome caused by an unusual serotype of Shiga-toxin-producing Escherichia coli (O104:H4) began in Germany in May 2011. As of July 22, a large number of cases of diarrhea caused by Shiga-toxin-producing E. coli have been reported–3167 without the hemolytic-uremic syndrome (16 deaths) and 908 with the hemolytic-uremic syndrome (34 deaths)–indicating that this strain is notably more virulent than most of the Shiga-toxin-producing E. coli strains. Preliminary genetic characterization of the outbreak strain suggested that, unlike most of these strains, it should be classified within the enteroaggregative pathotype of E. coli.We used third-generation, single-molecule, real-time DNA sequencing to determine the complete genome sequence of the German outbreak strain, as well as the genome sequences of seven diarrhea-associated enteroaggregative E. coli serotype O104:H4 strains from Africa and four enteroaggregative E. coli reference strains belonging to other serotypes. Genomewide comparisons were performed with the use of these enteroaggregative E. coli genomes, as well as those of 40 previously sequenced E. coli isolates.The enteroaggregative E. coli O104:H4 strains are closely related and form a distinct clade among E. coli and enteroaggregative E. coli strains. However, the genome of the German outbreak strain can be distinguished from those of other O104:H4 strains because it contains a prophage encoding Shiga toxin 2 and a distinct set of additional virulence and antibiotic-resistance factors.Our findings suggest that horizontal genetic exchange allowed for the emergence of the highly virulent Shiga-toxin-producing enteroaggregative E. coli O104:H4 strain that caused the German outbreak. More broadly, these findings highlight the way in which the plasticity of bacterial genomes facilitates the emergence of new pathogens.
The complete genome sequence of Escherichia coli EC958: a high quality reference sequence for the globally disseminated multidrug resistant E. coli O25b:H4-ST131 clone.
Escherichia coli ST131 is now recognised as a leading contributor to urinary tract and bloodstream infections in both community and clinical settings. Here we present the complete, annotated genome of E. coli EC958, which was isolated from the urine of a patient presenting with a urinary tract infection in the Northwest region of England and represents the most well characterised ST131 strain. Sequencing was carried out using the Pacific Biosciences platform, which provided sufficient depth and read-length to produce a complete genome without the need for other technologies. The discovery of spurious contigs within the assembly that correspond to site-specific inversions in the tail fibre regions of prophages demonstrates the potential for this technology to reveal dynamic evolutionary mechanisms. E. coli EC958 belongs to the major subgroup of ST131 strains that produce the CTX-M-15 extended spectrum ß-lactamase, are fluoroquinolone resistant and encode the fimH30 type 1 fimbrial adhesin. This subgroup includes the Indian strain NA114 and the North American strain JJ1886. A comparison of the genomes of EC958, JJ1886 and NA114 revealed that differences in the arrangement of genomic islands, prophages and other repetitive elements in the NA114 genome are not biologically relevant and are due to misassembly. The availability of a high quality uropathogenic E. coli ST131 genome provides a reference for understanding this multidrug resistant pathogen and will facilitate novel functional, comparative and clinical studies of the E. coli ST131 clonal lineage.
Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic.
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction, allowing for the possibility of estimating the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy, and reduce requirement of control data coverage. For some DNA modifications that have a strong signal, a control sample is not even needed by using historical data as alternative to control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in a R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
As resequencing projects become more prevalent across a larger number of species, accurate variant identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies provide high-resolution access to structural variants often inaccessible to shorter reads.We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long reads (>10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two genomic features in a strain of Escherichia coli with PBHoney and validate them via de novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/.Implementing two variant-identification approaches that exploit the high mappability of long reads, PBHoney is demonstrated as being effective at detecting larger structural variants using whole-genome Pacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the existence of Rac-Phage in isolate; evidence of E. coli’s circular genome.
Single-molecule sequencing to track plasmid diversity of hospital-associated carbapenemase-producing Enterobacteriaceae.
Public health officials have raised concerns that plasmid transfer between Enterobacteriaceae species may spread resistance to carbapenems, an antibiotic class of last resort, thereby rendering common health care-associated infections nearly impossible to treat. To determine the diversity of carbapenemase-encoding plasmids and assess their mobility among bacterial species, we performed comprehensive surveillance and genomic sequencing of carbapenem-resistant Enterobacteriaceae in the National Institutes of Health (NIH) Clinical Center patient population and hospital environment. We isolated a repertoire of carbapenemase-encoding Enterobacteriaceae, including multiple strains of Klebsiella pneumoniae, Klebsiella oxytoca, Escherichia coli, Enterobacter cloacae, Citrobacter freundii, and Pantoea species. Long-read genome sequencing with full end-to-end assembly revealed that these organisms carry the carbapenem resistance genes on a wide array of plasmids. K. pneumoniae and E. cloacae isolated simultaneously from a single patient harbored two different carbapenemase-encoding plasmids, indicating that plasmid transfer between organisms was unlikely within this patient. We did, however, find evidence of horizontal transfer of carbapenemase-encoding plasmids between K. pneumoniae, E. cloacae, and C. freundii in the hospital environment. Our data, including full plasmid identification, challenge assumptions about horizontal gene transfer events within patients and identify possible connections between patients and the hospital environment. In addition, we identified a new carbapenemase-encoding plasmid of potentially high clinical impact carried by K. pneumoniae, E. coli, E. cloacae, and Pantoea species, in unrelated patients and in the hospital environment. Copyright © 2014, American Association for the Advancement of Science.
Vertical transmission of highly similar bla CTX-M-1-harboring IncI1 plasmids in Escherichia coli with different MLST types in the poultry production pyramid.
The purpose of this study was to characterize sets of extended-spectrum ß-lactamases (ESBL)-producing Enterobacteriaceae collected longitudinally from different flocks of broiler breeders, meconium of 1-day-old broilers from theses breeder flocks, as well as from these broiler flocks before slaughter.Five sets of ESBL-producing Escherichia coli were studied by multi-locus sequence typing (MLST), phylogenetic grouping, PCR-based replicon typing and resistance profiling. The bla CTX-M-1-harboring plasmids of one set (pHV295.1, pHV114.1, and pHV292.1) were fully sequenced and subjected to comparative analysis.Eleven different MLST sequence types (ST) were identified with ST1056 the predominant one, isolated in all five sets either on the broiler breeder or meconium level. Plasmid sequencing revealed that bla CTX-M-1 was carried by highly similar IncI1/ST3 plasmids that were 105 076 bp, 110 997 bp, and 117 269 bp in size, respectively.The fact that genetically similar IncI1/ST3 plasmids were found in ESBL-producing E. coli of different MLST types isolated at the different levels in the broiler production pyramid provides strong evidence for a vertical transmission of these plasmids from a common source (nucleus poultry flocks).
Shigellosis (previously bacillary dysentery) was the primary diarrhoeal disease of World War 1, but outbreaks still occur in military operations, and shigellosis causes hundreds of thousands of deaths per year in developing nations. We aimed to generate a high-quality reference genome of the historical Shigella flexneri isolate NCTC1 and to examine the isolate for resistance to antimicrobials.In this genomic analysis, we sequenced the oldest extant Shigella flexneri serotype 2a isolate using single-molecule real-time (SMRT) sequencing technology. Isolated from a soldier with dysentery from the British forces fighting on the Western Front in World War 1, this bacterium, NCTC1, was the first isolate accessioned into the National Collection of Type Cultures. We created a reference sequence for NCTC1, investigated the isolate for antimicrobial resistance, and undertook comparative genetics with S flexneri reference strains isolated during the 100 years since World War 1.We discovered that NCTC1 belonged to a 2a lineage of S flexneri, with which it shares common characteristics and a large core genome. NCTC1 was resistant to penicillin and erythromycin, and contained a complement of chromosomal antimicrobial resistance genes similar to that of more recent isolates. Genomic islands gained in the S flexneri 2a lineage over time were predominately associated with additional antimicrobial resistances, virulence, and serotype conversion.This S flexneri 2a lineage is a well adapted pathogen that has continued to respond to selective pressures. We have created a valuable historical benchmark for shigellae in the form of a high-quality reference sequence for a publicly available isolate.The Wellcome Trust. Copyright © 2014 Baker et al. Open Access article distributed under the terms of CC BY. Published by Elsevier Ltd. All rights reserved.