Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing.
Single-molecule real-time (SMRT) DNA sequencing allows the systematic detection of chemical modifications such as methylation but has not previously been applied on a genome-wide scale. We used this approach to detect 49,311 putative 6-methyladenine (m6A) residues and 1,407 putative 5-methylcytosine (m5C) residues in the genome of a pathogenic Escherichia coli strain. We obtained strand-specific information for methylation sites and a quantitative assessment of the frequency of methylation at each modified position. We deduced the sequence motifs recognized by the methyltransferase enzymes present in this strain without prior knowledge of their specificity. Furthermore, we found that deletion of a phage-encoded methyltransferase-endonuclease (restriction-modification; RM) system induced global transcriptional changes and led to gene amplification, suggesting that the role of RM systems extends beyond protecting host genomes from foreign DNA.
Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing.
DNA methylation is the most common form of DNA modification in prokaryotic and eukaryotic genomes. We have applied the method of single-molecule, real-time (SMRT) DNA sequencing that is capable of direct detection of modified bases at single-nucleotide resolution to characterize the specificity of several bacterial DNA methyltransferases (MTases). In addition to previously described SMRT sequencing of N6-methyladenine and 5-methylcytosine, we show that N4-methylcytosine also has a specific kinetic signature and is therefore identifiable using this approach. We demonstrate for all three prokaryotic methylation types that SMRT sequencing confirms the identity and position of the methylated base in cases where the MTase specificity was previously established by other methods. We then applied the method to determine the sequence context and methylated base identity for three MTases with unknown specificities. In addition, we also find evidence of unanticipated MTase promiscuity with some enzymes apparently also modifying sequences that are related, but not identical, to the cognate site.
With the price of next generation sequencing steadily decreasing, bacterial genome assembly is now accessible to a wide range of researchers. It is therefore necessary to understand the best methods for generating a genome assembly, specifically, which combination of sequencing and bioinformatics strategies results in the most accurate assemblies. Here, we sequence three E. coli strains on the Illumina MiSeq, Life Technologies Ion Torrent PGM, and Pacific Biosciences RS. We then perform genome assemblies on all three datasets alone or in combination to determine the best methods for the assembly of bacterial genomes. Three E. coli strains, BL21(DE3), Bal225, and DH5α, were sequenced to a depth of 100× on the MiSeq and Ion Torrent machines and to at least 125× on the PacBio RS. Four assembly methods were examined and compared. The previously published BL21(DE3) genome [GenBank:AM946981.2] allowed us to evaluate the accuracy of each of the BL21(DE3) assemblies. BL21(DE3) PacBio-only assemblies resulted in a 90% reduction in contigs versus short-read-only assemblies, while N50 numbers increased by over 7-fold. Strikingly, the number of SNPs in PacBio-only assemblies was less than half that seen with short-read assemblies (~20 SNPs vs. ~50 SNPs), and indels also saw dramatic reductions (~2 indels >5 bp in PacBio-only assemblies vs. ~12 for short-read-only assemblies). Assemblies that used a mixture of PacBio and short-read data generally fell in between these two extremes. Use of PacBio sequencing reads also allowed us to call covalent base modifications for the three strains. Each of the strains used here had a known covalent base modification genotype, which was confirmed by PacBio sequencing. Using data generated solely from the Pacific Biosciences RS, we were able to generate the most complete and accurate de novo assemblies of E. coli strains. We found that the addition of other sequencing technology data offered no improvements over use of PacBio data alone. 
In addition, the sequencing data from the PacBio RS allowed for sensitive and specific calling of covalent base modifications.
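The abstract above compares assemblies by contig count and N50. As a reminder of what N50 measures, here is a minimal Python sketch, assuming the conventional definition (the length of the contig at which the running total of contigs, sorted longest first, reaches half of the total assembly length). The function name and the contig lengths are hypothetical illustrations, not data or code from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp.

    Conventional definition (an assumption here, the abstract does not
    spell it out): sort contigs longest first and return the length of
    the contig at which the cumulative sum reaches half the total.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, chosen only to show how fewer, longer
# contigs raise N50 for the same total assembly size.
fragmented = [400, 300, 200, 100]
contiguous = [900, 100]
print(n50(fragmented), n50(contiguous))
```

Both toy assemblies have the same total length, so the difference in N50 reflects contiguity alone, which is the sense in which the abstract reports a more than 7-fold increase.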
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from this sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are already error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results.
Heterogeneity is a ubiquitous feature of biological systems. A complete understanding of such systems requires a method for uniquely identifying and tracking individual components and their interactions with each other. We have developed a novel method of uniquely tagging individual cells in vivo with a genetic ‘barcode’ that can be recovered by DNA sequencing. Our method is a two-component system comprised of a genetic barcode cassette whose fragments are shuffled by Rci, a site-specific DNA invertase. The system is highly scalable, with the potential to generate theoretical diversities in the billions. We demonstrate the feasibility of this technique in Escherichia coli. Currently, this method could be employed to track the dynamics of populations of microbes through various bottlenecks. Advances of this method should prove useful in tracking interactions of cells within a network, and/or heterogeneity within complex biological samples. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases.
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
The human fragile X mental retardation 1 (FMR1) gene contains a (CGG)n trinucleotide repeat in its 5′ untranslated region (5′ UTR). Expansions of this repeat result in a number of clinical disorders with distinct molecular pathologies, including fragile X syndrome (FXS; full mutation range, greater than 200 CGG repeats) and fragile X-associated tremor/ataxia syndrome (FXTAS; premutation range, 55-200 repeats). Study of these diseases has been limited by an inability to sequence expanded CGG repeats, particularly in the full mutation range, with existing DNA sequencing technologies. Single-molecule, real-time (SMRT) sequencing provides an approach to sequencing that is fundamentally different from other “next-generation” sequencing platforms, and is well suited for long, repetitive DNA sequences. We report the first sequence data for expanded CGG-repeat FMR1 alleles in the full mutation range that reveal the confounding effects of CGG-repeat tracts on both cloning and PCR. A unique feature of SMRT sequencing is its ability to yield real-time information on the rates of nucleoside addition by the tethered DNA polymerase; for the CGG-repeat alleles, we find a strand-specific effect of CGG-repeat DNA on the interpulse distance. This kinetic signature reveals a novel aspect of the repeat element; namely, that the particular G bias within the CGG/CCG-repeat element influences polymerase activity in a manner that extends beyond simple nearest-neighbor effects. These observations provide a baseline for future kinetic studies of repeat elements, as well as for studies of epigenetic and other chemical modifications thereof.
The complete genome sequence of Escherichia coli EC958: a high quality reference sequence for the globally disseminated multidrug resistant E. coli O25b:H4-ST131 clone.
Escherichia coli ST131 is now recognised as a leading contributor to urinary tract and bloodstream infections in both community and clinical settings. Here we present the complete, annotated genome of E. coli EC958, which was isolated from the urine of a patient presenting with a urinary tract infection in the Northwest region of England and represents the most well characterised ST131 strain. Sequencing was carried out using the Pacific Biosciences platform, which provided sufficient depth and read-length to produce a complete genome without the need for other technologies. The discovery of spurious contigs within the assembly that correspond to site-specific inversions in the tail fibre regions of prophages demonstrates the potential for this technology to reveal dynamic evolutionary mechanisms. E. coli EC958 belongs to the major subgroup of ST131 strains that produce the CTX-M-15 extended spectrum ß-lactamase, are fluoroquinolone resistant and encode the fimH30 type 1 fimbrial adhesin. This subgroup includes the Indian strain NA114 and the North American strain JJ1886. A comparison of the genomes of EC958, JJ1886 and NA114 revealed that differences in the arrangement of genomic islands, prophages and other repetitive elements in the NA114 genome are not biologically relevant and are due to misassembly. The availability of a high quality uropathogenic E. coli ST131 genome provides a reference for understanding this multidrug resistant pathogen and will facilitate novel functional, comparative and clinical studies of the E. coli ST131 clonal lineage.
Comprehensive methylome characterization of Mycoplasma genitalium and Mycoplasma pneumoniae at single-base resolution.
In the bacterial world, methylation is most commonly associated with restriction-modification systems that provide a defense mechanism against invading foreign genomes. In addition, it is known that methylation plays functionally important roles, including timing of DNA replication, chromosome partitioning, DNA repair, and regulation of gene expression. However, full DNA methylome analyses are scarce due to a lack of a simple methodology for rapid and sensitive detection of common epigenetic marks (i.e., N(6)-methyladenine (6mA) and N(4)-methylcytosine (4mC)) in these organisms. Here, we use Single-Molecule Real-Time (SMRT) sequencing to determine the methylomes of two related human pathogen species, Mycoplasma genitalium G-37 and Mycoplasma pneumoniae M129, with single-base resolution. Our analysis identified two new methylation motifs not previously described in bacteria: a widespread 6mA methylation motif common to both bacteria (5′-CTAT-3′), as well as a more complex Type I 6mA sequence motif in M. pneumoniae (5′-GAN(7)TAY-3′/3′-CTN(7)ATR-5′). We identify the methyltransferase responsible for the common motif and suggest the one involved in M. pneumoniae only. Analysis of the distribution of methylation sites across the genome of M. pneumoniae suggests a potential role for methylation in regulating the cell cycle, as well as in regulation of gene expression. To our knowledge, this is one of the first direct methylome profiling studies with single-base resolution from a bacterial organism.
Polymorphic microsatellite markers for a wind-dispersed tropical tree species, Triplaris cumingiana (Polygonaceae).
Novel microsatellite markers were characterized in the wind-dispersed and dioecious neotropical tree Triplaris cumingiana (Polygonaceae) for use in understanding the ecological processes and genetic impacts of pollen- and seed-mediated gene flow in tropical forests. Sixty-two microsatellite primer pairs were screened, from which 12 markers showing five or more alleles per locus (range 5-17) were tested on 47 individuals. Observed and expected heterozygosities averaged 0.692 and 0.731, respectively. Polymorphism information content was between 0.417 and 0.874. Linkage disequilibrium was observed in one of the 66 pairwise comparisons between loci. Two loci showed deviation from Hardy-Weinberg equilibrium. An additional 14 markers exhibiting lower polymorphism were characterized on a smaller number of individuals. These microsatellite markers have high levels of polymorphism and reproducibility and will be useful in studying gene flow and population structure in T. cumingiana.
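The per-locus statistics reported above (expected heterozygosity, polymorphism information content, and the number of pairwise locus comparisons) follow from allele frequencies by standard formulas. The Python sketch below is illustrative only: it assumes the textbook expected heterozygosity, 1 minus the sum of squared allele frequencies, without any small-sample correction, and the Botstein et al. form of PIC. The abstract does not state which estimators the authors used, and the function names and allele frequencies are hypothetical.

```python
from itertools import combinations
from math import comb


def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 - sum(p_i^2), no sample-size correction."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical allele frequencies for one five-allele locus
# (not values from the study).
locus = [0.4, 0.3, 0.15, 0.1, 0.05]
print(expected_heterozygosity(locus))
print(polymorphism_information_content(locus))

# 12 loci give 12 choose 2 pairwise linkage-disequilibrium tests,
# which matches the 66 comparisons reported in the abstract.
print(comb(12, 2))
```

PIC is never larger than expected heterozygosity for the same frequencies, since it subtracts an additional non-negative term.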
Deletion of tumor-suppressor genes as well as other genomic rearrangements pervade cancer genomes across numerous types of solid tumor and hematologic malignancies. However, even for a specific rearrangement, such as the recurrent CDKN2A deletion, the breakpoints may vary between individuals. Characterizing the exact breakpoints for structural variants (SVs) is useful for designating patient-specific tumor biomarkers. We propose AmBre (Amplification of Breakpoints), a method to target SV breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, AmBre validates SVs called by whole-exome/genome sequencing and hybridization arrays. AmBre involves a PCR-based approach to amplify the DNA segment containing an SV’s breakpoint and then confirms breakpoints using sequencing by Pacific Biosciences RS. To amplify breakpoints with PCR, primers tiling specified target regions are carefully selected with a simulated annealing algorithm to minimize off-target amplification and maximize efficiency at capturing all possible breakpoints within the target regions. To confirm correct amplification and obtain breakpoints, PCR amplicons are combined without barcoding and simultaneously long-read sequenced using a single SMRT cell. Our algorithm efficiently separates reads based on breakpoints. Each read group supporting the same breakpoint corresponds to an amplicon, and a consensus amplicon sequence is called. AmBre was used to discover CDKN2A deletion breakpoints in cancer cell lines: A549, CEM, Detroit562, MOLT4, MCF7, and T98G. Also, we successfully assayed RUNX1-RUNX1T1 reciprocal translocations by finding both breakpoints in the Kasumi-1 cell line. AmBre successfully targets SVs where DNA harboring the breakpoints is present in 1:1000 mixtures.
Targeted genome editing with engineered nucleases has transformed the ability to introduce precise sequence modifications at almost any site within the genome. A major obstacle to probing the efficiency and consequences of genome editing is that no existing method enables the frequency of different editing events to be simultaneously measured across a cell population at any endogenous genomic locus. We have developed a novel method for quantifying individual genome editing outcomes at any site of interest using single molecule real time (SMRT) DNA sequencing. We show that this approach can be applied at various loci, using multiple engineered nuclease platforms including TALENs, RNA guided endonucleases (CRISPR/Cas9), and ZFNs, and in different cell lines to identify conditions and strategies in which the desired engineering outcome has occurred. This approach facilitates the evaluation of new gene editing technologies and permits sensitive quantification of editing outcomes in almost every experimental system used.
Comparative genomic analysis and virulence differences in closely related Salmonella enterica serotype Heidelberg isolates from humans, retail meats, and animals.
Salmonella enterica subsp. enterica serovar Heidelberg (S. Heidelberg) is one of the top serovars causing human salmonellosis. Recently, an antibiotic-resistant strain of this serovar was implicated in a large 2011 multistate outbreak resulting from consumption of contaminated ground turkey that involved 136 confirmed cases, with one death. In this study, we assessed the evolutionary diversity of 44 S. Heidelberg isolates using whole-genome sequencing (WGS) generated by the 454 GS FLX (Roche) platform. The isolates, including 30 with nearly indistinguishable (one band difference) XbaI pulsed-field gel electrophoresis patterns (JF6X01.0032, JF6X01.0058), were collected from various sources between 1982 and 2011 and included nine isolates associated with the 2011 outbreak. Additionally, we determined the complete sequence for the chromosome and three plasmids from a clinical isolate associated with the 2011 outbreak using the Pacific Biosciences (PacBio) system. Using single-nucleotide polymorphism (SNP) analyses, we were able to distinguish highly clonal isolates, including strains isolated at different times in the same year. The isolates from the recent 2011 outbreak clustered together with a mean SNP variation of only 17 SNPs. The S. Heidelberg isolates carried a variety of phages, such as prophage P22, P4, lambda-like prophage Gifsy-2, and the P2-like phage which carries the sopE1 gene; virulence genes, including 62 pathogenicity and 13 fimbrial markers; and resistance plasmids of the incompatibility (Inc)I1, IncA/C, and IncHI2 groups. Twenty-one strains contained an IncX plasmid carrying a type IV secretion system. On the basis of the recent and historical isolates used in this study, our results demonstrated that, in addition to providing detailed genetic information for the isolates, WGS can identify SNP targets that can be utilized for differentiating highly clonal S. Heidelberg isolates.
Prior to the epidemic that emerged in Haiti in October of 2010, cholera had not been documented in this country. After its introduction, a strain of Vibrio cholerae O1 spread rapidly throughout Haiti, where it caused over 600,000 cases of disease and >7,500 deaths in the first two years of the epidemic. We applied whole-genome sequencing to a temporal series of V. cholerae isolates from Haiti to gain insight into the mode and tempo of evolution in this isolated population of V. cholerae O1. Phylogenetic and Bayesian analyses supported the hypothesis that all isolates in the sample set diverged from a common ancestor within a time frame that is consistent with epidemiological observations. A pangenome analysis showed nearly homogeneous genomic content, with no evidence of gene acquisition among Haiti isolates. Nine nearly closed genomes assembled from continuous-long-read data showed evidence of genome rearrangements and supported the observation of no gene acquisition among isolates. Thus, intrinsic mutational processes can account for virtually all of the observed genetic polymorphism, with no demonstrable contribution from horizontal gene transfer (HGT). Consistent with this, the 12 Haiti isolates tested by laboratory HGT assays were severely impaired for transformation, although unlike previously characterized noncompetent V. cholerae isolates, each expressed hapR and possessed a functional quorum-sensing system. Continued monitoring of V. cholerae in Haiti will illuminate the processes influencing the origin and fate of genome variants, which will facilitate interpretation of genetic variation in future epidemics. Vibrio cholerae is the cause of substantial morbidity and mortality worldwide, with over three million cases of disease each year. An understanding of the mode and rate of evolutionary change is critical for proper interpretation of genome sequence data and attribution of outbreak sources. 
The Haiti epidemic provides an unprecedented opportunity to study an isolated, single-source outbreak of Vibrio cholerae O1 over an established time frame. By using multiple approaches to assay genetic variation, we found no evidence that the Haiti strain has acquired any genes by horizontal gene transfer, an observation that led us to discover that it is also poorly transformable. We have found no evidence that environmental strains have played a role in the evolution of the outbreak strain.