International Team Publishes Comprehensive DNA Analysis of German E. Coli Outbreak Strain and 11 Related Strains in New England Journal of Medicine
Thursday, July 28, 2011
Yesterday, we published a paper in the New England Journal of Medicine on the origin of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. The paper is based on the sequencing data we recently deposited in the public domain for different isolates of E. coli, including the current German outbreak strain as well as 7 additional E. coli strains from different outbreaks (in Africa) with the same serotype (O104:H4) as the German outbreak strain. To pull this off, we teamed up with leaders in the infectious disease space at Harvard/HHMI (Matt Waldor), UMD IGS (Dave Rasko and his lab), SSI (Karen Angeliki Krogfelt and Flemming Scheutz and their labs), and UVA School of Medicine (Jim Nataro). While a number of other groups had secured access to samples and sequenced isolates from the German outbreak, providing access to those data near the beginning of June, our data and analyses nicely complement this previous work. The combined efforts now provide a more accurate view of the origins of this outbreak strain and of the hypotheses relating to its increased virulence and drug resistance.
In our paper, we described the use of very long read sequencing data generated on the PacBio RS to provide not only the first ever PacBio-only de novo assembly, but also the most comprehensive characterization of the genome of the outbreak strain published to date. To produce the most accurate interpretations regarding an outbreak such as this, the gold standard is a complete (or near complete) genome assembly. We achieved this in the present study by combining the very long reads with accurate circular consensus sequencing data, a strategy that yielded long contigs covering 99.7% of the genome. This assembly of the outbreak strain, combined with the sequence data for the other 11 strains we sequenced, provided a more informative context for unambiguously resolving structural variations (in addition to single nucleotide variations). We demonstrated that these variations were important not only for understanding virulence and drug resistance, but also for understanding the relationships among enteroaggregative E. coli (EAEC) strains, EAEC strains of the O104:H4 serotype, enterohemorrhagic E. coli (EHEC) strains, and a number of other E. coli strains. Our analyses highlighted a number of interesting genomic features (including larger-scale deletions, insertions, and inversions), some shared among O104:H4 strains and others specific to the German outbreak strain. This information, combined with the phylogenetic analysis comparing 53 E. coli strains (Figure 2 in the paper) based on whole genome information (not just SNPs), enabled us to unambiguously classify the outbreak strain as an EAEC strain, with horizontal genetic exchange with a Shiga toxin-producing EHEC strain most likely explaining the emergence of this highly virulent Shiga toxin-producing O104:H4 EAEC outbreak strain.
Further, consistent with our predictions based on the comparative analyses, we discovered that expression of the Shiga toxin gene (stx2) was significantly increased by certain antibiotics, including ciprofloxacin, suggesting that caution should be exercised when considering certain antibiotics to treat infections caused by this strain.
My favorite part of the paper is Figure 1, which used Circos (a really wonderful tool for visualizing large-scale, high-dimensional data; see Genome Res (2009) 19:1639-1645) to visualize the results of a comparative analysis that included all of the data we generated on the O104:H4 strains. This figure is packed with nuggets of information that quickly reveal the regions of the genome most likely involved in the increased virulence of the outbreak strain. In fact, the NEJM staff put together a really cool animation that describes how the Circos plot was constructed, something I think goes even further in helping researchers and medical professionals who are less genomics-savvy understand what types of analyses were carried out to produce that figure. In addition, the PacBio team has made all of the relevant *raw* data related to the highlights in the NEJM animation available through our genome browsing tool (SMRT™ View), which provides the ability to directly examine the very long reads and how they elucidate structural variations in the genome (very cool, I think, if you have time to take a look).
One thing we weren’t able to highlight in the paper given length constraints was the utility of the long reads beyond assembly. For example, when we examined single long raw reads (> 5 kb) and simply BLASTed them against the NCBI nucleotide database, the top hits that came back most often included E. coli strain 55989. That is, with single reads we were able to classify the outbreak strain! (shown below for NCBI BLAST hits of reads 9 kb and longer)
This was perhaps unnecessary in this context, given that in the time it took to generate the very long reads we had enough data to begin whole genome assembly. Still, it suggests that in more complex communities of bacteria (metagenomics projects) the long reads will be useful for resolving the makeup of the community at very low coverage. Single long reads can also cover whole genes, virulence factors, and structural variations, and can provide the exact positioning of mobile elements that they span, again demonstrating utility beyond de novo assembly.
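For readers who want to try the single-read classification idea themselves, here is a minimal sketch of the tallying step. It is not the pipeline we used. It assumes you have already BLASTed your long reads and saved the hits in the standard 12-column tabular format (BLAST+ `-outfmt 6`), where column 1 is the read, column 2 is the subject, and column 12 is the bit score. The read names and subject labels in `SAMPLE_HITS` are made up for illustration, as are the function names.

```python
from collections import Counter

def best_hit_per_read(blast_lines):
    """Return {read_id: (subject_id, bitscore)} keeping the top-scoring hit.

    Assumes BLAST+ tabular output (-outfmt 6): 12 tab-separated columns,
    with qseqid first, sseqid second, and bitscore last.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        read_id, subject_id, bitscore = fields[0], fields[1], float(fields[11])
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject_id, bitscore)
    return best

def tally_top_subjects(best):
    """Count how many reads have each subject as their best hit."""
    return Counter(subject for subject, _ in best.values())

# Hypothetical hits for three long reads (all values are illustrative only).
SAMPLE_HITS = [
    "\t".join(row) for row in [
        ("read_1", "Ecoli_55989", "86.2", "9100", "900", "350",
         "1", "9100", "12000", "21100", "0.0", "9500"),
        ("read_1", "Ecoli_other", "84.0", "6000", "700", "260",
         "1", "6000", "500", "6500", "0.0", "5200"),
        ("read_2", "Ecoli_55989", "85.5", "9400", "950", "410",
         "1", "9400", "40000", "49400", "0.0", "9700"),
        ("read_3", "Ecoli_other", "83.1", "9200", "1100", "450",
         "1", "9200", "7000", "16200", "0.0", "8800"),
    ]
]

if __name__ == "__main__":
    counts = tally_top_subjects(best_hit_per_read(SAMPLE_HITS))
    for subject, n_reads in counts.most_common():
        print(subject, n_reads)
```

With real data you would pass the lines of your BLAST output file instead of `SAMPLE_HITS`. Taking only the single best hit per read is a deliberately crude choice; for a mixed community you would likely also want to look at alignment length and how close the runner-up hits are.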
Finally, I thought it worth commenting on the worldwide effort that was undertaken to sequence the outbreak strain (of which we were just one part), which illustrates well how the emerging high-throughput DNA sequencing technologies will transform the infectious disease space and beyond. In fact, we had demonstrated this power late last year with rapid sequencing of the Haitian cholera outbreak strain, setting the stage for a new kind of molecular epidemiology based on whole genome sequencing carried out in a matter of hours. The rapid release of data by BGI and HPA has definitely set a new standard, enabling communities of researchers to analyze outbreaks with great speed while the expert groups generating the data carry out more definitive, peer-reviewed analyses (although perhaps we are closer than we think to rapid analyses being published on the web, with communities of researchers providing rapid “reviews” that result in updates and refinements, and ultimately community consensus regarding interpretations of the data). While I wish we had pushed to get samples from this German E. coli outbreak earlier (alas, there is only so much we can do in a day!), the data we ultimately provided demonstrated that it is indeed possible to get to a near complete genome in a de novo fashion using sequence data generated in a matter of hours. Exciting times ahead!!