Background: Microbial ecology is reshaping our understanding of the natural world by revealing the large phylogenetic and functional diversity of microbial life. However, the vast majority of these microorganisms remain poorly understood, as most cultivated representatives belong to just four phylogenetic groups and more than half of all identified phyla remain uncultivated. Characterization of this microbial ‘dark matter’ will thus greatly benefit from new metagenomic methods for in situ analysis, for example, sensitive high-throughput methods for characterizing community composition and structure from the sequencing of conserved marker genes. Methods: Here we utilize Single Molecule Real-Time (SMRT) sequencing of full-length 16S rRNA amplicons to phylogenetically profile microbial communities to below the genus level. We test this method on a mock community of known composition, as well as a previously studied microbial community from a lake known to predominantly contain poorly characterized phyla. These results are compared to traditional 16S tag sequencing from short-read technologies and to subsets of the full-length data corresponding to the same regions of the 16S gene. Results: We explore the benefits of using full-length amplicons for estimating community structure and diversity. In addition, we investigate the possible effects of context-specific and GC-content biases known to affect short-read sequencing technologies on the predicted community structure. We characterize the potential benefits of profiling metagenomic communities with full-length 16S rRNA genes from SMRT sequencing relative to standard methods.
SMRT Sequencing and assembly of the human microbiome project Mock Community sample – a feasibility project.
While the utility of Single Molecule, Real-Time (SMRT) Sequencing for de novo assembly and finishing of bacterial isolates is well established, this technology has not yet been widely applied to shotgun sequencing of microbial communities. In order to demonstrate the feasibility of this approach, we sequenced genomic DNA from the Microbial Mock Community B of the Human Microbiome Project
The constituents and intra-communal interactions of microbial populations have garnered increasing interest in areas such as water remediation, agriculture and human health. One popular, efficient method of profiling communities is to amplify and sequence the evolutionarily conserved 16S rRNA sequence. Currently, most targeted amplification focuses on short, hypervariable regions of the 16S sequence. Distinguishing information not spanned by the targeted region is lost and species-level classification is often not possible. SMRT Sequencing easily spans the entire 1.5 kb 16S gene, and in combination with highly-accurate single-molecule sequences, can improve the identification of individual species in a metapopulation. However, when amplifying a mixture of sequences with close similarities, the products may contain chimeras, or recombinant molecules, at rates as high as 20-30%. These PCR artifacts make it difficult to identify novel species, and reduce the amount of productive sequences. We investigated multiple factors that have been hypothesized to contribute to chimera formation, such as template damage, denaturing time before and during cycling, polymerase extension time, and reaction volume. Of the factors tested, we found two major related contributors to chimera formation: the amount of input template into the PCR reaction and the number of PCR cycles. Sequence errors generated during amplification and sequencing can also confound the analysis of complex populations. Circular Consensus Sequencing (CCS) can generate single-molecule reads with >99% accuracy, and the SMRT Analysis software provides filtering of these reads to >99.99% accuracies. Remaining substitution errors in these highly-filtered reads are likely dominated by mis-incorporations during amplification. Therefore, we compared the impact of several commercially-available high-fidelity PCR kits with full-length 16S amplification. 
We show results of our experiments and describe an optimized protocol for full-length 16S amplification for SMRT Sequencing. These optimizations have broader implications for other applications that use PCR amplification to phase variations across targeted regions and to generate highly accurate reference sequences.
The constituents and intra-communal interactions of microbial populations have garnered increasing interest in areas such as water remediation, agriculture and human health. Amplification and sequencing of the evolutionarily conserved 16S rRNA gene is an efficient method of profiling communities. Currently, most targeted amplification focuses on short, hypervariable regions of the 16S sequence. Distinguishing information not spanned by the targeted region is lost, and species-level classification is often not possible. PacBio SMRT Sequencing easily spans the entire 1.5 kb 16S gene in a single read, producing highly accurate single-molecule sequences that can improve the identification of individual species in a metapopulation. However, this process still relies upon PCR amplification from a mixture of similar sequences, which may result in chimeras, or recombinant molecules, at rates upwards of 20%. These PCR artifacts make it difficult to identify novel species, and reduce the amount of informative sequences. We investigated multiple factors that may contribute to chimera formation, such as template damage, denaturation time before and during thermocycling, polymerase extension time, and reaction volume. We found two related factors that contribute to chimera formation: the amount of input template into the PCR reaction, and the number of PCR cycles. A second problem that can confound analysis is sequence errors generated during amplification and sequencing. With the updated algorithm for circular consensus sequencing (CCS2), single-molecule reads can be filtered to 99.99% predicted accuracy. Substitution errors in these highly filtered reads may be dominated by mis-incorporations during amplification. Sequence differences in full-length 16S amplicons from several commercial high-fidelity PCR kits were compared. We show results of our experiments and describe our optimized protocol for full-length 16S amplification for SMRT Sequencing.
These optimizations have broader implications for other applications that use PCR amplification to phase variations across targeted regions and generate highly accurate reference sequences.
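To make the idea of filtering reads to 99.99% predicted accuracy concrete, the following is a minimal sketch, not the SMRT Analysis or CCS2 implementation. It assumes that a read's predicted accuracy can be approximated as one minus the mean per-base error probability derived from Phred quality values (Q = -10 log10 p). The function names, the read names, and the quality values are hypothetical and chosen only for illustration.

```python
def predicted_accuracy(phred_quals):
    """Approximate predicted read accuracy from per-base Phred scores.

    Assumption (not taken from the abstract): accuracy is estimated as
    1 minus the mean per-base error probability, with p = 10 ** (-Q / 10).
    """
    if not phred_quals:
        raise ValueError("read has no quality values")
    error_probs = [10 ** (-q / 10) for q in phred_quals]
    return 1.0 - sum(error_probs) / len(error_probs)


def filter_reads(reads, min_accuracy=0.9999):
    """Return the names of reads whose predicted accuracy meets the threshold."""
    return [name for name, quals in reads
            if predicted_accuracy(quals) >= min_accuracy]


# Hypothetical full-length 16S reads (about 1.5 kb) with made-up qualities.
reads = [
    ("read_a", [50] * 1500),               # uniformly high quality
    ("read_b", [50] * 1400 + [20] * 100),  # a low-quality stretch
    ("read_c", [30] * 1500),               # uniform Q30, i.e. 99.9%
]

kept = filter_reads(reads)
print(kept)
```

Under these assumptions only the uniformly high-quality read passes the 99.99% threshold; a short low-quality stretch is enough to pull an otherwise accurate 1.5 kb read below it.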
Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with short-read technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and the lack of GC bias in SMRT Sequencing have enabled better gene finding. However, since long-read assembly requires high coverage for error correction, the benefits of unbiased coverage have in the past been lost for low-abundance species. SMRT Sequencing performance improvements and the introduction of the Sequel II System have enabled a new, high-throughput data type uniquely suited to metagenome characterization: HiFi reads. HiFi reads combine high accuracy with read lengths up to 15 kb, eliminating the need for assembly for most microbiome applications, including functional profiling, gene discovery, and metabolic pathway reconstruction. Here we present the application of the HiFi data type to enable a new method of analyzing metagenomes that does not require assembly.
Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with one sequencing technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and the lack of GC bias in SMRT Sequencing have enabled better gene finding. However, since long-read assembly typically requires high coverage for error correction, these benefits have in the past been lost for low-abundance species. The introduction of the Sequel II System has enabled a new, higher-throughput, assembly-optional data type that addresses these challenges: HiFi reads. HiFi reads combine QV20 accuracy with long read lengths, eliminating the need for assembly for most metagenome applications, including gene discovery and metabolic pathway reconstruction. In fact, the read lengths and accuracy of HiFi data match or outperform the quality metrics of most metagenome assemblies, enabling cost-effective recovery of intact genes and operons while omitting the resource-intensive and data-inefficient assembly step. Here we present the application of HiFi sequencing to both mock and human fecal samples using full-length 16S and shotgun methods. This proof-of-concept work demonstrates the unique strengths of the HiFi method. First, the high correspondence among the expected community composition, the 16S data, and the shotgun profiling data reflects low context bias. In addition, every HiFi read yields ~5-8 predicted genes, without assembly, using standard tools. If assembly is desired, excellent results can be achieved with Canu and contig binning tools. In summary, HiFi sequencing is a new, cost-effective option for high-resolution functional profiling of metagenomes which complements existing short-read workflows.
User Group Meeting: Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
In this PacBio User Group Meeting presentation, PacBio scientist Meredith Ashby shared several examples of analysis — from full-length 16S sequencing to shotgun sequencing — showing how SMRT Sequencing enables…
Understanding interactions among plants and the complex communities of organisms living on, in and around them requires more than one experimental approach. A new method for de novo metagenome assembly,…
In this webinar, Dr. Ashby gives attendees a brief update on PacBio’s metagenomics solutions on the Sequel II System. Then, Dr. Ma, University of Maryland School of Medicine, discusses her…
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Antibiotic susceptibility of plant-derived lactic acid bacteria conferring health benefits to human.
Lactic acid bacteria (LAB) confer health benefits to humans when administered orally. We have recently isolated several species of LAB strains from plant sources, such as fruits, vegetables, flowers, and medicinal plants. Since antibiotics used to treat bacterial infectious diseases induce the emergence of drug-resistant bacteria in intestinal microflora, it is important to evaluate the susceptibility of LAB strains to antibiotics to ensure the safety and security of processed foods. The aim of the present study is to determine the minimum inhibitory concentration (MIC) of antibiotics against several plant-derived LAB strains. When aminoglycoside antibiotics, such as streptomycin (SM), kanamycin (KM), and gentamicin (GM), were evaluated using LAB susceptibility test medium (LSM), the MIC was higher than when using Mueller-Hinton (MH) medium. Etest, which is an antibiotic susceptibility assay method consisting of a predefined gradient of antibiotic concentrations on a plastic strip, is used to determine the MIC of antibiotics worldwide. In the present study, we demonstrated that Etest was particularly valuable when testing LAB strains. We also show that the low susceptibility of the plant-derived LAB strains against each antibiotic tested is due to intrinsic resistance and not acquired resistance. This finding is based on the whole-genome sequence information reflecting the horizontal spread of the drug-resistance genes in the LAB strains.
Membrane proteomic analysis reveals overlapping and independent functions of Streptococcus mutans Ffh, YidC1, and YidC2.
A comparative proteomic analysis was utilized to evaluate similarities and differences in membrane samples derived from the cariogenic bacterium Streptococcus mutans, including the wild-type strain and four mutants devoid of protein translocation machinery components, specifically Δffh, ΔyidC1, ΔyidC2, or Δffh/yidC1. The purpose of this work was to determine the extent to which the encoded proteins operate individually or in concert with one another and to identify the potential substrates of the respective pathways. Ffh is the principal protein component of the signal recognition particle (SRP), while yidC1 and yidC2 are dual paralogs encoding members of the YidC/Oxa/Alb family of membrane-localized chaperone insertases. Our results suggest that the co-translational SRP pathway works in concert with either YidC1 or YidC2 specifically, or with no preference for paralog, in the insertion of most membrane-localized substrates. A few instances were identified in which the SRP pathway alone, or one of the YidCs alone, appeared to be most relevant. These data shed light on underlying reasons for differing phenotypic consequences of ffh, yidC1 or yidC2 deletion. Our data further suggest that many membrane proteins present in a ΔyidC2 background may be non-functional, that ΔyidC1 is better able to adapt physiologically to the loss of this paralog, that shared phenotypic properties of Δffh and ΔyidC2 mutants can stem from impacts on different proteins, and that independent binding to ribosomal proteins is not a primary functional activity of YidC2. Lastly, genomic mutations accumulate in a ΔyidC2 background coincident with phenotypic reversion, including an apparent W138R suppressor mutation within yidC1. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Complete genome sequence and characterization of virulence genes in Lancefield group C Streptococcus dysgalactiae isolated from farmed amberjack (Seriola dumerili).
Lancefield group C Streptococcus dysgalactiae causes infections in farmed fish. Here, the genome of S. dysgalactiae strain kdys0611, isolated from farmed amberjack (Seriola dumerili), was sequenced. The complete genome sequence of kdys0611 consists of a single chromosome and five plasmids. The chromosome is 2,142,780 bp long and has a GC content of 40%. It possesses 2061 coding sequences and 67 tRNA and 6 rRNA operons. One clustered regularly interspaced short palindromic repeat, 125 insertion sequences, and four predicted prophage elements were identified. Phylogenetic analysis based on 126 core genes suggested that the kdys0611 strain is more closely related to S. dysgalactiae subsp. dysgalactiae than to S. dysgalactiae subsp. equisimilis. The genome of kdys0611 harbors 87 genes with sequence similarity to putative virulence-associated genes identified in other bacteria, of which 57 exhibit amino acid identity (>52%) to genes of the S. dysgalactiae subsp. equisimilis GGS124 human clinical isolate. Four putative virulence genes, emm5 (FGCSD_0256), spg_2 (FGCSD_1961), skc (FGCSD_1012), and cna (FGCSD_0159), in kdys0611 did not show significant homology with any deposited S. dysgalactiae genes. The chromosomal sequence of kdys0611 has been deposited in GenBank under Accession No. AP018726. This is the first report of the complete genome sequence of S. dysgalactiae isolated from fish. © 2019 The Societies and John Wiley & Sons Australia, Ltd.
Complete genome sequence of Bacillus velezensis JT3-1, a microbial germicide isolated from yak feces
Bacillus velezensis JT3-1 is a probiotic strain isolated from feces of the domestic yak (Bos grunniens) in the Gansu province of China. It has strong antagonistic activity against Listeria monocytogenes, Staphylococcus aureus, Escherichia coli, Salmonella Typhimurium, Mannheimia haemolytica, Staphylococcus hominis, Clostridium perfringens, and Mycoplasma bovis. These properties have made the JT3-1 strain the focus of commercial interest. In this study, we describe the complete genome sequence of JT3-1, with a genome size of 3,929,799 bp, 3761 encoded genes and an average GC content of 46.50%. Whole genome sequencing of Bacillus velezensis JT3-1 will lay a good foundation for elucidation of the mechanisms of its antimicrobial activity, and for its future application.
Optimized Cas9 expression systems for highly efficient Arabidopsis genome editing facilitate isolation of complex alleles in a single generation.
Genetic resources for the model plant Arabidopsis comprise mutant lines defective in almost any single gene in reference accession Columbia. However, gene redundancy and/or close linkage often render it extremely laborious or even impossible to isolate a desired line lacking a specific function or set of genes from segregating populations. Therefore, we here evaluated strategies and efficiencies for the inactivation of multiple genes by Cas9-based nucleases and multiplexing. In first attempts, we succeeded in isolating a mutant line carrying a 70 kb deletion, which occurred at a frequency of ~1.6% in the T2 generation, through PCR-based screening of numerous individuals. However, we failed to isolate a line lacking Lhcb1 genes, which are present in five copies organized at two loci in the Arabidopsis genome. To improve efficiency of our Cas9-based nuclease system, regulatory sequences controlling Cas9 expression levels and timing were systematically compared. Indeed, use of DD45 and RPS5a promoters improved efficiency of our genome editing system by approximately 25-30-fold in comparison to the previous ubiquitin promoter. Using an optimized genome editing system with RPS5a promoter-driven Cas9, putatively quintuple mutant lines lacking detectable amounts of Lhcb1 protein represented approximately 30% of T1 transformants. These results show how improved genome editing systems facilitate the isolation of complex mutant alleles, previously considered impossible to generate, at high frequency even in a single (T1) generation.