As technology developers, one of our greatest joys is seeing how customers take our sequencing tools and deploy them for innovative and compelling new projects. Metagenomics has been one of those areas: our customers have recently been demonstrating the significant performance improvements enabled by our HiFi metagenome sequencing data and analysis pipelines.
But since much of that work is protected by HIPAA regulations or has not yet been published, we are now releasing a metagenomic data set to help scientists see how HiFi data can make a difference for these types of studies. This information is now available for review and analysis and can be used with existing tools or to help develop new ones.
The data set was generated from four fully consented, pooled human fecal microbiome samples made available through The BioCollective. Two samples came from vegan donors and two from omnivore donors, allowing us to see how diet influences gut microbiota. Pooling creates a reference material by combining samples from multiple donors (in this case four adults), which leads to a more complex sample and a richer data set than can be obtained through mock community approaches. It also gives a more consistent composition than a sample from an individual donation.
Long-Read Sequencing Produces Rich Profiling Information
HiFi sequencing gave us nearly 2 million reads per sample, with mean read length close to 10 kb for each. Median quality for the sequencing data was Q39 for two samples and Q40 for two samples. We found that species composition was consistent within diets and different between diets. Of the 76 bacterial species detected, 14 were exclusive to the omnivore samples and 21 were only found in the vegan samples.
There are a lot of exciting things to unpack in this data set. First, it demonstrates that our data analysis pipelines produce rich functional profiling information. Unlike analyses of short-read data, about 90% of HiFi reads have at least one functional annotation, with reads typically having two to five annotations. For each sample run on a single SMRT Cell 8M, we generated more than 8 million total annotations.
In addition, the data set highlights the advantage of high accuracy when assembling long-read data from metagenomes. These samples often contain closely related strains. A common cutoff for defining a distinct species is a sequence difference of just 3%, and strains within a species differ by less than that. If the difference between strains is smaller than the sequencing error rate, then the error correction process can erase the real differences needed to resolve and distinguish those strains.
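To make the comparison between error rate and strain divergence concrete, here is a minimal sketch that converts a Phred quality score into an error probability using the standard relation error = 10^(-Q/10), then checks it against the 3% cutoff. The function name and the Q10 comparison value are illustrative assumptions, not part of our pipelines, and treating a median read quality as a uniform per-base error rate is a simplification.

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


# The 3% species cutoff mentioned above, expressed as a fraction.
SPECIES_CUTOFF = 0.03

# Q39 and Q40 are the median qualities reported for this data set.
# Q10 is a hypothetical noisy read, included only for contrast.
for q in (10, 39, 40):
    error_rate = phred_to_error_rate(q)
    below_cutoff = error_rate < SPECIES_CUTOFF
    print(f"Q{q}: error rate {error_rate:.6f}, below 3% cutoff: {below_cutoff}")
```

At Q40 the error probability is 1 in 10,000, far below the 3% cutoff, whereas the hypothetical Q10 read has an error rate of 10%, well above it.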
This heightened ability to resolve strains is what drives the large number of high-quality metagenome-assembled genomes (MAGs) that can be recovered from a relatively small amount of HiFi data. For each sample, our assembly evaluation pipeline identified between 56 and 69 MAGs. The unique combination of high accuracy and long reads means that high-quality MAGs can be generated with less than 20-fold coverage, and many of those MAGs are represented in a single contig.
Listen to Daniel Portik talk about this new data set in the first episode of our Metagenomics Webinar Series, available on demand. We hope you get the chance to download the data and experience it for yourself.
Want to talk to us about this data set or have project ideas where you think HiFi data can make a difference? Hit us up on Twitter or reach out directly to our metagenomic specialists Meredith Ashby or Daniel Portik.
If you are interested in additional Metagenomics Webinars, register for upcoming episodes to learn about:
● Resolving viral evolution and quasispecies diversity,
● Identifying key players in host-microbiome interactions with high resolution 16S sequencing, and
● Revealing mechanisms of bacterial virulence and adaptation