UPDATE — November 17, 2020: This paper is now published in Scientific Data.
ORIGINAL POST
It’s been more than a year since we introduced HiFi sequencing to generate highly accurate long reads. In that time, we’ve seen many PacBio users make HiFi sequencing their go-to setting because it’s simple, reliable, and cost-effective. For scientists who have yet to generate their own HiFi data, we thought it might be helpful to publish a few data sets for exploration and analysis.
In a new preprint, we have released HiFi data sets for five samples: mouse, frog, maize, strawberry, and a mock metagenome community. We like to think there’s a data set for everyone here, whatever your research area of interest! Working with any of these HiFi read collections should offer a great introduction to this sequencing mode and show you why we often hear how easy it is to analyze HiFi data compared to traditional long reads.
Consistent with previous reports, the HiFi data generated for these five samples yielded excellent accuracy, with average read qualities ranging from 99.84% to 99.97% (Phred Q28 to Q35). With that kind of accuracy, we look forward to seeing what interesting biology our collaborators find within these data.
| Organism | SRA | HiFi Data Yield (Gb) | Average Read Length (kb) | Average Read Quality (Phred Q) |
| --- | --- | --- | --- | --- |
| Mus musculus | SRR11606870 | 66.5 | 17.1 | 31 |
| Zea mays | SRR11606869 | 48.1 | 15.6 | 30 |
| Fragaria ananassa | SRR11606867 | 29.7 | 21.7 | 28 |
| Rana muscosa | SRR11606868 | 180.1 | 15.7 | 31 |
| MSA-1003 | SRR11606871 | 59.1 | 10.4 | 35 |
The five HiFi data sets generated on the Sequel II System
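The quality values in the table and the percentages quoted above are two views of the same number, assuming the table's quality column is Phred-scaled (which matches the 99.84% to 99.97% range in the text). Here is a minimal sketch of the standard Phred conversion; the function names are ours, chosen for illustration, and are not part of any PacBio tool.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to expected accuracy (0 to 1)."""
    # Phred definition: error probability = 10 ** (-Q / 10)
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert an accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Average read qualities copied from the table above.
table_q = {
    "Mus musculus": 31,
    "Zea mays": 30,
    "Fragaria ananassa": 28,
    "Rana muscosa": 31,
    "MSA-1003": 35,
}

for sample, q in table_q.items():
    print(f"{sample}: Q{q} = {phred_to_accuracy(q) * 100:.2f}% accuracy")
```

Under this conversion, the lowest value in the table (Q28) corresponds to about 99.84% and the highest (Q35) to about 99.97%.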
In addition to letting scientists get a fresh look at HiFi data, we hope this release will encourage development of new applications and software for the benefit of the entire sequencing community. Tools for assembling polyploid genomes and for calling variants in non-model organisms are just two areas where we hope to see growth.
For those of you who want to use existing software to explore these data sets, here are some tools that we find useful for working with HiFi reads:
- Assembly: FALCON, hifiasm, HiCanu (Check out this in-depth overview of HiFi assemblers from @Magdoll)
- Variant detection: DeepVariant for SNVs and small indels (<50 bp) and pbsv for SVs (≥50 bp)
- Metagenomics: Canu for assembly, FragGeneScan for gene prediction, and MEGAN for taxonomic and functional profiling
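The 50 bp boundary in the variant detection item can be sketched as a simple lookup. This is only an illustration of the size split described in the list; `suggested_caller` and `SV_MIN_SIZE_BP` are hypothetical names, and the function does not invoke DeepVariant or pbsv.

```python
# Size boundary between small variants and structural variants (SVs),
# taken from the list above: small indels are <50 bp, SVs are >=50 bp.
SV_MIN_SIZE_BP = 50


def suggested_caller(variant_size_bp: int) -> str:
    """Return the tool from the list above that covers a variant of this size.

    variant_size_bp is the length change of the variant in base pairs;
    an SNV is treated here as size 1 (an assumption for this sketch).
    """
    if variant_size_bp < 0:
        raise ValueError("variant size must be non-negative")
    if variant_size_bp >= SV_MIN_SIZE_BP:
        return "pbsv"
    return "DeepVariant"


for size in (1, 12, 49, 50, 2500):
    print(f"{size} bp -> {suggested_caller(size)}")
```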
For this data release, we’d like to thank all of the collaborators who helped to generate and present these results: Jane Landolin (@jlandolin), Nicholas Maurer, David Kudrna, Michael Hardigan, Cynthia Steiner, Steven Knapp (@knapp1955), Doreen Ware, and Beth Shapiro (@bonesandbugs).
And congratulations to the PacBio team members who led the charge on this effort: Ting Hon, Kristin Mars, Greg Young (@PacbioGreg), Yu-Chih Tsai, Joseph Karalius (@JoeyKaralius), Paul Peluso, and David Rank.
Access all of the data sets in the preprint, ‘Highly accurate long-read HiFi sequencing data for five complex genomes.’
—
Interested in finding out more about HiFi data for sequencing your organism of interest? Get in touch with a PacBio scientist to scope out your project.