This article explores sequencing coverage fundamentals. Uncover key concepts and discover how highly accurate long-read sequencing provides a comprehensive view of the genome, at any coverage level.
What is sequencing coverage?
Genomics professionals use the terms “sequencing coverage” or “sequencing depth” to describe the number of unique sequencing reads that align to a region in a reference genome or de novo assembly.
A 30x human genome means that, on average, about 30 reads align to any given position in the reference. In practical terms, the higher the sequencing depth, the more times each part of the genome is read, resulting in more accurate and reliable information.
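As a rough illustration of what a figure like 30x means, average depth can be estimated as the total number of aligned bases divided by the genome size. The sketch below assumes a haploid human genome of about 3.1 billion bases, and the read count and mean read length are hypothetical values chosen for the example. The function name is illustrative and does not come from any PacBio tool.

```python
def mean_coverage(num_reads: int, mean_read_length: float, genome_size: float) -> float:
    """Estimate average depth as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


# Assumed values, for illustration only.
HUMAN_GENOME_SIZE = 3.1e9   # approximate haploid human genome, in bases
reads = 6_200_000           # hypothetical number of aligned reads
read_length = 15_000        # hypothetical mean read length, in bases

depth = mean_coverage(reads, read_length, HUMAN_GENOME_SIZE)
print(f"Estimated mean coverage: {depth:.1f}x")
```

This is an average only. It says nothing about how evenly those reads are spread across the genome.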
Why is sequencing coverage important?
Sequencing coverage is important in genomics because more coverage gives researchers greater statistical confidence that their results, and the conclusions that are drawn from them, are correct.
Increased coverage is important for scientists to be assured that what was observed was not a fluke or random error but rather an actual attribute of the biological sample.
In science, having statistical confidence in the outcome of an experiment is very important. If you flip a fair coin three times in a row, the most likely outcome (a 75% chance) is that it lands on the same side two out of three times. If you stopped there, you might conclude that such coins land on one side more often than the other (about 67% of the time). But a sample of three is small. What if the coin flips you observed were merely the result of chance? If you flipped the coin 30 or 100 times, you would be much more likely to find that the results fall close to a 50/50 split.
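The coin-flip intuition can be checked with a short calculation. The sketch below computes the exact probability that a fair coin lands heads in 40% to 60% of flips for different sample sizes. The 40% to 60% window is an arbitrary choice made for this illustration.

```python
from math import comb


def prob_near_even(n_flips: int) -> float:
    """Probability that a fair coin shows heads in 40% to 60% of n_flips."""
    total = 0
    for heads in range(n_flips + 1):
        # Integer comparison avoids floating-point edge cases at the bounds.
        if 4 * n_flips <= 10 * heads <= 6 * n_flips:
            total += comb(n_flips, heads)
    return total / 2 ** n_flips


for n in (3, 30, 100):
    print(f"{n:>3} flips: P(40-60% heads) = {prob_near_even(n):.3f}")
```

With three flips no outcome can fall inside the window at all, while the probability rises steadily as the number of flips grows. Sequencing reads behave similarly: more observations of the same position make a chance result less likely to mislead.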
Are genomes of equal sequencing coverage of equal scientific value?
Genomes with equal sequencing coverage are not necessarily of equal scientific value.
Many factors influence the explanatory power of a genome alignment or assembly. Beyond underlying factors such as assumptions, sample quality, and experimental design, the uniformity of coverage and the accuracy of the individual reads can greatly enhance the scientific value of one genome over another. An excellent example comes from a technology comparison study in which the authors found that, for de novo assembly of the Saccharomyces cerevisiae genome, 20x coverage with highly accurate long-read PacBio HiFi data exceeded the utility of 20x (and in fact even 80x) coverage using nanopore sequencing [1].
What coverage uniformity is, and why it is important
Coverage uniformity tells us how evenly distributed individual reads are across the genome or region of interest.
Two genomes could be sequenced to an equal level of coverage (e.g., 30x), but the first could have low uniformity (with some genes that are not covered at all and others that are covered 60 times), while the second could have highly uniform coverage with every gene or region covered 25 to 35 times. At face value these are both 30x genomes. However, the first is lower quality, with gaps in some areas and excellent coverage in others, while the second provides respectable confidence throughout, making it more useful for interpreting biology across the whole genome.
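The two-genome comparison can be made concrete with a few summary statistics. The sketch below uses hypothetical per-region depths that both average 30x, and reports the coefficient of variation and the fraction of regions below a minimum depth. The 15x threshold and the choice of metrics are assumptions made for illustration, not a standard defined in this article.

```python
from statistics import mean, pstdev


def summarize(depths, min_depth=15):
    """Return mean depth, coefficient of variation, and fraction of regions below min_depth."""
    avg = mean(depths)
    cv = pstdev(depths) / avg
    low = sum(1 for d in depths if d < min_depth) / len(depths)
    return avg, cv, low


# Hypothetical per-region depths; both sets average 30x.
uneven = [0, 60, 0, 60, 30, 30, 60, 0, 30, 30]
even = [25, 35, 28, 32, 30, 30, 27, 33, 29, 31]

for name, depths in (("uneven", uneven), ("even", even)):
    avg, cv, low = summarize(depths)
    print(f"{name}: mean={avg:.0f}x, CV={cv:.2f}, regions below 15x={low:.0%}")
```

Both data sets report the same mean, but only the uneven one leaves a share of regions under the threshold, which is the difference the average alone hides.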
How much sequencing coverage is necessary?
The right level of sequencing coverage for a study can vary widely depending on the goals of a project and how the results may be applied. Factors to consider include but are not limited to:
- The type of sequencing technology being used.
- The ploidy of the genome.
- The complexity or rarity of the variant or attribute you wish to study.
- Sample quality/level of degradation.
- The desired level of statistical confidence/power.
- Requirements set by peer-reviewed journals or data repositories.
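One way to reason about the statistical-confidence factor is to ask how often a position reaches a minimum depth at a given average coverage. The sketch below assumes an idealized model in which per-position depth follows a Poisson distribution. Real sequencing data is usually more variable than this, so the results are best read as optimistic estimates. The 15-read threshold is a hypothetical value chosen for the example.

```python
from math import exp


def prob_depth_at_least(mean_depth: float, min_depth: int) -> float:
    """P(depth >= min_depth) when depth ~ Poisson(mean_depth)."""
    term = exp(-mean_depth)  # P(depth == 0)
    below = 0.0
    for k in range(min_depth):
        below += term                 # add P(depth == k)
        term *= mean_depth / (k + 1)  # step to P(depth == k + 1)
    return 1.0 - below


for c in (10, 20, 30):
    print(f"mean {c}x: P(depth >= 15) = {prob_depth_at_least(c, 15):.3f}")
```

Under this simplified model, raising the average coverage sharply increases the share of positions that meet the threshold, which is one reason projects with stricter confidence requirements tend to sequence deeper.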
Currently, 30x coverage is widely regarded as the standard for human whole genome sequencing (WGS) in many biomedical research areas. However, this benchmark was established when DNA sequencing chemistry and capabilities were different from what they are today [2]. In the years since the rise of short-read sequencing-by-synthesis (SBS) technology, sources have pointed to the need for >30x coverage when studying many aspects of genomes [3,4]. For example, it is often recommended that human tumors be analyzed to a sequencing depth of 80x or more on conventional SBS sequencers [5,6]. As genomics begins to move beyond short-read NGS into the era of long-read sequencing, the amount of sequencing coverage needed to find what we are looking for will likely begin to change yet again.
Rethink what can be achieved at any level of coverage
Now and again, new technologies arrive on the scene in ways that redefine the way we approach science. The current standard for whole genome sequencing coverage was established at a time when short-read sequencing by synthesis (SBS) technology was still new [2]. As a result, the 30x sequencing coverage benchmark in many ways reflects the capabilities and limitations of that technology. Today, the technological landscape of DNA sequencing looks quite different.
The new forefront is now defined by long-read sequencing, which has matured into a highly accurate, high-throughput, scalable technology. This new approach is shifting the methodological paradigm for genomics in ways that beg the scientific community to not only break new ground, but reexamine past work, assumptions, and standards.
At present, PacBio long-read WGS can provide information that is practically unattainable with SBS short-read sequencers at any level of sequencing coverage [7]. Structural variation, native 5mC epigenetic calling, haplotype phasing, and accurate, uniform coverage of the entire genome, including what used to be called “dark regions” (such as large repeat expansions, GC-rich areas, centromeric regions, and more), can now be seen and are part and parcel of a standard PacBio HiFi long-read sequencing run on both the Sequel IIe and the new Revio systems.
In the era of high-throughput long-read sequencing, there is still some debate on what the new baseline should be for sequencing coverage in human genomics. Nevertheless, it is evident that the old standard is undergoing significant change, and we are so excited to see where it takes us.
Are you interested in finding out why PacBio long-read sequencing requires less coverage than other technologies? Check out this article on long-read sequencing.
Rich genomic information at lower coverage seems great, but what does it cost?
At about 330 USD*† per 10x human genome on the Revio system, accessing the full complement of genomic information to make a truly “whole” whole genome is now more affordable and information rich than ever. This is especially true when considering the time and money saved by skipping the additional experiments required to achieve lesser insights with short-read SBS technology. If you are interested in a 30x PacBio long-read human genome, the Revio system is optimized to produce one 30x HiFi human whole genome per SMRT Cell with a ~24-hour turnaround time, without batching, making it a $995 USD* 30x HiFi whole genome ripe for enabling novel discoveries.
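The per-genome figure follows from simple arithmetic on the numbers quoted above and in the footnotes. The sketch below assumes that the price and coverage of one SMRT Cell are split evenly across three multiplexed genomes. The even split is an assumption, since actual yield per library can vary.

```python
PRICE_PER_SMRT_CELL_USD = 995  # from the text: one 30x HiFi genome per SMRT Cell
COVERAGE_PER_SMRT_CELL = 30    # approximate depth for a single human genome
GENOMES_MULTIPLEXED = 3        # from the footnote: 3 libraries per SMRT Cell


def per_genome(price, coverage, n_genomes):
    """Split one SMRT Cell's price and coverage evenly across multiplexed genomes."""
    return price / n_genomes, coverage / n_genomes


cost, depth = per_genome(PRICE_PER_SMRT_CELL_USD, COVERAGE_PER_SMRT_CELL, GENOMES_MULTIPLEXED)
print(f"~{depth:.0f}x per genome at about ${cost:.0f} USD each")
```

The result lands within a few dollars of the rounded 330 USD figure quoted for a 10x genome.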
These advances offer exciting possibilities, and raise one crucial question – what groundbreaking discoveries will you make?
Are you interested in investing in a PacBio system or sequencing through a core facility or service provider? Would you like to pinpoint the correct level of coverage to achieve your project goals with PacBio sequencing?
Speak with a PacBio scientist to explore options.
1. Xue Zhang, Chen-Guang Liu, Shi-Hui Yang, Xia Wang, Feng-Wu Bai, Zhuo Wang, “Benchmarking of long-read sequencing, assemblers and polishers for yeast genome”, Briefings in Bioinformatics, Volume 23, Issue 3, May 2022, bbac146, https://doi.org/10.1093/bib/bbac146
2. Bentley, David R et al. “Accurate whole human genome sequencing using reversible terminator chemistry.” Nature 456,7218 (2008): 53-9. doi:10.1038/nature07517
3. Kong, Sek Won et al. “Measuring coverage and accuracy of whole-exome sequencing in clinical context.” Genetics in medicine: official journal of the American College of Medical Genetics 20,12 (2018): 1617-1626. doi:10.1038/gim.2018.51
4. Sims, D., Sudbery, I., Ilott, N. et al. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15, 121–132 (2014). https://doi.org/10.1038/nrg3642
5. “Evaluating Somatic Variant Calling in Tumor/Normal Studies” https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/whitepaper_wgs_tn_somatic_variant_calling.pdf
6. Manja Meggendorfer et al. “Analytical demands to use whole-genome sequencing in precision oncology” Seminars in Cancer Biology, Vol 84, 2022, 16-22, https://doi.org/10.1016/j.semcancer.2021.06.009
7. De Coster, W., Weissensteiner, M.H. & Sedlazeck, F.J. Towards population-scale long-read sequencing. Nat Rev Genet 22, 572–587 (2021). https://doi.org/10.1038/s41576-021-00367-3
8. William T. Harvey et al. “Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall” Preprint bioRxiv 2023.05.04.539448; doi: https://doi.org/10.1101/2023.05.04.539448
* Study design, sample type, and level of multiplexing may affect the number of SMRT Cells required. Costs may vary by region. Pricing includes library and sequencing reagents run on the Revio system and does not include instrument amortization or other reagents. Pricing information is current as of May 5, 2023.
† Requires multiplexing 3 human genome libraries per SMRT Cell 25M to achieve price estimate.