Pacific Biosciences

Beyond Contiguity – Assessing the Quality of Genome Assemblies with the 3 C’s

Thursday, March 5, 2020

With high-throughput long-read sequencing, it is now affordable and routine to produce a de novo genome assembly for microbes, plants and animals. The quality of a reference genome impacts biological interpretation and downstream utility, so it is important that researchers strive to achieve quality similar to “finished” assemblies like the human reference, GRCh38. 

Until sequence data and the resulting assemblies regularly achieve reference quality, assemblies should be evaluated along three key dimensions: Contiguity, Completeness, and Correctness. However, the most commonly used measures of genome quality only tackle two of the three C’s.

Contiguity is often measured as contig N50: the contig length such that contigs of that length or longer contain 50% of the total assembly length. In this era of long-read genome assemblies, a contig N50 over 1 Mb is generally considered good.
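To make the N50 definition concrete, here is a minimal sketch in Python. The function name and the toy contig lengths are illustrative assumptions, not part of any particular assembler's output.

```python
def contig_n50(lengths):
    """Return the contig N50 for a list of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly length defines the N50.
        if 2 * running >= total:
            return length


# Toy assembly (made-up numbers): five contigs totaling 10 Mb.
toy_contigs = [4_000_000, 3_000_000, 1_500_000, 1_000_000, 500_000]
print(contig_n50(toy_contigs))
```

Note that this uses the summed contig lengths as the denominator. If an independent estimate of the true genome size is used instead, the statistic is usually called NG50.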

Completeness is often measured using BUSCO (Benchmarking Universal Single-Copy Orthologs) scores, which look for the presence or absence of highly conserved genes in an assembly. The aim is to have the highest percentage of genes identified in your assembly, with a BUSCO complete score above 95% considered good.

Correctness, the third and final C, is more challenging to measure. Correctness can be defined as the accuracy of each base pair in the assembly and is most often measured as concordance of an assembly to a gold-standard reference. Of course, when sequencing a novel species there may not be a reference against which to measure. Furthermore, concordance is a good measure of accuracy only when the gold standard itself is very high quality and when there is little biological divergence between the reference sample and the assembly sample (Figure 1).

So, how does one properly measure the accuracy of a generated genome assembly? Well, we explored several methods you might find useful and broke them down by what type of orthogonal data is needed for each.


Data needed: Transcript Annotations

One measure of correctness is the number of frameshifting indels in coding genes. A frameshift usually disrupts production of the protein encoded by the gene, so true frameshifts are rare in a real genome. Thus, most frameshifts observed in an assembly are actually assembly errors.

This approach is similar to BUSCO, but rather than using a small set of conserved genes, it analyzes a larger set of genes. This requires a set of transcripts from the same (or a very closely related) sample, which is commonly generated as part of a genome annotation project. PacBio RNA Sequencing, using the Iso-Seq method, is a good strategy for genome annotation.

The primary advantage of this approach is that you may be able to use an annotation or RNA sequencing data that already exists. The primary disadvantages are that it assesses only a small percentage of the genome (often less than 1%, and typically some of the most conserved regions) and that it may underestimate accuracy, since not all frameshifts are errors.
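The frameshift count described in this section can be sketched as follows. This is only an illustration: the input structure (a mapping from gene name to the signed lengths of indels seen when each transcript is aligned to the assembly) and the gene names are hypothetical, and producing those indel calls with an aligner is outside the scope of the sketch.

```python
def is_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def count_frameshifts(coding_indels):
    """Count frameshifting indels and list the genes carrying at least one.

    coding_indels: dict mapping gene name -> list of signed indel lengths
    within the coding sequence (positive = insertion, negative = deletion).
    This input format is an assumption made for the example.
    """
    total = 0
    affected_genes = []
    for gene, indels in coding_indels.items():
        # Each indel is judged on its own; compensating indels in the
        # same gene are still counted individually.
        n = sum(1 for length in indels if is_frameshift(length))
        total += n
        if n:
            affected_genes.append(gene)
    return total, affected_genes


# Toy input (made-up genes and indels).
toy = {
    "geneA": [],        # no indels
    "geneB": [-1],      # 1 bp deletion: frameshift
    "geneC": [3, -6],   # in-frame indels only
    "geneD": [2, 1],    # two frameshifting insertions
}
print(count_frameshifts(toy))
```

Dividing the number of affected genes by the number of genes assessed gives a simple per-gene rate that can be compared across assemblies of the same sample.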


Data needed: Reference Genome & Short Reads

Sometimes a reference genome is available for the same species but for a different individual than the one being assembled. In such cases, it is useful to define “high-confidence regions” where the reference is a good match to the sample and then assess the assembly only within those high-confidence regions. The Genome in a Bottle Consortium has applied such an approach for human samples.

How to generate a high-confidence region benchmark from Kingan, et al.

To build high-confidence regions as Kingan, et al. did for human, rice, and Drosophila, short-read sequencing data is mapped to the reference and used to exclude low-confidence regions including those with abnormal coverage or in close proximity to variants. Within the resulting high-confidence regions, concordance is a good measure of assembly accuracy.

The advantages of this approach are that it provides a good measure of assembly accuracy and explicitly identifies errors as discordances between an assembly and the high-confidence regions. Discordances can then be examined to determine how to improve the assembly.

The disadvantages are that it requires an independent reference genome from which to start, as well as additional short-read data. Also, the accuracy estimate can be somewhat optimistic by excluding “difficult” regions from evaluation or somewhat pessimistic if true biological variants are not removed from the benchmark.
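As a rough illustration of concordance restricted to high-confidence regions, the sketch below takes a list of half-open intervals on a single sequence and a list of positions where the assembly disagrees with the reference. Both inputs and all names are hypothetical; building the regions from short-read coverage and variant calls, as described above, is not shown.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def concordance_in_regions(regions, discordant_positions):
    """Return (concordance, errors, assessed_bases) inside the regions."""
    merged = merge_intervals(regions)
    starts = [start for start, _ in merged]
    assessed = sum(end - start for start, end in merged)
    if assessed == 0:
        raise ValueError("high-confidence regions are empty")
    errors = 0
    for pos in discordant_positions:
        # Find the last region starting at or before pos, then check
        # that pos falls before that region's end.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            errors += 1
    return 1 - errors / assessed, errors, assessed


# Toy data (made-up coordinates on one sequence).
regions = [(0, 1000), (2000, 3000), (2500, 3500)]
discordances = [10, 1500, 2999, 3500]
concordance, errors, assessed = concordance_in_regions(regions, discordances)
print(concordance, errors, assessed)
```

Discordances outside the regions (here, positions 1500 and 3500) are ignored, which is exactly why the estimate can be optimistic when difficult regions are excluded.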


Data needed: BAC Sequences

In cases where a reference is not available but another set of high-quality sequences, such as Bacterial Artificial Chromosomes (BACs), exists for the same sample, you can measure concordance between your assembled contigs and the BAC sequences. This method was used by Vollger, et al. when validating the accuracy of one of the first human assemblies generated using PacBio highly accurate long reads, known as HiFi reads.


Data needed: Short Reads

It is possible to measure accuracy even for a species with no existing reference genome by comparing the k-mers in an assembly to k-mers from short reads from the same individual. One tool to do this is yak from Heng Li. Another is merqury, developed by Arang Rhie in Adam Phillippy’s group.

K-mer spectrum comparison to identify errors

The advantages of this approach are that it does not require a reference genome and does not ignore difficult regions of the assembly. It also provides a way to measure completeness by flipping the comparison and looking for k-mers present in the short reads that are missing in the assembly. Merqury has the additional ability to track the coordinates of errors: it outputs files that can be loaded as IGV tracks so the user can visualize misassemblies or other errors. Merqury has many additional functions, such as outputting spectra-cn plots and, for users with a trio, assessing contig phasing accuracy with statistics and plots.
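The idea behind the k-mer comparison can be shown with a toy sketch. This is not how yak or merqury are implemented: real tools count canonical k-mers (a k-mer and its reverse complement together), use much larger k, and filter out low-count read k-mers that come from sequencing errors. The function names, the sequences, and the per-base accuracy conversion (which assumes each base error spoils about k consecutive k-mers) are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_report(assembly, reads, k):
    """Compare assembly k-mers with k-mers seen in short reads."""
    asm_kmers = kmers(assembly, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    # k-mers found only in the assembly are likely assembly errors.
    asm_only = asm_kmers - read_kmers
    shared = asm_kmers & read_kmers
    # Flipping the comparison: read k-mers missing from the assembly
    # indicate sequence the assembly failed to capture.
    missing = read_kmers - asm_kmers

    return {
        "assembly_only": len(asm_only),
        "missing_from_assembly": len(missing),
        "completeness": len(shared) / len(read_kmers),
        # Assumed conversion: one base error affects about k k-mers, so
        # the per-base accuracy is roughly the k-th root of the shared
        # fraction of assembly k-mers.
        "per_base_accuracy": (len(shared) / len(asm_kmers)) ** (1 / k),
    }


# Toy example: reads drawn from the true sequence ACGTACGGTCA, and an
# assembly carrying a single base error (G -> A).
reads = ["ACGTACGG", "TACGGTCA"]
assembly = "ACGTACGATCA"
report = kmer_report(assembly, reads, k=4)
print(report)
```

In this toy case the single wrong base produces four assembly-only 4-mers, which shows why one error is visible as a cluster of unsupported k-mers.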

Example of a homozygous variant that indicates an error using short reads.

Similar to the k-mer approach above, short-read data can be used to count errors by aligning short reads to the assembly and identifying single-nucleotide variants (SNVs) where the reads disagree with the assembly. An error rate can then be calculated by dividing the total count of SNVs by the number of bases in the assembly covered by at least 3 short reads. The short-read data can be from a closely related individual, although estimates of correctness are most accurate when the same individual is used, as Koren, et al. did with an F1 cross of two breeds of cattle.
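The error-rate calculation described here is simple arithmetic, sketched below. The per-base depth list and the SNV count are made-up inputs; in practice they would come from a read aligner and a variant caller, which are not shown.

```python
def snv_error_rate(depth_per_base, snv_count, min_depth=3):
    """Estimate the assembly error rate from short-read alignments.

    depth_per_base: short-read depth at every assembly position.
    snv_count: number of SNVs called between the reads and the assembly.
    min_depth: minimum depth for a base to count as assessed (3, as above).
    """
    covered = sum(1 for depth in depth_per_base if depth >= min_depth)
    if covered == 0:
        raise ValueError("no bases reach the minimum depth")
    return snv_count / covered


# Toy assembly of 10,000 bases: 9,000 well covered, 1,000 poorly covered.
toy_depth = [30] * 9000 + [2] * 1000
rate = snv_error_rate(toy_depth, snv_count=9)
print(f"{rate:.4%} errors per covered base")
```

Only the 9,000 bases with at least 3 reads enter the denominator, so poorly covered sequence neither raises nor lowers the estimate.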

Like the k-mer method, this approach requires no reference genome or transcript dataset and does not ignore difficult regions of the assembly. Unlike the k-mer approach, potential errors in the assembly can be identified and characterized in order to improve the assembly method.


Looking to the Future

Exciting progress in long-read sequencing and genome assembly has made it standard to produce contiguous, complete genomes. In order to generate genomes that are not simply assembled, but are also effectively used for downstream biology, we must address the third dimension of quality: correctness. The techniques discussed above make it easy to measure correctness with a variety of different orthogonal data types. We expect these approaches will identify which sequencing workflows produce the most accurate genomes and will nudge the field towards an era of reference-grade de novo assemblies.


Learn more about HiFi sequencing data for your organism of interest or get in touch with a PacBio scientist to scope out your project.
