Genomes vs. GenNNNes: The Difference between Contigs and Scaffolds in Genome Assemblies
Wednesday, February 3, 2021
This blog post has been updated; it was originally published in September 2016.
In recent interactions with the scientific community, we’ve seen a growing number of questions around scaffolding genome assemblies. We thought it might be useful to review the concepts behind contigs and scaffolds, as well as the circumstances in which one might want to scaffold a high-quality PacBio genome assembly.
Contigs vs. Scaffolds
Contigs are continuous stretches of sequence containing only A, C, G, or T bases, with no gaps. SMRT Sequencing has all of the necessary performance characteristics – long reads, lack of sequence-context bias, and high accuracy – to generate contiguous genome assemblies with megabase-sized contigs. Ultra-long contigs provide complete and uninterrupted sequence information across full genes and, more recently, even allow the homologous chromosome copies (haplotypes) of diploid and polyploid organisms to be separated.
PacBio highly accurate long reads – known as HiFi reads – have been described as “the most effective standalone technology for de novo assembly” in a study that sequenced the CHM13 human cell line and produced an assembly with a contig N50 of 29.5 Mb and a Phred quality score of Q45. HiFi reads have also enabled reference-quality de novo assemblies of many plant and animal species, population-specific human assemblies, and the first complete sequence of a human autosome, chromosome 8, including its centromere. Even large and complex plant genomes like the California Redwood, a 27 Gb hexaploid, can be readily assembled with high contiguity using HiFi reads.
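The two metrics quoted above, contig N50 and Phred quality, are simple to compute. The following is a minimal sketch, not taken from any particular assembler or QC tool: the function names and the toy contig lengths are made up for illustration. It uses the common definition of N50 (the length L such that contigs of length L or longer contain at least half of all assembled bases) and the standard Phred relation Q = -10 · log10(P_error).

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    Contigs are sorted from longest to shortest and accumulated until
    the running total reaches half of the assembly size; the length of
    the contig that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # no contigs supplied


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


# Hypothetical contig lengths (in bases), chosen only to keep the arithmetic easy.
toy_contig_lengths = [80, 70, 50, 40, 30, 20, 10]

print(n50(toy_contig_lengths))   # 70: 80 + 70 = 150, half of the 300-base total
print(phred_to_error_rate(45))   # about 3.2e-05, roughly one error per 31,600 bases
```

Read this way, Q45 corresponds to roughly one erroneous base in every 31,600, and an N50 of 29.5 Mb means that half of the assembled bases sit in contigs at least that long.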
Scaffolds are created by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome. Contigs in a scaffold are separated by gaps, which are designated by a variable number of ‘N’ letters. Scaffolding is often used for short-read assemblies to make sense of fragmented genome assemblies containing short contigs. However, scaffolds have three principal deficiencies:
- Scaffolds miss critical information. Gaps represent missing genomic information and, in many cases, these gaps coincide with important genomic loci. Many promoters and first exons are GC-rich in sequence, often resulting in missing or low-quality sequence reads from short-read or Sanger sequencing. Thus, genes are incompletely resolved, and their regulation cannot be understood. Another reason for gaps in scaffolded assemblies is large, repetitive elements, which short-read sequencing methods struggle to bridge. Thus, duplicated genes, genes vs. pseudogenes, short tandem repeats, variable number tandem repeats, microsatellites, and many other structural genomic features are often unresolved in scaffolded short-read assemblies. As summarized in a Nature Reviews Genetics article, long-read sequencing technologies, and specifically HiFi reads, help overcome these types of complex regions to give a complete picture of genetic variation, including in regions previously thought to be intractable, like telomeres and centromeres.
- The length of a scaffold gap often has no relation to the true gap size. In several reference genomes, gaps are arbitrarily set to certain fixed lengths. For example, most gaps in the zebra finch reference are set to 100 Ns, while in the version 3 maize reference they are set to 1,000 Ns. This means that in most cases, the true length of sequence represented by the gap differs from the set gap size, and is sometimes off by thousands of bases. With gap sizes this uncertain, the true spatial relationships of functional elements in a genome cannot be determined, and the placeholder gaps underestimate the actual extent of missing information. More recently, those older reference assemblies have benefited from PacBio long-read sequencing – see the latest: zebra finch and maize.
- Gap-flanking scaffold sequence can be low-quality, and is sometimes completely wrong. The sequences surrounding gaps often fall into areas where short-read technologies have deficiencies due to GC-bias or read-length limitations. This can result in sequence that is of lower quality and, in some cases, completely erroneous. For example, because of complex repeat structures in the human IGH locus, the right edge of a 50,000 N gap in the short-read assembly contains 1,836 bases of flanking sequence that has no support in the hg19 human genome reference or the PacBio assembly. In some ways, having incorrect flanking sequence in scaffolds is worse than having ‘N’ gaps, since that erroneous sequence is considered and included for downstream analyses.
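To make the contig/scaffold relationship concrete, the sketch below splits a scaffold back into its contigs at every run of N and tabulates the gap lengths. It is an illustrative sketch only: the function names and the toy sequence are hypothetical, and it assumes that N is used only for gaps rather than for single ambiguous base calls. A gap length that recurs many times, such as 100, is a strong hint that the gaps are fixed-size placeholders rather than size estimates.

```python
import re
from collections import Counter

# A gap is any run of one or more N characters (upper or lower case).
GAP = re.compile(r"[Nn]+")


def split_scaffold(scaffold: str) -> tuple[list[str], list[int]]:
    """Split a scaffold at runs of N.

    Returns the gap-free contigs and the length of every gap, both in
    left-to-right order.
    """
    contigs = [piece for piece in GAP.split(scaffold) if piece]
    gap_lengths = [m.end() - m.start() for m in GAP.finditer(scaffold)]
    return contigs, gap_lengths


def gap_length_counts(scaffolds: list[str]) -> Counter:
    """Count how often each gap length occurs across a set of scaffolds."""
    counts: Counter = Counter()
    for scaffold in scaffolds:
        counts.update(split_scaffold(scaffold)[1])
    return counts


# Toy scaffold (made up for illustration): four contigs, two 100-N gaps, one 37-N gap.
toy_scaffold = (
    "ACGTAC" + "N" * 100 + "GGCATT" + "N" * 100 + "TTAGC" + "N" * 37 + "CCGA"
)

contigs, gaps = split_scaffold(toy_scaffold)
print(contigs)                                          # ['ACGTAC', 'GGCATT', 'TTAGC', 'CCGA']
print(gaps)                                             # [100, 100, 37]
print(gap_length_counts([toy_scaffold]).most_common(1)) # [(100, 2)]
```

Running `n50`-style statistics separately on the contigs and on the intact scaffolds is a quick way to see how much of an assembly's apparent contiguity comes from sequence and how much from gaps.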
Illustration of the difference between contigs and scaffolds in genome assemblies
The information missing from gapped scaffold assemblies complicates, and may even preclude, downstream functional and comparative genomics analyses. Scaffolded short-read assemblies get nowhere near the quality of PacBio genome assemblies in terms of contiguity and completeness, and they often require labor-intensive follow-up work to close gaps, adding time and cost to projects.
Scaffolding PacBio assemblies for chromosome-scale genome representations
For even longer-range genomic connectivity, e.g. to bridge the largest segmental duplications and repeat regions, researchers can go a step further by adding scaffolding information to a PacBio assembly, often resulting in telomere-to-telomere, chromosome-scale genome representations. Several methods have been demonstrated to work very well for this purpose, including optical mapping and crosslinking approaches. Check out examples of barn swallow, insects, and human genome sequencing to see how chromosome-level scaffolding enables more comprehensive insights.
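Once a long-range method has established the order and orientation of the contigs, writing out the scaffold itself is mechanical. The sketch below is a toy illustration under stated assumptions, not the output format or algorithm of any specific scaffolding tool: `build_scaffold`, the contig names, and the `path` structure are all hypothetical. Contigs on the minus strand are reverse-complemented, and neighbors are joined with a run of N whose length (`gap_size`) is a placeholder, not a measurement of the real distance.

```python
# Translation table for complementing DNA, preserving case and leaving N unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(
    contigs: dict[str, str],
    path: list[tuple[str, str]],
    gap_size: int = 100,
) -> str:
    """Join contigs into one scaffold.

    `path` lists (contig_name, strand) pairs in scaffold order, where
    strand is "+" (use as-is) or "-" (reverse-complement first).
    Adjacent contigs are separated by `gap_size` N characters.
    """
    oriented = []
    for name, strand in path:
        seq = contigs[name]
        if strand == "-":
            seq = reverse_complement(seq)
        elif strand != "+":
            raise ValueError(f"unknown strand {strand!r} for contig {name}")
        oriented.append(seq)
    return ("N" * gap_size).join(oriented)


# Hypothetical contigs and ordering, as a scaffolding method might report them.
toy_contigs = {"ctg1": "AACCG", "ctg2": "TTGCA", "ctg3": "GGGAT"}
toy_path = [("ctg2", "+"), ("ctg1", "-"), ("ctg3", "+")]

print(build_scaffold(toy_contigs, toy_path, gap_size=3))
# TTGCANNNCGGTTNNNGGGAT
```

Starting from long, accurate contigs means there are far fewer joins of this kind to make, so fewer placeholder gaps end up in the final chromosome-scale representation.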
There are numerous large international initiatives using PacBio long-read sequencing to produce high-quality, phased, chromosome-level genome assemblies of many organisms:
- Vertebrate Genomes Project
- Sanger 25 Genomes Project
- Darwin Tree of Life Project
- NHGRI Human Pangenome Reference Initiative