Pacific Biosciences

Sequencing 101: Looking Beyond the Single Reference Genome to a Pangenome for Every Species

Thursday, April 30, 2020

What is a Pangenome?

A pangenome identifies which portions of the genome are unique and which overlap and are therefore core to the species.

Unless you have an identical twin, no other person has a genome that is identical to yours. The same is true for other animal, plant, and microbial species that reproduce sexually: the genomes of individuals are unique. Less well known, but equally true, is that individual members of a species do not always share even the exact same genes. Nevertheless, scientists mostly use a single reference genome to represent an entire species: one human genome, one maize genome, one Staphylococcus aureus genome.


The Coining of the “Pangenome”

Around 2005, geneticists started to explore the concept of the pangenome, originally defined as the entire set of genes possessed by all members of a particular species and then extended to refer to a collection of all the DNA sequences that occur in a species.
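The gene-based definition above can be illustrated with a few lines of Python. This is only a minimal sketch: the strain names and gene identifiers are made up, and real pangenome tools first have to cluster genes into families before any comparison like this is possible.

```python
def summarize_pangenome(strain_genes):
    """Split genes into pan, core, and accessory sets.

    strain_genes maps a strain name to the set of gene identifiers
    found in that strain (hypothetical input format).
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)            # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes shared by all strains
    accessory = pan - core                   # genes missing from at least one strain
    return pan, core, accessory


# Toy, made-up data for three strains
strains = {
    "strain_A": {"gene1", "gene2", "gene3"},
    "strain_B": {"gene1", "gene2", "gene4"},
    "strain_C": {"gene1", "gene5"},
}

pan, core, accessory = summarize_pangenome(strains)
print(len(pan), sorted(core), sorted(accessory))
# prints: 5 ['gene1'] ['gene2', 'gene3', 'gene4', 'gene5']
```

In this toy case each strain carries two or three genes, yet the pangenome holds five and only one is core, which is the same pattern, at a tiny scale, that sequencing more strains tends to reveal.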

It started with bacteria, as many things do. Genomic processes such as recombination, the movement of mobile genetic elements, and horizontal gene transfer were clearly contributing to individual diversity across the bacterial domain. Some scientists discovered dozens, if not hundreds, of unknown genes when they sequenced new strains.

Generating pangenomes reveals more diversity than expected.

In 2007, MIT microbiologist Sallie Chisholm (@ChisholmLab_MIT) set out to determine the extent of genetic variation in the marine cyanobacterium Prochlorococcus. Each strain contains approximately 2,000 genes, and Chisholm estimated that a pangenome for Prochlorococcus would be around 6,000 genes, based on an initial set of 12 genome sequences. Eight years later, with 45 strains sequenced, she revised that estimate up to at least 80,000 genes, around four times the number of genes in the human genome, with the core genome for the species comprising only about 1,000 genes, or less than 2 percent of the total gene pool.

“That’s a lot of information shaping that collective,” Chisholm told The Scientist. “[The pangenome view] changes the way you think about what an organism is.”
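The proportions behind the Prochlorococcus figures are easy to check. The sketch below uses only the numbers quoted above; the variable and function names are illustrative.

```python
# Figures quoted in the text for Prochlorococcus
genes_per_strain = 2_000
core_genes = 1_000
pangenome_genes = 80_000


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


core_pct = share(core_genes, pangenome_genes)
strain_pct = share(genes_per_strain, pangenome_genes)

print(f"core genome: {core_pct:.2f}% of the pangenome")
print(f"one strain:  {strain_pct:.2f}% of the pangenome")
# prints:
# core genome: 1.25% of the pangenome
# one strain:  2.50% of the pangenome
```

So the core genome is about 1.25 percent of the estimated gene pool, consistent with "less than 2 percent," and any single strain carries only about 2.5 percent of it.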


Why is it Important to Capture the Full Range of Genetic Diversity? 

Those looking to create vaccines need to understand the genomic variation and versatility of disease-causing microbes, especially if they are hoping to develop universal vaccines that could provide protection against more than one strain in a species.

Those studying adaptation to climate change would benefit from comparing which genes are absent or abundant in species found in different geographic locations and/or environmental conditions. In crop plants, differences in variable genes could have implications for disease resistance, metabolite production, and stress responses.

And with differences in gene number increasingly being associated with disorders including autism, Parkinson’s disease, and Alzheimer’s disease, there are strong medical justifications for taking a more variation-centric view of the human species. Variants cannot be identified within regions that are completely missing from the reference sequence, and many such regions have been found to be more common than previously thought.


What is Being Done to Generate Pangenomes?

To answer this question, we sat down with a few scientists to talk about the era of the pangenome and what’s to come.

Pangenomes have revealed more genetic diversity within the maize species than between humans and chimpanzees.

One particularly important crop that has haunted geneticists and breeders for years is maize. It is challenging to sequence because the vast majority of its 2.3 Gb genome, a staggering 85 percent, is made up of highly repetitive transposable elements. Maize is also incredibly diverse in its DNA makeup. As an example, a study comparing genome segments from two inbred lines revealed that half of the sequence and one-third of the gene content were not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit around 94 percent sequence similarity.

“The whole notion of a single reference genome for crop plants is an antiquated concept born out of necessity from the technological limitations of the past. Now with the capability to rapidly generate high-quality references for even the largest crop genomes, we can readily access the full complement of sequence diversity and structural variation within a crop,” says Kevin Fengler, Comparative Genomics Lead at Corteva Agriscience.

So the field was delighted when a collective of 33 scientists released a 26-line maize pangenome reference collection earlier this year. The collection was created using PacBio sequencing, and includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines. These include the most extensively researched maize lines that represent a broad cross section of modern maize diversity, as well as an additional line containing an abnormal chromosome 10.

It turns out it’s not just maize biology that can be informed by pangenomes. “The high level of diversity in maize is well known, but we see a lot of diversity and structural variation underlying traits of interest in all the crop plants we work on. Creating the first reference genome for a crop genome is a great first step, but things get really interesting as you begin to add more genomes and a more comprehensive view emerges,” adds Fengler.

As for our own species, the current reference genome (GRCh38) – an update of the genome produced by the international Human Genome Project in 2000 and based mostly on DNA from one person – has been added to and annotated through the years, but is still an incomplete sequence and woefully inadequate as a representation of human diversity and genetic variation. Scientists estimate that up to 40 megabases of sequence, including protein-coding regions, are absent from the reference genome.

Several studies using PacBio long reads have reported an average of ~20,000 structural variants (SVs) per human genome, most of which fall within repetitive elements and segmental duplications. Furthermore, the reference genome does not represent the diploid structure of human genomes. Rather, it is an arbitrary linear combination of different haplotypes, or a mosaic of multiple individuals.

Several groups have undertaken efforts to ensure certain populations are better represented in genomic databases, from Sweden to Tibet to Japan. Check out an interactive map of human genomes generated with PacBio sequencing.


When asked about the value a pangenome could bring to human research, Fritz Sedlazeck (@sedlazeck), Assistant Professor at Baylor College of Medicine, said, “the pangenome has the potential to represent the diversity of the human population or any species. This eases the re-identification of complex alleles or even haplotypes.”

And it seems the National Human Genome Research Institute agrees, recently committing $30 million towards the creation of a new human pangenome based on high-quality sequencing of 350 individuals from across the human population, to capture all genomic variation observed in human populations.

“One human genome cannot represent all of humanity. The human pangenome reference will be a key step forward for biomedical research and personalized medicine. Not only will we have 350 genomes representing human diversity, they will be vastly higher quality than previous genome sequences,” said David Haussler, director of the University of California Santa Cruz Genomics Institute, which is leading the project.


How to Generate a Pangenome?

So, what are the most important things to keep in mind when creating a pangenome reference?

First, Fengler says that confidence in your results is essential. “Ideally, all of the references in the pangenome collection will be built with a similar recipe to enable direct comparisons without artifacts from different technologies.” This points to the need for a reliable technology that can generate genomes of equivalent quality for many samples with little variability.

HiFi reads from the Sequel II System enable fast, reliable results.

Second, the data must be high quality. When asked about the importance of long reads to pangenome efforts, Sedlazeck said, “they will be important to distinguish between different alleles/paths in the graph and to characterize novel mutations. Thus, being able to cope with graphs that encode a much higher number of variations to better represent the population.” Along those lines, Fengler adds, “the approach for assembly needs to be robust and accurate such that mis-assembly and sequence errors are not interpreted as structural variation and sequence diversity.”

Lastly, cost and speed have to be taken into account. With the high accuracy of HiFi sequencing, only 10- to 15-fold coverage per haplotype is needed for a high-quality genome assembly, and the analysis time can be cut in half.

“Now researchers no longer need to wait for actionable sequence data,” says Fengler. “For maize, we can generate a high-quality reference genome the same day that the sequencing finishes.”
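The coverage figure quoted above can be turned into a rough sequencing-yield estimate. The sketch below is a back-of-the-envelope calculation, not a PacBio planning tool: the function name is hypothetical, the 2.3 Gb maize genome size comes from earlier in this post, and treating an inbred line as a single haplotype is a simplifying assumption.

```python
def required_yield_gb(genome_size_gb, coverage_per_haplotype, haplotypes=1):
    """Estimate the total HiFi bases needed, in gigabases.

    genome_size_gb: haploid genome size in Gb
    coverage_per_haplotype: target fold coverage for each haplotype
    haplotypes: number of distinct haplotypes to resolve
                (assumed 1 for an inbred line, 2 for a heterozygous diploid)
    """
    return genome_size_gb * coverage_per_haplotype * haplotypes


# Maize inbred line: 2.3 Gb genome, 10- to 15-fold coverage, one haplotype (assumption)
low = required_yield_gb(2.3, 10)
high = required_yield_gb(2.3, 15)
print(f"maize inbred: {low:.1f} to {high:.1f} Gb of HiFi reads")
# prints: maize inbred: 23.0 to 34.5 Gb of HiFi reads
```

Setting haplotypes=2 doubles the estimate, which is why heterozygous diploid samples need roughly twice the data of an inbred line of the same genome size.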


What’s Next in Pangenomes?

Questions remain as to how to fully utilize pangenomes to better understand biology.

As pangenome collections grow, scientists have to tackle questions around how to represent a pangenome. “Which variations should be included into a pangenome? Is it all of them? Then you lose specificity in regions. Is it only the common variations? Then you have a problem with disease-causing variations and other complex regions like HLA,” asks Sedlazeck, highlighting the continued work that needs to be done.

In addition, annotation, visualization, and relationship management are on Fengler’s mind. “A variety of new pangenome analysis and visualization tools are needed to fully realize the value of having a pangenome collection for each crop.”

And then we have to move into functional and translational analysis. Scientists need to be able to take their newfound understanding of variation at the genome level and see how it impacts phenotypes, and whether the variation can be introduced artificially to influence agronomic traits, for instance.

One thing is for sure: the pangenome era is upon us, and whether you need a pangenome to understand important traits or you build tools to interpret those traits, there will be plenty to work on in the coming years!

Interested in finding out more about using HiFi data for pangenome sequencing of your organism of interest? Get in touch with a PacBio scientist to scope out your project.


Explore other posts in the Sequencing 101 series:


The Evolution of DNA Sequencing Tools

Introduction to PacBio Sequencing and the Sequel II System

From DNA to Discovery – The Steps of SMRT Sequencing

Why Are Long Reads Important for Studying Viral Genomes?

Understanding Accuracy in DNA Sequencing

The Value of Sequencing Full-Length RNA Transcripts

Ploidy, Haplotypes, and Phasing – How to Get More from Your Sequencing Data

VIDEO: Sequencing 101: How Long-Read Sequencing Improves Access to Genetic Information
