Delve deeper into the importance of capturing diversity with population-specific reference pangenomes.
The movement towards the development of pangenomes is gaining steam, bringing the reality of truly personalized medicine closer. The first human pangenome was developed using genetic data from 47 individuals. But knowing what we do now about genomic diversity, researchers recognize that just one human pangenome will not fit everyone.
Earlier this month, we published a piece describing the impact of a recent study authored by Professors Kai Ye of Xi’an Jiaotong University, Xi’an, China, and Shu-hua Xu of Fudan University, Shanghai, China. The study, which appears in Nature, notes that while nearly 60% of the global human population lives in Asia, the peoples of the world’s largest continent have been underrepresented in the pangenomes created so far.
A few years ago, researchers set out to address this gap through the Chinese Pangenome Consortium (CPC), launching the first of three phases of a project aimed at better representing the genetic diversity of Chinese populations.
Why diversity matters
“Unlike previous genetic studies of populations in China, which were mainly targeted at revealing genetic relationships and genetic history of populations, we attempted to uncover missing sequences and hidden variations that have not been identified before in Chinese ethnic groups,” stated Prof. Shu-hua Xu. “For example, about 18.4% of the small variants and 17.1% of the structural variants (SVs) identified were specific to the CPC assemblies compared with a recently released pangenome reference by the Human Pangenome Reference Consortium (HPRC). These newly identified genomic variations are more informative and thus can facilitate uncovering finer-scale population relationships, as the majority of the novel variations are population-specific.”
The CPC reference pangenome offers one of the most comprehensive views of genomic variation in an East Asian population constructed to date. As Prof. Shu-hua Xu noted, “our results suggest that the use of population-specific references in sequence alignment improved the alignment quality. Compared with the HPRC reference, using the CPC reference improved the perfect alignment rate of short reads in East Asian samples. It would also help to improve the accuracy of profiling parts of the genome enriched with complex sequence variations such as genes regulating the immune system.”
As shared in the paper, Phase I of this study captured a large number of missing sequences and hidden variations from a collection of 116 high-quality and haplotype-phased de novo assemblies from 36 underrepresented minority Chinese ethnic groups. “In particular, our efforts added 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to the current state-of-the-art reference, GRCh38. We identified 15.9 million small variants and 78 thousand structural variants, of which 5.9 million small variants and 34 thousand structural variants were not reported elsewhere,” Prof. Shu-hua Xu said. These formerly missing sequences have the potential to help researchers trace missing links in human evolution, identify heritability for disease mapping and aid medical research today and in the future.
Advantages of HiFi pangenome references
Prof. Kai Ye, Prof. Shu-hua Xu, and their team collected long-read, high-accuracy HiFi data as the main data type for their study and combined assembly and alignment to resolve sequences and structures. Prof. Kai Ye explained, “to avoid artifacts introduced during cell line immortalization processes, we directly extracted DNA from blood samples. Additionally, to ensure that the samples represent various ethnic characteristics, we required that the samples be from three generations of the same ethnic group.”
Creating a population-specific pangenome is not only a best practice; with the right sequencing technology, it is also achievable. For population genomics efforts like the CPC, a platform like the Revio system enables accurate long-read sequencing at scale, with increased throughput for larger cohorts. With HiFi reads, researchers can obtain phased genomes with high accuracy, greater completeness, and better resolution of all variant classes; robust coverage across challenging, repeat-rich regions; and genome-wide methylation status for multiomics studies. And as the CPC project found, HiFi sequencing is well suited to building population-specific data sets that reach well beyond the limits of traditional short-read technology.
“Compared to traditional linear genomes, pangenomics characterizes genetic diversity of multiple populations in a graph genome representation. Its built-in prior variant information provides a potential set of functional variants for disease research, addressing the bottleneck of missing inheritability faced by researchers,” said Prof. Kai Ye. “Current applications such as disease identification using resequencing strategies are severely limited by the sequences contained in linear genomes. However, the pangenomic graph representation reveals previously unrepresented variants from diverse populations and those with lower population frequencies, enabling individualized disease identification.”
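To make the contrast between a linear reference and a graph representation concrete, here is a minimal toy sketch in Python. It is not the CPC's method or any real pangenome tool: the node sequences, the path names, and the helper functions `spell` and `haplotypes_containing` are all hypothetical, and exact substring matching stands in for real read alignment. The idea it illustrates is only that a graph stores several haplotypes as walks over shared sequence segments, so a read carrying a population-specific insertion can find a match that a single linear reference does not contain.

```python
from typing import Dict, Iterable, List

# Hypothetical toy graph: node id -> sequence segment.
nodes: Dict[int, str] = {
    1: "ACGT",   # flank shared by every haplotype
    2: "G",      # allele carried by the linear reference
    3: "T",      # alternate single-nucleotide allele
    4: "TTAGC",  # insertion present in only some haplotypes
    5: "CCA",    # flank shared by every haplotype
}

# Each haplotype is a walk through the graph (an ordered list of node ids).
paths: Dict[str, List[int]] = {
    "linear_reference": [1, 2, 5],
    "haplotype_A": [1, 3, 5],
    "haplotype_B": [1, 2, 4, 5],
}


def spell(path: List[int]) -> str:
    """Concatenate node sequences along a walk to get the haplotype sequence."""
    return "".join(nodes[node_id] for node_id in path)


def haplotypes_containing(read: str, path_names: Iterable[str]) -> List[str]:
    """Return the named paths whose spelled sequence contains the read exactly.

    Exact substring matching is a deliberate simplification of alignment.
    """
    return [name for name in path_names if read in spell(paths[name])]


if __name__ == "__main__":
    read = "GTTAGCC"  # spans the insertion stored in node 4
    print("Linear reference only:", haplotypes_containing(read, ["linear_reference"]))
    print("All paths in the graph:", haplotypes_containing(read, paths))
```

In this toy setup, the read finds no match when only the linear reference path is searched, but it matches the haplotype that carries the insertion once every path in the graph is available.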
We can’t wait to see what Phase II of the study will bring. The team will sequence 1,000 samples encompassing 56 ethnic groups, and the possibilities for discovery are endless.
To learn more about PacBio solutions for your population genomics study, we invite you to reach out to one of our scientists or check out our recent webinar, “Shining a Light on Dark Genes in Population Sequencing.”