A new Arab pangenome reference has been constructed from 43 individuals, enabling the study of variants and sequences of significance to the Arab population.
In a newly released preprint entitled “A draft Arab pangenome reference,” researchers from the UAE and several other countries built both nuclear and mitochondrial pangenomes from long-read data generated by a combination of sequencing technologies, for which PacBio HiFi long reads “provided the base with highly accurate long-reads”.
A new frontier for pangenome construction
Arabs constitute culturally diverse communities that together make up about 5% of the global population. They are nonetheless underrepresented in global sequencing projects: neither the HPRC pangenome nor the 1000 Genomes Project sampled any Arab population. According to the authors, the lack of reference genomes for Arab populations has limited the investigation of genetic diversity and of the genetic underpinnings of numerous diseases.
To build the pangenome, the long sequence reads (average median Q score of 32.85 for PacBio HiFi and 17.39 for ONT) were assembled de novo using Hifiasm for every sample in the cohort. “This yielded high-quality contiguous (average N50=106.81 Mb) de novo assemblies that used over 99% of the sequences constructing haplotype phased diploid genome assemblies with 88% exhibited larger genome length (average 3.01 gigabase) than the prevailing human reference GRCh38”. As in other pangenome projects, a pangenome graph was built using Minigraph Cactus (v2.6.7), integrating 86 long-read assemblies into a single graph structure. For small variant analysis, “joint calling analysis incorporating Deepvariant (GRCh38) for HiFi data, identified an average of 4,421,702 single nucleotide variants (SNVs) and 847,117 indels”. The mitochondrial Arab pangenome (mtAPR) was also constructed from high-quality HiFi reads from the 43 individuals.
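For readers unfamiliar with the N50 contiguity metric quoted above, here is a minimal Python sketch of how it is commonly computed: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions, not part of the authors' pipeline or data.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of
    the contig at which the cumulative length first reaches at least
    half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: 80 + 70 = 150, half of the 300 total
```

A higher N50 means that more of the assembly sits in long contigs, which is why the metric is often reported as a shorthand for assembly contiguity.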
The scientists then compared their assemblies to existing references and found an average of 30.84 Mb and 76.83 Mb of assembled contigs that did not align with CHM13 and GRCh38, respectively, confirming the underrepresentation of Arab genome diversity in these references.
Exciting new findings from the Arab pangenome
When comparing the Arab pangenome graph with the HPRC and CPC pangenome graphs, the authors found that each individual genome contained an average of 5,044,179 total and 743,379 unique small variants, and that 10.68 million small variants were unique to the new Arab pangenome. Each sample also had an average of 8,302 unique structural variants (SVs), yielding 108,709 SVs that were unique to the Arab pangenome.
The authors also looked at gene duplications and many other quality aspects of the data and concluded, “Our study provides a valuable resource for future genetic research and genomic medicine initiatives in the Arab populations,” and “will enable the probing of disease associations with variants and sequences that are unique or prevalent in Arab populations.”
This publication is another great example of the recognized value of HiFi WGS data for population genetics and precision health research, and points to the added benefits of long-read sequencing when building population-specific reference genomes.