Scientific publications

Publications featuring PacBio long-read + short-read sequencing data

Nature Communications  |  2025

Human de novo mutation rates from a four-generation pedigree reference

David Porubsky, Harriet Dashnow, Thomas A. Sasani, Glennis A. Logsdon, Pille Hallast, Michelle D. Noyes, Zev N. Kronenberg, Tom Mokveld, Nidhi Koundinya, Cillian Nolan, Cody J. Steely, Andrea Guarracino, Egor Dolzhenko, William T. Harvey, William J. Rowell, Kirill Grigorev, Thomas J. Nicholas, Michael E. Goldberg, Keisuke K. Oshima, Jiadong Lin, Peter Ebert, W. Scott Watkins, Tiffany Y. Leung, Vincent C. T. Hanlon, Evan E. Eichler et al.

Understanding the human de novo mutation (DNM) rate requires complete sequence information [1]. Here using five complementary short-read and long-read sequencing technologies, we phased and assembled more than 95% of each diploid human genome in a four-generation, twenty-eight-member family (CEPH 1463). We estimate 98–206 DNMs per transmission, including 74.5 de novo single-nucleotide variants, 7.4 non-tandem repeat indels, 65.3 de novo indels or structural variants originating from tandem repeats, and 4.4 centromeric DNMs. Among male individuals, we find 12.4 de novo Y chromosome events per generation. Short tandem repeats and variable-number tandem repeats are the most mutable, with 32 loci exhibiting recurrent mutation through the generations. We accurately assemble 288 centromeres and six Y chromosomes across the generations and demonstrate that the DNM rate varies by an order of magnitude depending on repeat content, length and sequence identity. We show a strong paternal bias (75–81%) for all forms of germline DNM, yet we estimate that 16% of de novo single-nucleotide variants are postzygotic in origin with no paternal bias, including early germline mosaic mutations. We place all this variation in the context of a high-resolution recombination map (~3.4 kb breakpoint resolution) and find no correlation between meiotic crossover and de novo structural variants. These near-telomere-to-telomere familial genomes provide a truth set to understand the most fundamental processes underlying human genetic variation.
bioRxiv  |  2025

High quality genome assemblies of African cattle breeds using PacBio HiFi sequencing

Isidore Houaga, Meenu Bhati, Zabron Nziku, Athumani Nguluma, Ntanganedzeni Mapholi, Lucky T. Nesengani, Moses Ogugo, Duhamel C.Y. Sagbo, Loukaïya Zorobouragui, Mariano B.Y. Boco, Appolinaire Djikeng, James G.D. Prendergast, Gregor Gorjanc, Hannes Becher

Africa has a uniquely rich cattle diversity of ∼150 breeds comprising the Bos taurus indicus sub-species, Bos taurus taurus, and their crosses. These represent ∼23% of the global cattle population. However, high quality, representative assemblies are limited for African cattle and especially for indicine breeds. Here we built high quality de novo assemblies for five important African indigenous cattle breeds using PacBio HiFi sequencing: Lagune (Bos taurus taurus), Gudali, Iringa Red and Singida White (Bos taurus indicus), and Mpwapwa (Bos taurus taurus x Bos taurus indicus). These new assemblies are the most contiguous and complete African cattle assemblies produced so far, with genome sizes of 3.25–3.36 Gb, contig N50s ranging from 83.59 Mb to 97.87 Mb and scaffold N50s from 100.30 Mb to 113.37 Mb. BUSCO genome completeness scores were also higher than 99.68%, indicative of highly contiguous assemblies. These improved and highly contiguous genome assemblies are consequently a valuable resource for future African and global livestock genomic studies.
Genome Research  |  2025

Analytical validation of germline small variant detection using long-read HiFi genome sequencing

Nathan Hammond et al

Long-read sequencing has the capacity to interrogate difficult genomic regions and phase variants; however, short-read sequencing is more commonly implemented for clinical testing. Given the advances in long-read HiFi sequencing chemistry and variant calling, we analytically validated this technology for small variant detection (single nucleotide variants, insertions/deletions; SNVs/indels; <50 bp). HiFi genome sequencing was performed on DNA from reference materials and clinical specimen types, and accuracy results were compared to short-read genome sequencing data. HiFi genome sequencing recall and precision across Genome in a Bottle (GIAB)-defined nondifficult and difficult genomic regions (high confidence) for SNVs were >99.9% and >99.7%, respectively, and for indels were >99.8% and >99.1%, respectively. Moreover, HiFi genome sequencing outperformed short-read genome sequencing on overall SNV/indel F1-score accuracy at all paired sequencing depths, which were further stratified across 100 total GIAB-defined genomic regions for a comprehensive evaluation of performance. Of note, HiFi genome sequencing F1-scores for SNVs and indels surpassed 99% at ~15× and ~25×, respectively. In addition, high confidence small variant concordance across all HiFi genome sequencing reproducibility assessments (two specimens, three independent sequencing datasets) was >99.8% for SNVs and >98.6% for indels, and average high confidence small variant concordance between paired blood, saliva, and swab specimens was >99.8% in all cases. Taken together, these data underscore that long-read HiFi genome sequencing detection of SNVs and indels is very accurate and robust, which supports the implementation of this technology for clinical diagnostic testing.
Nature  |  2025

Complete sequencing of ape genomes

Yoo, D., Rhie, A., Hebbar, P. et al.

The most dynamic and repetitive regions of great ape genomes have traditionally been excluded from comparative studies [1,2,3]. Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang. We achieve chromosome-level contiguity with substantial sequence accuracy (<1 error in 2.7 megabases) and completely sequence 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, to provide in-depth evolutionary insights. Comparative analyses enabled investigations of the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference genome. Such regions include newly minted gene families in lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes and subterminal heterochromatin. This resource serves as a comprehensive baseline for future evolutionary studies of humans and our closest living ape relatives.
PLOS Computational Biology  |  2025

Analysis of targeted and whole genome sequencing of PacBio HiFi reads for a comprehensive genotyping of gene-proximal and phenotype-associated Variable Number Tandem Repeats

Sara Javadzadeh, Aaron Adamson, Jonghun Park, Se-Young Jo, Yuan-Chun Ding, Mehrdad Bakhtiari, Vikas Bansal, Susan L. Neuhausen, Vineet Bafna

Variable Number Tandem Repeats (VNTRs) refer to repeating motifs of size greater than five bp. VNTRs are an important source of genetic variation, and have been associated with multiple Mendelian and complex phenotypes. However, the highly repetitive structures require reads to span the region for accurate genotyping. Pacific Biosciences HiFi sequencing spans large regions and is highly accurate but relatively expensive. Therefore, targeted sequencing approaches coupled with long-read sequencing have been proposed to improve efficiency and throughput. In this paper, we systematically explored the trade-off between targeted and whole genome HiFi sequencing for genotyping VNTRs. We curated a set of 10,787 gene-proximal (G-)VNTRs, and 48 phenotype-associated (P-)VNTRs of interest. Illumina reads only spanned 46% of the G-VNTRs and 71% of P-VNTRs, motivating the use of HiFi sequencing. We performed targeted sequencing with hybridization by designing custom probes for 9,999 VNTRs and sequenced 8 samples using HiFi and Illumina sequencing, followed by adVNTR genotyping. We compared these results against HiFi whole genome sequencing (WGS) data from 28 samples in the Human Pangenome Reference Consortium (HPRC). With the targeted approach only 4,091 (41%) G-VNTRs and only 4 (8%) of P-VNTRs were spanned with at least 15 reads. A smaller subset of 3,579 (36%) G-VNTRs had higher median coverage of at least 63 spanning reads. The spanning behavior was consistent across all 8 samples. Among 5,638 VNTRs with low coverage (<15), 67% were located within GC-rich regions (>60%). In contrast, the 40X WGS HiFi dataset spanned 98% of all VNTRs and 49 (98%) of P-VNTRs with at least 15 spanning reads, albeit with lower coverage. Spanning reads were sufficient for accurate genotyping in both cases. Our findings demonstrate that targeted sequencing provides consistently high coverage for a small subset of low-GC VNTRs, but WGS is more effective for broad and sufficient sampling of a large number of VNTRs.
bioRxiv  |  2025

Genetic diversity and regulatory features of human-specific NOTCH2NL duplications

Taylor D. Real, Prajna Hebbar, DongAhn Yoo, Francesca Antonacci, Ivana Pačar, Mark Diekhans, Gregory J. Mikol, Oyeronke G. Popoola, Benjamin J. Mallory, Mitchell R. Vollger, Philip C. Dishuck, Xavi Guitart, Allison N. Rozanski, Katherine M. Munson, Kendra Hoekzema, Jane E. Ranchalis, Shane J. Neph, Adriana E. Sedeño-Cortes, Benedict Paten, Sofie R. Salama, Andrew B. Stergachis, Evan E. Eichler

NOTCH2NL (NOTCH2-N-terminus-like) genes arose from incomplete, recent chromosome 1 segmental duplications implicated in human brain cortical expansion. Genetic characterization of these loci and their regulation is complicated by the fact they are embedded in large, nearly identical duplications that predispose to recurrent microdeletion syndromes. Using nearly complete long-read assemblies generated from 67 human and 12 ape haploid genomes, we show independent recurrent duplication among apes with functional copies emerging in humans ∼2.1 million years ago. We distinguish NOTCH2NL paralogs present in every human haplotype (NOTCH2NLA) from copy number variable ones. We also characterize large-scale structural variation, including gene conversion, for 28% of haplotypes leading to a previously undescribed paralog, NOTCH2tv. Finally, we apply Fiber-seq and long-read transcript sequencing to human cortical neurospheres to characterize the regulatory landscape and find that the most fixed paralogs, NOTCH2 and NOTCH2NLA, harbor the greatest number of paralog-specific elements potentially driving their regulation.
Nature  |  2025

Solanum pan-genetics reveals paralogues as contingencies in crop engineering

Benoit, M., Jenike, K.M., Satterlee, J.W. et al.

Pan-genomics and genome-editing technologies are revolutionizing breeding of global crops [1,2]. A transformative opportunity lies in exchanging genotype-to-phenotype knowledge between major crops (that is, those cultivated globally) and indigenous crops (that is, those locally cultivated within a circumscribed area) [3,4,5] to enhance our food system. However, species-specific genetic variants and their interactions with desirable natural or engineered mutations pose barriers to achieving predictable phenotypic effects, even between related crops [6,7]. Here, by establishing a pan-genome of the crop-rich genus Solanum [8] and integrating functional genomics and pan-genetics, we show that gene duplication and subsequent paralogue diversification are major obstacles to genotype-to-phenotype predictability. Despite broad conservation of gene macrosynteny among chromosome-scale references for 22 species, including 13 indigenous crops, thousands of gene duplications, particularly within key domestication gene families, exhibited dynamic trajectories in sequence, expression and function. By augmenting our pan-genome with African eggplant cultivars [9] and applying quantitative genetics and genome editing, we dissected an intricate history of paralogue evolution affecting fruit size. The loss of a redundant paralogue of the classical fruit size regulator CLAVATA3 (CLV3) [10,11] was compensated by a lineage-specific tandem duplication. Subsequent pseudogenization of the derived copy, followed by a large cultivar-specific deletion, created a single fused CLV3 allele that modulates fruit organ number alongside an enzymatic gene controlling the same trait. Our findings demonstrate that paralogue diversifications over short timescales are underexplored contingencies in trait evolvability. Exposing and navigating these contingencies is crucial for translating genotype-to-phenotype relationships across species.
Nature Genetics  |  2025

Long-read RNA sequencing atlas of human microglia isoforms elucidates disease-associated genetic regulation of splicing

Humphrey, J., Brophy, E., Kosoy, R. et al.

Microglia, the innate immune cells of the central nervous system, have been genetically implicated in multiple neurodegenerative diseases. Mapping the genetics of gene expression in human microglia has identified several loci associated with disease-associated genetic variants in microglia-specific regulatory elements. However, identifying genetic effects on splicing is challenging because of the use of short sequencing reads. Here, we present the isoform-centric microglia genomic atlas (isoMiGA), which leverages long-read RNA sequencing to identify 35,879 novel microglia isoforms. We show that these isoforms are involved in stimulation response and brain region specificity. We then quantified the expression of both known and novel isoforms in a multi-ancestry meta-analysis of 555 human microglia short-read RNA sequencing samples from 391 donors, and found associations with genetic risk loci in Alzheimer’s and Parkinson’s disease. We nominate several loci that may act through complex changes in isoform and splice-site usage.
Scientific Data  |  2025

A chromosome-level genome assembly of the male darkbarbel catfish (Pelteobagrus vachelli) using PacBio HiFi and Hi-C data

Liu, H., Zhang, J., Cui, T. et al

The darkbarbel catfish (Pelteobagrus vachelli), a species of significant economic value in China’s aquaculture sector, is widely utilized in hybrid yellow catfish production due to its exceptional growth rate. The growth rate of male P. vachelli is significantly higher compared to females, making all-male breeding a promising market opportunity. Therefore, the analysis of the male P. vachelli genome provides crucial genetic information for hybrid breeding and all-male breeding. Utilizing PacBio HiFi long-read sequencing and Hi-C technologies, we present a high-quality, chromosome-level genome assembly for the male P. vachelli. The assembly covers 728.88 Mb with 99.92% of the sequence distributed across 26 chromosomes. The contig N50 is 5.60 Mb, and the scaffold N50 is 28.76 Mb. The completeness of the P. vachelli genome assembly is highlighted by a BUSCO score of 97.45%. The genome is estimated to encode 25,121 protein-coding genes, with 93.46% annotated functionally and a BUSCO score of 96.40%. Repeat elements constitute approximately 38.97% of the genome. This comprehensive genome assembly represents an invaluable resource for advancing hybrid breeding, comparative genomics, and evolutionary studies in catfish and related species.
Liebert Pub  |  2025

Analysis of HIV-1-Based Lentiviral Vector Particle Composition by PacBio Long-Read Nucleic Acid Sequencing

Saqlain Suleman, Mohammad S. Khalifa, Serena Fawaz, Sharmin Alhaque, Yaghoub Chinea, and Michael Themis

Lentivirus (LV) vectors offer permanent delivery of therapeutic genes to the host through an RNA intermediate genome. They are one of the most commonly used vectors for clinical gene therapy of inherited disorders such as immune deficiencies and cancer immunotherapy. One of the most difficult challenges facing their widespread application to patients is the large-scale production of highly pure vector stocks. To improve vector production and downstream purification, there has been a recent investment in the United Kingdom to establish good manufacturing process (GMP)-licensed centers for manufacture and quality control. Other requirements for these vectors include their target cell specificity and tropism, how to regulate gene expression of the therapeutic payload and their potential side effects. Comprehensive detail on the full nucleic acid content of LV is unknown, even though they have entered clinical trials. With potential adverse effects in mind, it is important to identify these contents to assess their safety and purity. In this study, we used highly sensitive PacBio long-distance, next-generation sequencing of reverse-transcribed vector component RNA to investigate the nucleic acid composition of recombinant HIV-1 particles generated by human 293T packaging cells. In this article, we describe our findings of nucleic acids other than the recombinant vector genome that exist, which could potentially be delivered during gene transfer, and suggest that removal of these unwanted components be considered before clinical LV application.
bioRxiv  |  2025

The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations

Uddalok Jana, Oscar L. Rodriguez, William Lees, Eric Engelbrecht, Zach Vanwinkle, Ayelet Peres, William S. Gibson, Kaitlyn Shields, Steven Schultze, Abdullah Dorgham, Matthew Emery, Gintaras Deikus, Robert Sebra, Evan E. Eichler, Gur Yaari, Melissa L. Smith, Corey T. Watson

The immunoglobulin heavy chain constant (IGHC) domain of antibodies (Ab) is responsible for effector functions critical to Ab-mediated immunity. In humans, this domain is encoded by genes within the IGHC locus, where descriptions of genomic diversity remain incomplete. To address this, we utilized long-read genomic datasets to build a high-quality IGHC haplotype/variant catalog from 105 individuals of diverse ancestry, and developed a high-throughput approach for targeted long-read IGHC locus sequencing and assembly. From locally phased assemblies, we discovered previously uncharacterized single nucleotide variants (SNVs) and complex structural variants (SVs, n=7), as well as novel genes and alleles. Of the 262 identified IGHC coding alleles, 235 (89.6%) were undocumented. SNV, SV, and gene allele/genotype frequencies revealed significant population differentiation, including (i) hundreds of SNVs in African and East Asian populations exceeding a fixation index (FST) of 0.3, and (ii) an IGHG4 haplotype carrying specific coding variants uniquely enriched in East and South Asian populations. Our results illuminate missing signatures of haplotype diversity in the IGHC locus, including evidence of natural selection, and establish a new foundation for investigating IGHC germline variation and its role in Ab function and disease.
medRxiv  |  2025

Long-read sequencing resolves the clinically relevant CYP21A2 locus, supporting a new clinical test for Congenital Adrenal Hyperplasia

Jean Monlong, Xiao Chen, Hayk Barseghyan, William J Rowell, Shloka Negi, Natalie Nokoff, Lauren Mohnach, Josephine Hirsch, Courtney Finlayson, Catherine E. Keegan, Miguel Almalvez, Seth I. Berger, Ivan de Dios, Brandy McNulty, Alex Robertson, Karen H. Miga, Phyllis W. Speiser, Benedict Paten, Eric Vilain, Emmanuèle C. Délot

Both HiFi-based and nanopore-based whole-genome long-read sequencing datasets could be mined to accurately identify pathogenic single-nucleotide variants, full gene deletions, fusions creating non-functional hybrids between the gene and pseudogene (“30-kb deletion”), as well as count the number of RCCX modules and phase the resulting multimodular haplotypes. On the HiFi dataset of 6 samples, the PacBio Paraphase tool was able to distinguish nine different mono-, bi-, and tri-modular haplotypes, as well as the 30-kb and whole gene deletions. To do the same on the ONT-Nanopore dataset, we designed a tool, Parakit, which creates an enriched local pangenome to represent known haplotype assemblies and map ClinVar pathogenic variants and fusions onto them. With few labels in the region, optical genome mapping was not able to reliably resolve module counts or fusions, although designing a tool to mine the dataset specifically for this region may allow doing so in the future. Both sequencing techniques yielded congruent results, matching clinically identified variants, and offered additional information above the clinical test, including phasing, count of RCCX modules, and status of the other module genes, all of which may be of clinical relevance. Thus, long-read sequencing could be used to identify variants causing multiple forms of CAH in a single test.
Oxford Academics  |  2025

Long and Accurate: How HiFi Sequencing is Transforming Genomics

Bo Wang, Peng Jia, Shenghan Gao, Huanhuan Zhao, Gaoyang Zheng, Linfeng Xu, Kai Ye

Recent developments in PacBio high-fidelity (HiFi) sequencing technologies have transformed genomic research, with circular consensus sequencing now achieving 99.9% accuracy for long (up to 25 kb) single-molecule reads. This method circumvents biases intrinsic to amplification-based approaches, enabling thorough analysis of complex genomic regions [including tandem repeats, segmental duplications, ribosomal DNA (rDNA) arrays, and centromeres] as well as direct detection of base modifications, furnishing both sequence and epigenetic data concurrently. This has streamlined a number of tasks including genome assembly, variant detection, and full-length transcript analysis. This review provides a comprehensive overview of the applications and challenges of HiFi sequencing across various fields, including genomics, transcriptomics, and epigenetics. By delineating the evolving landscape of HiFi sequencing in multi-omics research, we highlight its potential to deepen our understanding of genetic mechanisms and to advance precision medicine.
bioRxiv  |  2025

CiFi: Accurate long-read chromatin conformation capture with low-input requirements

Sean P. McGinty, Gulhan Kaya, Sheina B. Sim, Renée Lynn Corpuz, Michael A. Quail, Mara K. N. Lawniczak, Scott M. Geib, Jonas Korlach, Megan Y. Dennis

By coupling chromatin conformation capture (3C) with PacBio HiFi long-read sequencing, we have developed a new method (CiFi) that enables analysis of genome interactions across repetitive genomic regions with low-input requirements. CiFi produces multiple interacting concatemer segments per read, facilitating genome assembly and scaffolding. Together, the approach enables genomic analysis of previously recalcitrant low-complexity loci, and of small organisms such as single insect individuals.