Scientific publications

Publications featuring PacBio long-read + short-read sequencing data

Biorxiv  |  2024

Microflora Danica: the atlas of Danish environmental microbiomes

Singleton, Jensen, Delogu, Sørensen, Jørgensen, Karst, Yang, Knudsen, Sereika, Petriglieri, Knutsson, Dall, Kirkegaard, Kristensen, Woodcroft, Speth, Aroney, The Microflora Danica Consortium, Wagner, Dueholm, Nielsen, Albertsen

In this preprint, scientists from Denmark, Australia, and Austria used HiFi sequencing for rRNA operon sequencing of 449 microbiome samples, multiplexed in pools of 92 samples. The resulting dataset comprises 14.9 million bacterial rRNA operon sequences (median 4,528 bp, containing both 16S and 23S) and 13.4 million eukaryotic rRNA operon sequences (median 4,035 bp, containing both 18S and 28S). According to the authors, “This dataset is an order of magnitude larger than the current most comprehensive database SILVA 138.1” and provides “an unprecedented resource and the foundation for answering fundamental questions underlying microbial ecology: what drives microbial diversity, distribution and function.”
Biorxiv  |  2024

A haplotype-resolved view of human gene regulation

Mitchell R. Vollger, Elliott G. Swanson, Shane J. Neph, Jane Ranchalis, Katherine M. Munson, Ching-Huang Ho, Adriana E. Sedeño-Cortés, William E. Fondrie, Stephanie C. Bohaczuk, Yizi Mao, Nancy L. Parmalee, Benjamin J. Mallory, William T. Harvey, Younjun Kwon, Gage H. Garcia, Kendra Hoekzema, Jeffrey G. Meyer, Mine Cicek, Evan E. Eichler, William S. Noble, Daniela M. Witten, James T. Bennett, John P. Ray, Andrew B. Stergachis

Most human cells contain two non-identical genomes, and differences in their regulation underlie human development and disease. We demonstrate that Fiber-seq Inferred Regulatory Elements (FIREs) enable the accurate quantification of chromatin accessibility across the 6 Gbp diploid human genome with single-molecule and single-nucleotide precision. We find that cells can harbor >1,000 regulatory elements with haplotype-selective chromatin accessibility (HSCA) and show that these elements preferentially localize to genomic loci containing the most human genetic diversity, with the human leukocyte antigen (HLA) locus showing the largest amount of HSCA genome-wide in immune cells. Furthermore, we uncover HSCA elements with sequence non-deterministic chromatin accessibility, representing likely somatic epimutations, and show that productive transcription from the inactive X chromosome is buttressed by clustered promoter-proximal elements that escape X chromosome inactivation.
Biorxiv  |  2024

Structural polymorphism and diversity of human segmental duplications

Hyeonsoo Jeong, Philip C. Dishuck, DongAhn Yoo, William T. Harvey, Katherine M. Munson, Alexandra P. Lewis, Jennifer Kordosky, Gage H. Garcia, Human Genome Structural Variation Consortium (HGSVC), Feyza Yilmaz, Pille Hallast, Charles Lee, Tomi Pastinen, Evan E. Eichler

In this preprint, HiFi sequencing reveals segmental duplications (SDs) that short reads cannot resolve. This missing information is essential for understanding human disease, evolution, and diversity. Researchers from HGSVC, UW, Altos Labs, JAX, and CMKC conducted a study including a “population genetics survey of SDs by analyzing 170 [all HiFi] human genome assemblies where the majority of SDs are fully resolved using long-read sequence assembly.”
Nature  |  2024

The complex polyploid genome architecture of sugarcane

A. L. Healey, O. Garsmeur, J. T. Lovell, S. Shengquiang, A. Sreedasyam, J. Jenkins, C. B. Plott, N. Piperidis, N. Pompidor, V. Llaca, C. J. Metcalfe, J. Doležel, P. Cápal, J. W. Carlson, J. Y. Hoarau, C. Hervouet, C. Zini, A. Dievart, A. Lipzen, M. Williams, L. B. Boston, J. Webber, K. Keymanesh, S. Tejomurthula, S. Rajasekar, R. Suchecki, A. Furtado, G. May, P. Parakkal, B. A. Simmons, K. Barry, R. J. Henry, J. Grimwood, K. S. Aitken, J. Schmutz & A. D’Hont

Sugarcane, the world’s most harvested crop by tonnage, has shaped global history, trade and geopolitics, and is currently responsible for 80% of sugar production worldwide. While traditional sugarcane breeding methods have effectively generated cultivars adapted to new environments and pathogens, sugar yield improvements have recently plateaued. The cessation of yield gains may be due to limited genetic diversity within breeding populations, long breeding cycles and the complexity of its genome, the latter preventing breeders from taking advantage of the recent explosion of whole-genome sequencing that has benefited many other crops. Thus, modern sugarcane hybrids are the last remaining major crop without a reference-quality genome. Here we take a major step towards advancing sugarcane biotechnology by generating a polyploid reference genome for R570, a typical modern cultivar derived from interspecific hybridization between the domesticated species (Saccharum officinarum) and the wild species (Saccharum spontaneum). In contrast to the existing single haplotype (‘monoploid’) representation of R570, our 8.7 billion base assembly contains a complete representation of unique DNA sequences across the approximately 12 chromosome copies in this polyploid genome. Using this highly contiguous genome assembly, we filled a previously unsized gap within an R570 physical genetic map to describe the likely causal genes underlying the single-copy Bru1 brown rust resistance locus. This polyploid genome assembly with fine-grain descriptions of genome architecture and molecular targets for biotechnology will help accelerate molecular and transgenic breeding and adaptation of sugarcane to future environmental conditions.
Biorxiv  |  2024

Genome-wide profiling of highly similar paralogous genes using HiFi sequencing

Xiao Chen, Daniel Baker, Egor Dolzhenko, Joseph M Devaney, Jessica Noya, April S Berlyoung, Rhonda Brandon, Kathleen S Hruska, Lucas Lochovsky, Paul Kruszka, Scott Newman, Emily Farrow, Isabelle Thiffault, Tomi Pastinen, Dalia Kasperaviciute, Christian Gilissen, Lisenka Vissers, Alexander Hoischen, Seth Berger, Eric Vilain, Emmanuèle Délot, Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium, Michael A Eberle

Variant calling is hindered in segmental duplications by sequence homology. We developed Paraphase, a HiFi-based informatics method that resolves highly similar genes by phasing all haplotypes of a gene family. We applied Paraphase to 160 long (>10 kb) segmental duplication regions across the human genome with high (>99%) sequence similarity, encoding 316 genes. Analysis across five ancestral populations revealed highly variable copy numbers of these regions. We identified 23 families with exceptionally low within-family diversity, where extensive gene conversion and unequal-crossing over have resulted in highly similar gene copies. Furthermore, our analysis of 36 trios identified 7 de novo SNVs and 4 de novo gene conversion events, 2 of which are non-allelic. Finally, we summarized extensive genetic diversity in 9 medically relevant genes previously considered challenging to genotype. Paraphase provides a framework for resolving gene paralogs, enabling accurate testing in medically relevant genes and population-wide studies of previously inaccessible genes.
Nature  |  2024

Single-cell long-read sequencing-based mapping reveals specialized splicing patterns in developing and adult mouse and human brain

Anoushka Joglekar, Wen Hu, Bei Zhang, Oleksandr Narykov, Mark Diekhans, Jordan Marrocco, Jennifer Balacco, Lishomwa C. Ndhlovu, Teresa A. Milner, Olivier Fedrigo, Erich D. Jarvis, Gloria Sheynkman, Dmitry Korkin, M. Elizabeth Ross & Hagen U. Tilgner

RNA isoforms influence cell identity and function. However, a comprehensive brain isoform map was lacking. We analyze single-cell RNA isoforms across brain regions, cell subtypes, developmental time points and species. For 72% of genes, full-length isoform expression varies along one or more axes. Splicing, transcription start and polyadenylation sites vary strongly between cell types, influence protein architecture and associate with disease-linked variation. Additionally, neurotransmitter transport and synapse turnover genes harbor cell-type variability across anatomical regions. Regulation of cell-type-specific splicing is pronounced in the postnatal day 21-to-postnatal day 28 adolescent transition. Developmental isoform regulation is stronger than regional regulation for the same cell type. Cell-type-specific isoform regulation in mice is mostly maintained in the human hippocampus, allowing extrapolation to the human brain. Conversely, the human brain harbors additional cell-type specificity, suggesting gain-of-function isoforms. Together, this detailed single-cell atlas of full-length isoform regulation across development, anatomical regions and species reveals an unappreciated degree of isoform variability across multiple axes.
Biorxiv  |  2024

Transcript Isoform Diversity of Y Chromosome Ampliconic Genes of Great Apes Uncovered Using Long Reads and Telomere-to-Telomere Reference Genome Assemblies

Aleksandra Greshnova, Karol Pál, Juan Francisco Iturralde Martinez, Stefan Canzar, Kateryna D. Makova

Y chromosomes of great apes harbor Ampliconic Genes (YAGs)—multi-copy gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) that encode proteins important for spermatogenesis. Previous work assembled YAG transcripts based on their targeted sequencing but not using reference genome assemblies, potentially resulting in an incomplete transcript repertoire. Here we used the recently produced gapless telomere-to-telomere (T2T) Y chromosome assemblies of great ape species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan) and analyzed RNA data from whole-testis samples for the same species. We generated hybrid transcriptome assemblies by combining targeted long reads (Pacific Biosciences), untargeted long reads (Pacific Biosciences) and untargeted short reads (Illumina) and mapping them to the T2T reference genomes. Compared to the results from the reference-free approach, average transcript length was more than two times higher, and the total number of transcripts decreased three times, improving the quality of the assembled transcriptome. The reference-based transcriptome assemblies allowed us to differentiate transcripts originating from different Y chromosome gene copies and from their non-Y chromosome homologs. We identified two sources of transcriptome diversity—alternative splicing and gene duplication with subsequent diversification of gene copies. For each gene family, we detected transcribed pseudogenes along with protein-coding gene copies. We revealed previously unannotated gene copies of YAGs as compared to currently available NCBI annotations, as well as novel isoforms for annotated gene copies. This analysis paves the way for better understanding Y chromosome gene functions, which is important given their role in spermatogenesis.
Nature Methods  |  2024

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

Francisco J. Pardo-Palacios, Angeles Arzalluz-Luque, Liudmyla Kondratova, Pedro Salguero, Jorge Mestre-Tomás, Rocío Amorín, Eva Estevan-Morió, Tianyuan Liu, Adalena Nanni, Lauren McIntyre, Elizabeth Tseng & Ana Conesa

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.
Biorxiv  |  2024

Addressing technical pitfalls in pursuit of molecular factors that mediate immunoglobulin gene regulation

Eric Engelbrecht, Oscar L. Rodriguez, Corey T. Watson (Department of Biochemistry and Molecular Genetics, University of Louisville, Louisville, KY, USA)

The expressed antibody repertoire is a critical determinant of immune-related phenotypes. Antibody-encoding transcripts are distinct from other expressed genes because they are transcribed from somatically rearranged gene segments. Human antibodies are composed of two identical heavy and light chain polypeptides derived from genes in the immunoglobulin heavy chain (IGH) locus and one of two light chain loci. The combinatorial diversity that results from antibody gene rearrangement and the pairing of different heavy and light chains contributes to the immense diversity of the baseline antibody repertoire. During rearrangement, antibody gene selection is mediated by factors that influence chromatin architecture, promoter/enhancer activity, and V(D)J recombination. Interindividual variation in the composition of the antibody repertoire associates with germline variation in IGH, implicating polymorphism in antibody gene regulation. Determining how IGH variants directly mediate gene regulation will require integration of these variants with other functional genomic datasets. Here, we argue that standard approaches using short reads have limited utility for characterizing regulatory regions in IGH at haplotype-resolution. Using simulated and ChIP-seq reads, we define features of IGH that limit use of short reads and a single reference genome, namely 1) the highly duplicated nature of DNA sequence in IGH and 2) structural polymorphisms that are frequent in the population. We demonstrate that personalized diploid references enhance performance of short-read data for characterizing mappable portions of the locus, while also showing that long-read profiling tools will ultimately be needed to fully resolve functional impacts of IGH germline variation on expressed antibody repertoires.
Biorxiv  |  2024

Structurally divergent and recurrently mutated regions of primate genomes

Yafei Mao, William T. Harvey, David Porubsky, Katherine M. Munson, Kendra Hoekzema, Alexandra P. Lewis, Peter A. Audano, Allison Rozanski, Xiangyu Yang, Shilong Zhang, David S. Gordon, Xiaoxi Wei, Glennis A. Logsdon, Marina Haukness, Philip C. Dishuck, Hyeonsoo Jeong, Ricardo del Rosario, Vanessa L. Bauer, Will T. Fattor, Gregory K. Wilkerson, Qing Lu, Benedict Paten, Guoping Feng, Sara L. Sawyer, Wesley C. Warren, Lucia Carbone, Evan E. Eichler

To better understand the pattern of primate genome structural variation, we sequenced and assembled using multiple long-read sequencing technologies the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Across 50 million years of primate evolution, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs based on analysis of these primate lineages. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (CARDs, ABCD7, OLAH) and new lineage-specific genes are generated (e.g., CKAP2, NEK5) and have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.
Biorxiv  |  2024

CTAT-LR-fusion: accurate fusion transcript identification from long and short read isoform sequencing at bulk or single cell resolution

Qian Qin, Victoria Popic, Houlin Yu, Emily White, Akanksha Khorgade, Asa Shin, Kirsty Wienand, Arthur Dondi, Niko Beerenwinkel, Francisca Vazquez, Aziz M. Al’Khafaji, Brian J. Haas

Gene fusions are found as cancer drivers in diverse adult and pediatric cancers. Accurate detection of fusion transcripts is essential in cancer clinical diagnostics, prognostics, and for guiding therapeutic development. Most currently available methods for fusion transcript detection are compatible with Illumina RNA-seq involving highly accurate short read sequences. Recent advances in long read isoform sequencing enable the detection of fusion transcripts at unprecedented resolution in bulk and single cell samples. Here we developed a new computational tool CTAT-LR-fusion to detect fusion transcripts from long read RNA-seq with or without companion short reads, with applications to bulk or single cell transcriptomes. We demonstrate that CTAT-LR-fusion exceeds fusion detection accuracy of alternative methods as benchmarked with simulated and real long read RNA-seq. Using short and long read RNA-seq, we further apply CTAT-LR-fusion to bulk transcriptomes of nine tumor cell lines, and to tumor single cells derived from a melanoma sample and three metastatic high grade serous ovarian carcinoma samples. In both bulk and in single cell RNA-seq, long isoform reads yielded higher sensitivity for fusion detection than short reads with notable exceptions. By combining short and long reads in CTAT-LR-fusion, we are able to further maximize detection of fusion splicing isoforms and fusion-expressing tumor cells. CTAT-LR-fusion is available at
Biorxiv  |  2024

Stimulated saliva has a distinct composition that influences release of volatiles from wine

Xinwei Ruan, Yipeng Chen, Aafreen Chauhan, Kate Howell

Aroma perception plays an important role in wine preference and evaluation and varies between groups of wine consumers. Saliva influences the release of aroma in the oral cavity. The composition of human saliva varies depending on stimulation; however, the compositional differences of stimulated and unstimulated saliva and their influences on aroma release have not been evaluated. In this study, we recruited healthy adults, of which 15 were Australian and 15 Chinese. Three types of saliva were collected from each participant: before, during, and after salivary stimulation. The collected salivary samples were characterised by flow rate, total protein concentration, esterase activity and microbiome composition by full-length 16S rRNA gene sequencing. The saliva samples were mixed with wine to investigate the differences in released volatiles by headspace solid-phase microextraction gas chromatography–mass spectrometry (HS-SPME-GC-MS). Differences in salivary composition and specific wine volatiles were found between Australian and Chinese participants, and amongst the three stimulation stages. Differential species were identified and significant correlations between the relative abundance of 3 bacterial species and 10 wine volatiles were observed. Our results confirm the influence of host factors and stimulation on salivary composition. Understanding the interactions of salivary components, especially salivary bacteria, on the release of aroma during wine tasting allows nuanced appreciation of the variability of flavour perception in wine consumers.
Biorxiv  |  2024

Full-length transcript sequencing traces the brain isoform diversity in house mouse natural populations

The ability to generate multiple RNA isoforms of transcripts from the same gene is a general phenomenon in eukaryotes. However, the complexity and diversity of alternative isoforms in natural populations remains largely unexplored. Using a newly developed full-length transcripts enrichment protocol, we sequenced full-length RNA transcripts of 48 individuals from outbred populations and subspecies of Mus musculus, as well as the closely-related sister species Mus spretus and Mus spicilegus as outgroups. This represents the largest full-length high-quality isoform catalog at the population level to date. In total, we reliably identify 117,728 distinct transcripts, of which only 51% were previously annotated. We show that the population-specific distribution pattern of isoforms is phylogenetically informative and reflects the segregating SNP diversity between the populations. We find that ancient house-keeping genes are the major source to the overall isoform diversity, and the recruiting of alternative first exon plays the dominant role in generating new isoforms. Given that our data allow to distinguish between population-specific isoforms and isoforms that are conserved across multiple populations, it is possible to refine the annotation of the reference mouse genome to a set of about 40,000 isoforms that should be most relevant for comparative functional analysis across species.
Biorxiv  |  2024

Adaptive diversification through structural variation in barley

Murukarthick Jayakodi, Qiongxian Lu, Hélène Pidon, M. Timothy Rabanus-Wallace, Micha Bayer, Thomas Lux, Yu Guo, Benjamin Jaegle, Ana Badea, Wubishet Bekele, Gurcharn S. Brar, Katarzyna Braune, Boyke Bunk, Kenneth J. Chalmers, Brett Chapman, Morten Egevang Jørgensen, Jia-Wu Feng, Manuel Feser, Anne Fiebig, Heidrun Gundlach, Wenbin Guo, Georg Haberer, Mats Hansson, Axel Himmelbach, Iris Hoffie, Robert E. Hoffie, Haifei Hu, Sachiko Isobe, Patrick König, Sandip M. Kale, Nadia Kamal, Gabriel Keeble-Gagnère, Beat Keller, Manuela Knauft, Ravi Koppolu, Simon G. Krattinger, Jochen Kumlehn, Peter Langridge, Chengdao Li, Marina P. Marone, Andreas Maurer, Klaus F.X. Mayer, Michael Melzer, Gary J. Muehlbauer, Emiko Murozuka, Sudharsan Padmarasu, Dragan Perovic, Klaus Pillen, Pierre A. Pin, Curtis J. Pozniak, Luke Ramsay, Pai Rosager Pedas, Twan Rutten, Shun Sakuma, Kazuhiro Sato, Danuta Schüler, Thomas Schmutzer, Uwe Scholz, Miriam Schreiber, Kenta Shirasawa, Craig Simpson, Birgitte Skadhauge, Manuel Spannagl, Brian J. Steffenson, Hanne C. Thomsen, Josquin F. Tibbits, Martin Toft Simmelsgaard Nielsen, Corinna Trautewig, Dominique Vequaud, Cynthia Voss, Penghao Wang, Robbie Waugh, Sharon Westcott, Magnus Wohlfahrt Rasmussen, Runxuan Zhang, Xiao-Qi Zhang, Thomas Wicker, Christoph Dockter, Martin Mascher, Nils Stein

Pangenomes are collections of annotated genome sequences of multiple individuals of a species. The structural variants uncovered by these datasets are a major asset to genetic analysis in crop plants. Here, we report a pangenome of barley comprising long-read sequence assemblies of 76 wild and domesticated genomes and short-read sequence data of 1,315 genotypes. An expanded catalogue of sequence variation in the crop includes structurally complex loci that have become hot spots of gene copy number variation in evolutionarily recent times. To demonstrate the utility of the pangenome, we focus on four loci involved in disease resistance, plant architecture, nutrient release, and trichome development. Novel allelic variation at a powdery mildew resistance locus and population-specific copy number gains in a regulator of vegetative branching were found. Expansion of a family of starch-cleaving enzymes in elite malting barleys was linked to shifts in enzymatic activity in micro-malting trials. Deletion of an enhancer motif is likely to change the developmental trajectory of the hairy appendages on barley grains. Our findings indicate that rapid evolution at structurally complex loci may have helped crop plants adapt to new selective regimes in agricultural ecosystems.