PacBio 2014 User Group Meeting Presentation Slides: Anne Deslattes Mays of Georgetown University discussed how PacBio provided the necessary full-length isoform information to allow characterization of isoform distribution by sub-cell population.
Complex alternative splicing patterns in hematopoietic cell subpopulations revealed by third-generation long reads.
Background: Alternative splicing expands the repertoire of gene functions and is a signature for different cell populations. Here we characterize the transcriptome of human bone marrow subpopulations including progenitor cells to understand their contribution to homeostasis and pathological conditions such as atherosclerosis and tumor metastasis. To obtain full-length transcript structures, we utilized long reads in addition to RNA-seq for estimating isoform diversity and abundance. Method: Freshly harvested, viable human bone marrow tissues were extracted from discarded harvesting equipment and separated into total bone marrow (total), lineage-negative (lin-) progenitor cells and differentiated cells (lin+) by magnetic bead sorting with antibodies to surface markers of hematopoietic cell lineages. Sequencing was done with SOLiD, Illumina HiSeq (100 bp paired-end reads), and PacBio RS II (full-length cDNA library protocol for 1–6 kb libraries). Short reads were assembled using both Trinity for de novo assembly and Cufflinks for genome-guided assembly. Full-length transcript consensus sequences were obtained for the PacBio data using the RS_IsoSeq protocol from PacBio's SMRTAnalysis software. Quantitation for each sample was done independently for each sequencing platform using Sailfish to obtain the TPM (transcripts per million) using k-mer matching. Results: PacBio's long-read sequencing technology is capable of sequencing full-length transcripts up to 10 kb and reveals heretofore-unseen isoform diversity and complexity within the hematopoietic cell populations. A comparison of sequencing depth and de novo transcript assembly with short-read, second-generation sequencing reveals that, while short reads provide precision in determining portions of isoform structure and supporting larger 5′ and 3′ UTR regions, they fail to provide a complete structure, especially when multiple isoforms are present at the same locus.
Long reads reveal an increased breadth of isoform complexity, which permits further elaboration of full isoform diversity and specific isoform abundance within each separate cell population. Sorting the distribution of major and minor isoforms reveals a cell population-specific balance focused on distinct genome loci and shows how tissue specificity and diversity are modulated by alternative splicing.
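The TPM (transcripts per million) unit used for quantitation above can be illustrated with a minimal sketch. It assumes per-transcript read counts and effective lengths are already in hand; Sailfish derives its own estimates from k-mers, so this shows only the final normalization step. The function name and the toy numbers are illustrative and do not come from the study.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to transcripts per million (TPM).

    counts  -- reads assigned to each transcript (hypothetical input)
    lengths -- effective transcript lengths in bases, all > 0 (hypothetical input)
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of entries")
    # Reads per base: removes the bias toward long transcripts.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Scale so that the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


# Toy example: three isoforms of one locus.
example = tpm([100, 200, 300], [1000, 2000, 1000])
print(example)
```

Because every sample sums to the same total, TPM values can be compared between the total, lin- and lin+ fractions more directly than raw counts.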
Full-length sequencing of HLA class I genes of more than 1000 samples provides deep insights into sequence variability
Aim: The vast majority of donor typing relies on sequencing exons 2 and 3 of HLA class I genes (HLA-A, -B, -C). With such an approach certain allele combinations do not result in the anticipated “high resolution” (G-code) typing, due to the lack of exon-phasing information. To resolve ambiguous typing results for a haplotype frequency project, we established a whole gene sequencing approach for HLA class I, facilitating also an estimation of the degree of sequence variability outside the commonly sequenced exons. Methods: Primers were developed flanking the UTR regions resulting in similar amplicon lengths of 4.2-4.4 kb. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e. annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48-72 amplicons were sequenced full length and phased in single runs on a Pacific Biosciences RSII instrument. Demultiplexing was achieved using the SMRT portal. Sequence analysis was performed using NGSengine software (GenDx). Results: We successfully performed full-length gene sequencing of 1003 samples, harboring ambiguous typings of either HLA-A (n=46), HLA-B (n=304) or HLA-C (n=653). Despite the high per-read raw error rates typical for SMRT sequencing (~15%) the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. Unambiguous allelic resolution was achieved for all samples. We observed novel intronic, exonic as well as UTR sequence variations for many of the alleles covered by our data set. This included sequences of 600 individuals with HLA-C*07:01/C*07:02 genotype revealing the extent of sequence variation outside the exons 2 and 3. 
Conclusion: Here we present a whole gene amplification and sequencing approach for HLA class I genes. The maturity of this approach was demonstrated by sequencing more than 1000 samples, achieving fully phased allelic sequences. Extensive sequencing of one common allele combination hints at the yet-to-be-discovered diversity of the HLA system outside the commonly analyzed exons.
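Why a consensus can be reliable despite a raw per-read error rate of about 15% can be sketched with a simple binomial model. The model is an assumption made here for illustration, not the method used by the SMRT portal or NGSengine: each read is treated as independently wrong at a given position with probability 0.15, and the consensus is counted as wrong only when more than half of the reads are wrong. Real errors are dominated by insertions and deletions and are not spread this evenly.

```python
from math import comb


def majority_error_probability(n_reads, per_read_error=0.15):
    """Probability that more than half of n_reads are wrong at one position.

    Assumes independent errors with a fixed per-read error rate
    (a simplification; ties for even n_reads are not counted as errors).
    """
    p = per_read_error
    threshold = n_reads // 2 + 1  # smallest strict majority
    return sum(
        comb(n_reads, k) * p**k * (1 - p) ** (n_reads - k)
        for k in range(threshold, n_reads + 1)
    )


# The error probability of the majority call falls quickly with coverage.
for n in (1, 3, 11, 31):
    print(n, majority_error_probability(n))
```

Under these assumptions the chance of a wrong majority call drops steeply as coverage grows, which is consistent with the reported agreement between the consensus sequences and the MiSeq-derived exon sequences.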
Aim: In contrast to exon-based HLA-typing approaches, whole gene genotyping crucially depends on full-length sequences submitted to the IMGT/HLA Database. Currently, full-length sequences are provided for only 7 out of 520 HLA-DPB1 alleles. Therefore, we developed a fully phased whole-gene sequencing approach for DPB1, to facilitate further exploration of the allelic structure at this locus. Methods: Primers were developed flanking the UTR-regions of DPB1 resulting in a 12 kb amplicon. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene-specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e. annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48 amplicons were sequenced full length in single runs on a Pacific Biosciences RSII instrument. Demultiplexing was performed using the SMRT portal. Sequence analysis was performed using the NGSengine software (GenDx). Results: We analyzed a set of 48 randomly picked samples. With 3 exceptions due to PCR failure, all genotype assignments conformed to standard genotyping results based on exons 2 and 3. Allelic proportions for heterozygous positions were evenly distributed (range 0.4 – 0.6) for all samples, suggesting unbiased amplifications. Despite the high per-read raw error rates typical for SMRT sequencing (~15%) the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. We describe novel intronic sequence variation of the 7 so far genomically defined alleles, as well as 7 whole-length DPB1 alleles with hitherto unknown intronic regions. One of these alleles (HLA-DPB1*131:01) is classified as rare. 
Conclusion: Here we present a whole gene amplification and sequencing workflow for DPB1 alleles utilizing single molecule real-time (SMRT) sequencing from Pacific Biosciences. Validation of consensus sequences against known exonic sequences highlights the reliability of this technology. This workflow will facilitate amending the IMGT/HLA Database for DPB1.
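The allelic-proportion check reported in the results (0.4 to 0.6 at heterozygous positions) can be sketched as follows. The abstract does not describe how the proportions were computed, so this is an assumed approach: count the bases observed at one position across reads and test whether the two most common alleles both fall inside the window. The function names and read data are hypothetical.

```python
from collections import Counter


def allele_proportions(bases):
    """Return the fraction of reads supporting each base at one position."""
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def looks_unbiased(bases, low=0.4, high=0.6):
    """True if the two most common alleles both lie within [low, high].

    The 0.4-0.6 window is the range reported in the abstract; using it
    as a pass/fail threshold is an assumption of this sketch.
    """
    top_two = Counter(bases).most_common(2)
    if len(top_two) < 2:
        return False  # only one allele seen: not a heterozygous position
    total = len(bases)
    return all(low <= n / total <= high for _, n in top_two)


balanced = "A" * 52 + "G" * 48   # near 50/50, as expected without bias
skewed = "A" * 80 + "G" * 20     # one allele preferentially amplified
print(allele_proportions(balanced), looks_unbiased(balanced))
print(allele_proportions(skewed), looks_unbiased(skewed))
```

A strongly skewed ratio at a heterozygous position would point to preferential amplification of one allele, which the reported range argues against.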
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LD). This level of complexity and the inherent biases of short-read sequencing make it challenging to extract immune region haplotype information from reference-reliant shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software that extracts and/or imputes the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of the genes contained within them. All the genotype information is detected at allele level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
The major histocompatibility complex (MHC), or human leukocyte antigen (HLA) in humans, is a highly diverse gene family with a key role in immune response to disease, and has been implicated in auto-immune disease, cancer, infectious disease susceptibility, and vaccine response. It has clinical importance in the field of solid organ and bone marrow transplantation, where donor and recipient matching of HLA types is key to transplanted organ outcomes. The Sanger-based typing (SBT) methods currently used in clinical practice do not capture the full diversity across this region, and require specific reference sequences to deconvolute ambiguity in HLA types. However, reference databases are based largely on European populations, and the full extent of diversity in Africa remains poorly understood. Here, we present the first systematic characterisation of HLA diversity within Africa in the pilot phase of the MHC Diversity in Africa Project, together with an evaluation of methods to carry out scalable, cost-effective and reliable typing of this region in African populations. To sample a geographically representative panel of African populations we obtained 125 samples, 25 each from the Zulu (South Africa), Igbo (Nigeria), Kalenjin (Kenya), Moroccan and Ashanti (Ghana) groups. For methods validation we included two controls from the International Histocompatibility Working Group (IHWG) collection with known typing information. Sanger typing and Illumina HiSeq X sequencing of these samples indicated potentially novel Class I and Class II alleles; however, we found poor correlation between HiSeq X sequencing and SBT for both classes. Long Range PCR and high resolution PacBio RS-II typing of 4 of these samples identified 7 novel Class II alleles, highlighting the high levels of diversity in these populations, and the need for long read sequencing approaches to characterise this comprehensively.
We have now expanded this approach to the entire pilot set of 125 samples. We present these confirmed types and discuss a workflow for scaling this to 5000 individuals across Africa. The large number of new alleles identified in our pilot suggests the high level of African HLA diversity and the utility of high resolution methods. The MDAP project will provide a framework for accurate HLA typing, in addition to providing an invaluable resource for imputation in GWAS, boosting power to identify and resolve HLA disease associations.
ASHG Virtual Poster: The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
In this ASHG 2016 poster video, Martin Pollard from the Wellcome Trust Sanger Institute and the University of Cambridge describes an ambitious project to better represent natural variation in the…
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DQ, and -DP play a critical role in the immune system as primary factors responsible for…
Pacific Biosciences’ SMRT sequencing method was used to extend the sequence of HLA-A*02:13. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Diversification and Evolution of Vancomycin-Resistant Enterococcus faecium during Intestinal Domination.
Vancomycin-resistant Enterococcus faecium (VRE) is a leading cause of hospital-acquired infections. This is particularly true in immunocompromised patients, where the damage to the microbiota caused by antibiotics can lead to VRE domination of the intestine, increasing a patient’s risk for bloodstream infection. In previous studies we observed that the intestinal domination by VRE of patients hospitalized to receive allogeneic bone marrow transplantation can persist for weeks, but little is known about subspecies diversification and evolution during prolonged domination. Here we combined a longitudinal analysis of patient data and in vivo experiments to reveal previously unappreciated subspecies dynamics during VRE domination that appeared to be stable from 16S rRNA microbiota analyses. Whole-genome sequencing of isolates obtained from sequential stool samples provided by VRE-dominated patients revealed an unanticipated level of VRE population complexity that evolved over time. In experiments with ampicillin-treated mice colonized with a single CFU, VRE rapidly diversified and expanded into distinct lineages that competed for dominance. Mathematical modeling shows that in vivo evolution follows mostly a parabolic fitness landscape, where each new mutation provides diminishing returns and, in the setting of continuous ampicillin treatment, reveals a fitness advantage for mutations in penicillin-binding protein 5 (pbp5) that increase resistance to ampicillin. Our results reveal the rapid diversification of host-colonizing VRE populations, with implications for epidemiologic tracking of in-hospital VRE transmission and susceptibility to antibiotic treatment. Copyright © 2019 Dubin et al.
Single-Molecule Real-Time (SMRT) Full-Length RNA-Sequencing Reveals Novel and Distinct mRNA Isoforms in Human Bone Marrow Cell Subpopulations.
Hematopoietic cells are continuously replenished from progenitor cells that reside in the bone marrow. To evaluate molecular changes during this process, we analyzed the transcriptomes of freshly harvested human bone marrow progenitor (lineage-negative) and differentiated (lineage-positive) cells by single-molecule real-time (SMRT) full-length RNA-sequencing. This analysis revealed a ~5-fold higher number of transcript isoforms than previously detected and showed a distinct composition of individual transcript isoforms characteristic for bone marrow subpopulations. A detailed analysis of messenger RNA (mRNA) isoforms transcribed from the ANXA1 and EEF1A1 loci confirmed their distinct composition. The expression of proteins predicted from the transcriptome analysis was evaluated by mass spectrometry and validated previously unknown protein isoforms predicted e.g., for EEF1A1. These protein isoforms distinguished the lineage negative cell population from the lineage positive cell population. Finally, transcript isoforms expressed from paralogous gene loci (e.g., CFD, GATA2, HLA-A, B, and C) also distinguished cell subpopulations but were only detectable by full-length RNA sequencing. Thus, qualitatively distinct transcript isoforms from individual genomic loci separate bone marrow cell subpopulations indicating complex transcriptional regulation and protein isoform generation during hematopoiesis.
Genome-Wide Screening for Enteric Colonization Factors in Carbapenem-Resistant ST258 Klebsiella pneumoniae.
A diverse, antibiotic-naive microbiota prevents highly antibiotic-resistant microbes, including carbapenem-resistant Klebsiella pneumoniae (CR-Kp), from achieving dense colonization of the intestinal lumen. Antibiotic-mediated destruction of the microbiota leads to expansion of CR-Kp in the gut, markedly increasing the risk of bacteremia in vulnerable patients. While preventing dense colonization represents a rational approach to reduce intra- and interpatient dissemination of CR-Kp, little is known about pathogen-associated factors that enable dense growth and persistence in the intestinal lumen. To identify genetic factors essential for dense colonization of the gut by CR-Kp, we constructed a highly saturated transposon mutant library with >150,000 unique mutations in an ST258 strain of CR-Kp and screened for in vitro growth and in vivo intestinal colonization in antibiotic-treated mice. Stochastic and partially reversible fluctuations in the representation of different mutations during dense colonization revealed the dynamic nature of intestinal microbial populations. We identified genes that are crucial for early and late stages of dense gut colonization and confirmed their role by testing isogenic mutants in in vivo competition assays with wild-type CR-Kp. Screening of the transposon library also identified mutations that enhanced in vivo CR-Kp growth. These newly identified colonization factors may provide novel therapeutic opportunities to reduce intestinal colonization by CR-Kp. IMPORTANCE: Klebsiella pneumoniae is a common cause of bloodstream infections in immunocompromised and hospitalized patients, and over the last 2 decades, some strains have acquired resistance to nearly all available antibiotics, including broad-spectrum carbapenems. The U.S. Centers for Disease Control and Prevention has listed carbapenem-resistant K. pneumoniae (CR-Kp) as an urgent public health threat.
Dense colonization of the intestine by CR-Kp and other antibiotic-resistant bacteria is associated with an increased risk of bacteremia. Reducing the density of gut colonization by CR-Kp is likely to reduce their transmission from patient to patient in health care facilities as well as systemic infections. How CR-Kp expands and persists in the gut lumen, however, is poorly understood. Herein, we generated a highly saturated mutant library in a multidrug-resistant K. pneumoniae strain and identified genetic factors that are associated with dense gut colonization by K. pneumoniae. This study sheds light on host colonization by K. pneumoniae and identifies potential colonization factors that contribute to high-density persistence of K. pneumoniae in the intestine. Copyright © 2019 Jung et al.
The antibody repertoire of Bos taurus is characterized by a subset of variable heavy (VH) chain regions with ultralong third complementarity determining regions (CDR3) which, compared to other species, can provide a potent response to challenging antigens like HIV env. These unusual CDR3 can range to over seventy highly diverse amino acids in length and form unique β-ribbon ‘stalk’ and disulfide bonded ‘knob’ structures, far from the typical antigen binding site. The genetic components and processes for forming these unusual cattle antibody VH CDR3 are not well understood. Here we analyze sequences of Bos taurus antibody VH domains and find that the subset with ultralong CDR3 exclusively uses a single variable gene, IGHV1-7 (VHBUL) rearranged to the longest diversity gene, IGHD8-2. An eight nucleotide duplication at the 3′ end of IGHV1-7 encodes a longer V-region producing an extended F β-strand that contributes to the stalk in a rearranged CDR3. A low amino acid variability was observed in CDR1 and CDR2, suggesting that antigen binding for this subset most likely only depends on the CDR3. Importantly, a novel, potentially AID-mediated, deletional diversification mechanism of the B. taurus VH ultralong CDR3 knob was discovered, in which interior codons of the IGHD8-2 region are removed while maintaining integral structural components of the knob and descending strand of the stalk in place. These deletions serve to further diversify cysteine positions, and thus disulfide bonded loops. Hence, both germline and somatic genetic factors and processes appear to be involved in diversification of this structurally unusual cattle VH ultralong CDR3 repertoire.
Next generation sequencing characterizes HLA diversity in a registry population from the Netherlands.
Next generation DNA sequencing is used to determine the HLA-A, -B, -C, -DRB1, -DRB3/4/5, and -DQB1 assignments of 1009 unrelated volunteers for the unrelated donor registry in The Netherlands. The analysis characterizes all HLA exons and introns for class I alleles; at least exons 2 to 3 for HLA-DRB1; and exons 2 to 6 for HLA-DQB1. Of the distinct alleles present, there are 229 class I and 71 class II; 36 of these alleles are novel. The majority (approximately 98%) of the cumulative allele frequency at each locus is contributed by alleles that appear three or more times. Alleles encoding protein variation outside of the antigen recognition domains are 0.6% of the class I assignments and 5.3% of the class II assignments. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
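The statement that alleles seen three or more times account for about 98% of the cumulative allele frequency can be made concrete with a short sketch. The allele list below is made-up toy data for one locus, not registry data, and the function name is hypothetical; the calculation is simply the share of all assignments that belong to alleles observed at least `min_count` times.

```python
from collections import Counter


def recurrent_allele_share(assignments, min_count=3):
    """Fraction of all allele assignments contributed by alleles that
    were observed at least min_count times at a locus."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    recurrent = sum(n for n in counts.values() if n >= min_count)
    return recurrent / total


# Toy locus: two recurrent alleles plus one allele seen only twice.
toy_locus = ["A*01:01"] * 5 + ["A*02:01"] * 3 + ["A*03:01"] * 2
share = recurrent_allele_share(toy_locus)
print(share)
```

A high share means that rare alleles, although numerous as distinct sequences, account for only a small part of the typed chromosomes.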
Our understanding of sequence variation in the HLA-DPB1 gene is largely restricted to the hypervariable antigen recognition domain (ARD) encoded by exon 2. Here, we employed a redundant sequencing strategy combining long-read and short-read data to accurately phase and characterise in full length the majority of common and well-documented (CWD) DPB1 alleles as well as alleles with an observed frequency of at least 0.0006% in our predominantly European sample set. We generated 664 DPB1 sequences, comprising 279 distinct allelic variants. This allows us to present the most comprehensive analysis to date of the nature and extent of DPB1 sequence variation. The full-length sequence analysis revealed the existence of two highly diverged allele clades. These clades correlate with the rs9277534 A→G variant, a known expression marker located in the 3′-UTR. The two clades are fully differentiated by 174 fixed polymorphisms throughout a 3.6 kb stretch at the 3′-end of DPB1. The region upstream of this differentiation zone is characterised by increasingly shared variation between the clades. The low-expression A clade comprises 59% of the distinct allelic sequences including the three by far most frequent DPB1 alleles, DPB1*04:01, DPB1*02:01 and DPB1*04:02. Alleles in the A clade show reduced nucleotide diversity with an excess of rare variants when compared to the high-expression G clade. This pattern is consistent with a scenario of recent proliferation of A-clade alleles. The full-length characterisation of all but the most rare DPB1 alleles will benefit the application of NGS for DPB1 genotyping and provides a helpful framework for a deeper understanding of high- and low-expression alleles and their implications in the context of unrelated haematopoietic stem-cell transplantation. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
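The nucleotide diversity compared between the two clades is conventionally the average number of pairwise differences per site. A minimal sketch of that calculation follows, assuming the sequences are already aligned to equal length and that each distinct sequence is weighted equally; the abstract does not state how the authors computed it. The toy sequences and the function name are illustrative.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site across aligned sequences.

    Assumes equal-length, pre-aligned sequences; gap characters are
    compared like any other symbol (a simplification).
    """
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(aligned_seqs, 2))
    # Count mismatching columns for every pair of sequences.
    differences = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in pairs
    )
    return differences / (len(pairs) * length)


toy_clade = ["ACGT", "ACGA", "ACGT"]
pi = nucleotide_diversity(toy_clade)
print(pi)
```

A lower value within a clade, combined with an excess of rare variants, is the kind of pattern the abstract interprets as recent proliferation of A-clade alleles.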