The Human Leukocyte Antigen (HLA) genes, located on chromosome 6, regulate immune function via antigen presentation and are one of the determining factors for stem cell and organ transplantation compatibility. Additionally, various alleles within this region have been implicated in autoimmune disorders, cancer, vaccine response, and both non-infectious and infectious disease risk. The HLA region is highly variable, contains repetitive regions, and its genes are co-dominantly expressed. This complicates short-read mapping and means that assessing the effect of variation within a gene requires full phase information to resolve haplotypes. One solution to the problem of HLA identification is the use of statistical inference to suggest the most likely diploid alleles given the genotypes observed. This approach assumes the availability of an extensive reference panel. Whilst there exists good population genetics data for imputing European populations, there remains a paucity of information about variation in African populations. Filling this gap is one of the aims of the Genome Diversity in Africa Project, and as a first step we are performing a pilot study to identify the optimal method for determining HLA type information for large numbers of samples from African populations. To that end we have obtained samples from 125 consented African participants selected from 5 populations across Africa (Moroccan, Ashanti, Igbo, Kalenjin, and Zulu). The methods included in our pilot study are Sanger sequencing (ABI); NGS on the HiSeq X Ten platform (Illumina); long-range PCR combined with single molecule real-time (SMRT) sequencing (PacBio); and, for a subset of samples, library preparation on the GemCode Platform (10x Genomics), which delivers valuable long-range contextual information, combined with Illumina NGS sequencing. Results from capillary sequencing suggest the presence of a minimum of two novel alleles.
Long-range PCR has been performed initially on a subset of samples using both primers sourced from GenDX and primers designed as described in Shiina et al. (2012). Initial results from both primer sets were promising on Promega DNA test samples, but only the GenDX primers proved effective on the African samples, consistently producing PCR products of the expected size in the Igbo, Ashanti, Moroccan and Zulu samples. We will present early results from our evaluation of the different sequencing technologies.
Over the last few years, several advances have been implemented in the PacBio RS II System to maximize throughput and efficiency while reducing the cost per sample. The number of usable bases per SMRT Cell now exceeds 1 Gb with the latest P6-C4 chemistry and 6-hour movies. For applications such as microbial sequencing, targeted sequencing, Iso-Seq (full-length isoform sequencing) and NimbleGen’s target enrichment method, current SMRT Cell yields can exceed project requirements. Barcoding is therefore a viable option for multiplexing samples. For microbial sequencing, multiplexing can be accomplished by tagging sheared genomic DNA during library construction with modified SMRTbell adapters. We studied the performance of 2- to 8-plex microbial sequencing. For full-length amplicon sequencing such as HLA typing, amplicons as large as 5 kb may be barcoded during amplification using barcoded locus-specific primers. Alternatively, amplicons may be barcoded during SMRTbell library construction using barcoded SMRTbell adapters. The preferred barcoding strategy depends on the user’s existing workflow and on how readily that workflow can be changed or updated. Using barcoded adapters, five Class I and II genes (3.3 – 5.8 kb) x 96 patients can be multiplexed and typed. For Iso-Seq full-length cDNA sequencing, barcodes are incorporated during first-strand synthesis by tailing the oligo-dT primer with any of the PacBio-published 16-bp barcode sequences. RNA samples from 6 maize tissues were multiplexed to generate barcoded cDNA libraries. The NimbleGen SeqCap Target Enrichment method, combined with PacBio’s long-read sequencing, provides a comprehensive view of multi-kilobase contiguous regions, both exonic and intronic. To make this cost effective, we recommend barcoding samples for pooling prior to target enrichment and capture.
Here, we present specific examples of strategies and best practices for multiplexing samples across different SMRT Sequencing applications. Additionally, we describe recommendations for analyzing barcoded samples.
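As a rough illustration of what analyzing barcoded samples involves, the sketch below assigns pooled reads to samples by comparing the first 16 bases of each read with a set of 16-bp barcodes. It is a minimal sketch under simplifying assumptions, not PacBio's demultiplexing software: the barcode sequences and sample names are hypothetical, the barcode is assumed to sit at the start of the read, and only substitutions (not insertions or deletions) are tolerated.

```python
from collections import defaultdict

# Hypothetical 16-bp barcodes for illustration only; these are NOT
# PacBio's published barcode sequences.
BARCODES = {
    "sample_A": "ACGTACGTACGTACGT",
    "sample_B": "TTGGCCAATTGGCCAA",
}
BARCODE_LEN = 16


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes=BARCODES, max_mismatches=1):
    """Bin reads by the barcode that best matches their first 16 bases.

    A read is assigned only if its best barcode is within max_mismatches
    and is not tied with another barcode; the barcode is then trimmed off.
    Everything else is kept untrimmed under 'unassigned'.
    """
    bins = defaultdict(list)
    for read in reads:
        prefix = read[:BARCODE_LEN]
        if len(prefix) < BARCODE_LEN:
            bins["unassigned"].append(read)
            continue
        # Sort barcodes by distance to the read prefix, closest first.
        scored = sorted((hamming(prefix, bc), name) for name, bc in barcodes.items())
        best_dist, best_name = scored[0]
        tied = len(scored) > 1 and scored[1][0] == best_dist
        if best_dist <= max_mismatches and not tied:
            bins[best_name].append(read[BARCODE_LEN:])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


pooled = [
    "ACGTACGTACGTACGT" + "GATTACA",   # exact match to sample_A
    "TTGGCCAATTGGCCAT" + "CCCC",      # one mismatch to sample_B
    "GGGGGGGGGGGGGGGG" + "AAAA",      # no close barcode
]
result = demultiplex(pooled)
```

A production workflow would also need to handle barcodes on both ends of an insert and alignment-based matching that tolerates indels; the fixed-prefix Hamming comparison here is only meant to show the binning logic.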
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LD). This level of complexity, together with the inherent biases of short-read sequencing, makes it challenging to extract immune region haplotype information from reference-reliant shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software to extract and/or impute immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of the genes within them. All the genotype information is detected at allele level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, which is key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
The increased sequencing throughput creates a need for multiplexing in several applications. Here we detail different barcoding strategies for microbial sequencing, targeted sequencing, Iso-Seq full-length isoform sequencing, and Roche NimbleGen’s target enrichment method.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enable high-quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
Whole gene sequencing of KIR-3DL1 with SMRT Sequencing and the distribution of allelic variants in different ethnic groups
The killer-cell immunoglobulin-like receptor (KIR) gene family is involved in immune modulation during viral infection, autoimmune disease and allogeneic stem cell transplantation. Most studies of KIR gene diversity and its impact on transplant outcome are performed using gene presence/absence assays. However, it is well known that KIR gene allelic variations have biological significance. Allele-level typing of KIR genes has been very challenging until recently due to the homologous nature of those genes and their very long intronic sequences. SMRT (Single Molecule Real-Time) Sequencing generates long reads averaging 10 to 15 kb and allows us to obtain in-phase long sequence reads. We have developed a PCR assay for SMRT Sequencing on the PacBio RS II platform in our lab for 3DL1 whole gene sequencing. This approach allows us to obtain allele-level typing for 3DL1 genes and could serve as a model to type other KIR genes at the allelic level.
Multiplex target enrichment using barcoded multi-kilobase fragments and probe-based capture technologies
Target enrichment capture methods allow scientists to rapidly interrogate important genomic regions of interest for variant discovery, including SNPs, gene isoforms, and structural variation. Custom targeted sequencing panels are important for characterizing heterogeneous, complex diseases and uncovering the genetic basis of inherited traits with more uniform coverage when compared to PCR-based strategies. With the increasing availability of high-quality reference genomes, customized gene panels are readily designed with high specificity to capture genomic regions of interest, thus enabling scientists to expand their research scope from a single individual to larger cohort studies or population-wide investigations. Coupled with PacBio® long-read sequencing, these technologies can capture 5 kb fragments of genomic DNA (gDNA), which are useful for interrogating intronic, exonic, and regulatory regions, characterizing complex structural variations, distinguishing between gene duplications and pseudogenes, and interpreting variant haplotypes. In addition, SMRT® Sequencing offers the lowest GC-bias and can sequence through repetitive regions. We demonstrate the additional insights possible by using in-depth long-read capture sequencing for key immunology, drug-metabolizing, and disease-causing genes such as HLA, filaggrin, and cancer-associated genes.
Collection of major HLA allele sequences in the Japanese population toward precise NGS-based HLA DNA typing at the field 4 level
We previously reported on the use of the Ion PGM next generation sequencing (NGS) platform to genotype HLA class I and class II genes by a super-high resolution, single-molecule, sequence-based typing (SS-SBT) method (Shiina et al. 2012). However, HLA alleles could not be assigned at the field 4 level at some HLA loci such as DQA1, DPA1 and DPB1 because the SNP and indel densities were too low to identify and separate both of the phases. We have therefore added the single molecule, real-time (SMRT) DNA sequencer PacBio RS II to our analysis in order to test whether it can determine the HLA allele sequences at the loci with which we previously had difficulties. In this study, we report on sequence-based genotyping of entire HLA gene sequences, from the promoter-enhancer region to the 3’UTR, of the major HLA loci (A, B, C, DRB1, DRB345, DQA1, DQB1, DPA1 and DPB1) using the PacBio RS II and Ion PGM systems and 46 Japanese reference subjects who represented a distribution of more than 99.5% of the HLA alleles at each of the HLA loci.
Analysis of 37,000 Caucasian samples reveals tight linkage between SNP rs9277534 and high resolution typing of HLA-DPB1
HLA-DPB1 mismatching between patients and unrelated donors is known to increase the risk of acute graft-versus-host disease (GvHD) after hematopoietic stem cell transplantation. If only HLA-DPB1 mismatched donors are available, the genotype defined by the single nucleotide polymorphism (SNP) rs9277534 can be used to select mismatched donors that are well tolerated. However, since rs9277534 resides within the 3’ untranslated region (UTR), it is usually not analyzed during routine DPB1 typing.
The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
The major histocompatibility complex (MHC), or human leukocyte antigen (HLA) in humans, is a highly diverse gene family with a key role in immune response to disease, and has been implicated in auto-immune disease, cancer, infectious disease susceptibility, and vaccine response. It has clinical importance in the field of solid organ and bone marrow transplantation, where donor and recipient matching of HLA types is key to transplanted organ outcomes. The Sanger based typing (SBT) methods currently used in clinical practice do not capture the full diversity across this region, and require specific reference sequences to deconvolute ambiguity in HLA types. However, reference databases are based largely on European populations, and the full extent of diversity in Africa remains poorly understood. Here, we present the first systematic characterisation of HLA diversity within Africa in the pilot phase of the MHC Diversity in Africa Project, together with an evaluation of methods to carry out scalable, cost-effective and reliable typing of this region in African populations. To sample a geographically representative panel of African populations we obtained 125 samples, 25 each from the Zulu (South Africa), Igbo (Nigeria), Kalenjin (Kenya), Moroccan and Ashanti (Ghana) groups. For methods validation we included two controls from the International Histocompatibility Working Group (IHWG) collection with known typing information. Sanger typing and Illumina HiSeq X sequencing of these samples indicated potentially novel Class I and Class II alleles; however, we found poor correlation between HiSeq X sequencing and SBT for both classes. Long-range PCR and high resolution PacBio RS II typing of 4 of these samples identified 7 novel Class II alleles, highlighting the high levels of diversity in these populations, and the need for long-read sequencing approaches to characterise this comprehensively.
We have now expanded this approach to the entire pilot set of 125 samples. We present these confirmed types and discuss a workflow for scaling this to 5000 individuals across Africa. The large number of new alleles identified in our pilot points to a high level of African HLA diversity and to the utility of high resolution methods. The MDAP project will provide a framework for accurate HLA typing, in addition to providing an invaluable resource for imputation in GWAS, boosting power to identify and resolve HLA disease associations.
Targeted sequencing employing PCR amplification is a fundamental approach to studying human genetic disease. PacBio’s Sequel System and supporting products provide an end-to-end solution for amplicon sequencing, offering better performance than Sanger technology in accuracy, read length, throughput, and breadth of informative data. Sample multiplexing is supported with three barcoding options, providing the flexibility to incorporate unique sample identifiers during target amplification or library preparation. Multiplexing is key to realizing the full capacity of the 1 million individual reactions per Sequel SMRT Cell. Two analysis workflows that can generate high-accuracy results support a wide range of amplicon sizes in two ranges, from 250 bp to 3 kb and from 3 kb to >10 kb. The Circular Consensus Sequencing workflow results in high accuracy through intra-molecular consensus generation, while high accuracy for the Long Amplicon Analysis workflow is achieved by clustering of individual long reads from multiple reactions. Here we present workflows and results for single-molecule sequencing of amplicons for human genetic analysis.
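The idea behind intra-molecular consensus generation can be illustrated with a deliberately simplified majority vote across repeated passes of the same molecule. This is a toy sketch, not the Circular Consensus Sequencing algorithm: it assumes the passes are already aligned to equal length with substitution errors only, and the function name and example sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the per-position majority base across pre-aligned passes.

    Assumes every pass has the same length (already aligned, no indels).
    Ties are broken in favour of the base seen first at that position.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")

    consensus = []
    for column in zip(*passes):
        # most_common(1) gives the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over one molecule, four carrying a single random error.
passes = [
    "ACGTTGCA",
    "ACGATGCA",
    "ACGTTGCA",
    "ACCTTGCA",
    "ACGTTGGA",
]
consensus = majority_consensus(passes)
```

Because the errors fall at different positions in different passes, each one is outvoted, which is why random (rather than systematic) error lends itself to consensus correction.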
Targeted sequencing with Sanger as well as short-read-based high throughput sequencing methods is standard practice in clinical genetic testing. However, many applications beyond SNP detection have remained somewhat obstructed due to technological challenges. With the advent of long reads and high consensus accuracy, SMRT Sequencing overcomes many of the technical hurdles faced by Sanger and NGS approaches, opening a broad range of untapped clinical sequencing opportunities. Flexible multiplexing options, a highly adaptable sample preparation method and two well-developed, newly improved analysis methods that generate highly accurate sequencing results make SMRT Sequencing an adept method for clinical grade targeted sequencing. The Circular Consensus Sequencing (CCS) analysis pipeline produces QV 30 data from each single intra-molecular multi-pass polymerase read, making it a reliable solution for detecting minor variant alleles with frequencies as low as 1%. Long Amplicon Analysis (LAA) makes use of insert-spanning full-length subreads originating from multiple individual copies of the target to generate highly accurate and phased consensus sequences (>QV50), offering a unique advantage for imputation-free allele segregation and haplotype phasing. Here we present workflows and results for a range of SMRT Sequencing clinical applications. Specifically, we illustrate how the flexible multiplexing options, simple sample preparation methods and new developments in data analysis tools offered by PacBio in support of Sequel System 5.1 can come together in a variety of experimental designs to enable applications as diverse as high throughput HLA typing, mitochondrial DNA sequencing and viral vector integrity profiling of recombinant adeno-associated viral genomes (rAAV).
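To put the quoted quality values in context, the short sketch below converts between a quality value and a per-base error probability. It assumes the QV figures above follow the standard Phred convention, QV = -10 log10(p); the function names are illustrative only.

```python
import math


def qv_to_error_probability(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


ccs_error = qv_to_error_probability(30)   # QV 30
laa_error = qv_to_error_probability(50)   # QV 50
```

Under that assumption, QV 30 corresponds to roughly one expected error per 1,000 bases and QV 50 to roughly one per 100,000 bases, so the QV 30 error rate (0.1%) sits an order of magnitude below the 1% minor allele frequency mentioned above.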
In this BioConference Live webinar, PacBio CSO Jonas Korlach highlights how multi-kilobase reads from SMRT Sequencing can resolve many of the previously considered ‘difficult-to-sequence’ genomic regions. The long reads also…
Ulf Gyllensten speaks about advances in screening for HPV, his predictions for the widespread use of genome sequencing in the clinic, and applications using Single Molecule, Real-Time (SMRT) Sequencing for…
One of the popular questions on the Mendelspod program is how those doing sequencing decide between the quality of PacBio’s long reads and the cheaper short read technology, such as…