PacBio 2013 User Group Meeting Presentation Slides: Lisbeth Guethlein from Stanford University School of Medicine looked at highly repetitive and variable immune regions of the orangutan genome. Guethlein reported that “PacBio managed to accomplish in a week what I have been working on for a couple years” (with Sanger sequencing), and the results were concordant. “Long story short, I was a happy customer.”
Allele-level sequencing and phasing of full-length HLA class I and II genes using SMRT Sequencing technology
The three classes of genes that comprise the MHC gene family are actively involved in determining donor-recipient compatibility for organ transplant, as well as susceptibility to autoimmune diseases via cross-reacting immunization. Specifically, Class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DQ and -DP are considered medically important for genetic analysis to determine histocompatibility. They are highly polymorphic and have thousands of alleles implicated in disease resistance and susceptibility. The importance of full-length HLA gene sequencing for genotyping, detection of null alleles, and phasing is now widely acknowledged. While DNA-sequencing-based HLA genotyping has become routine, only 7% of the HLA genes have been characterized by allele-level sequencing, while 93% are still defined by partial sequences. The gold-standard Sanger sequencing technology is being quickly replaced by second-generation, high- throughput sequencing methods due to its inability to generate unambiguous phased reads from heterozygous alleles. However, although these short, high-throughput, clonal sequencing methods are better at heterozygous allele detection, they are inadequate at generating full-length haploid gene sequences. Thus, full-length gene sequencing from an enhancer-promoter region to a 3’UTR that includes phasing information without the need for imputation still remains a technological challenge. The best way to overcome these challenges is to sequence these genes with a technology that is clonal in nature and has the longest possible read lengths. We have employed Single Molecule Real-Time (SMRT) sequencing technology from Pacific Biosciences for sequencing full-length HLA class I and II genes.
HLA sequencing using SMRT Technology – High resolution and high throughput HLA genotyping in a clinical setting
Sequence based typing (SBT) is considered the gold standard method for HLA typing. Current SBT methods are rather laborious and are prone to phase ambiguity problems and genotyping uncertainties. As a result, the NGS community is rapidly seeking to remedy these challenges, to produce high resolution and high throughput HLA sequencing conducive to a clinical setting. Today, second generation NGS technologies are limited in their ability to yield full length HLA sequences required for adequate phasing and identification of novel alleles. Here we present the use of single molecule real time (SMRT) sequencing as a means of determining full length/long HLA sequences. Moreover we reveal the scalability of this method through multiplexing approches and determine HLA genotyping calls through the use of third party Gendx NGSengine® software.
Long Amplicon Analysis: Highly accurate, full-length, phased, allele-resolved gene sequences from multiplexed SMRT Sequencing data.
The correct phasing of genetic variations is a key challenge for many applications of DNA sequencing. Allele-level resolution is strongly preferred for histocompatibility sequencing where recombined genes can exhibit different compatibilities than their parents. In other contexts, gene complementation can provide protection if deleterious mutations are found on only one allele of a gene. These problems are especially pronounced in immunological domains given the high levels of genetic diversity and recombination seen in regions like the Major Histocompatibility Complex. A new tool for analyzing Single Molecule, Real-Time (SMRT) Sequencing data – Long Amplicon Analysis (LAA) – can generate highly accurate, phased and full-length consensus sequences for multiple genes in a single sequencing run.
Unique haplotype structure determination in human genome using Single Molecule, Real-Time (SMRT) Sequencing of targeted full-length fosmids.
Determination of unique individual haplotypes is an essential first step toward understanding how identical genotypes having different phases lead to different biological interpretations of function, phenotype, and disease. Genome-wide methods for identifying individual genetic variation have been limited in their ability to acquire phased, extended, and complete genomic sequences that are long enough to assemble haplotypes with high confidence. We explore a recombineering approach for isolation and sequencing of a tiling of targeted fosmids to capture interesting regions from human genome. Each individual fosmid contains large genomic fragments (~35?kb) that are sequenced with long-read SMRT technology to generate contiguous long reads. These long reads can be easily de novo assembled for targeted haplotype resolution within an individual’s genomes. The P5-C3 chemistry for SMRT Sequencing generated contiguous, full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with minimal data processing. The phase preserved in fosmid clones spanned at least two heterozygous variant loci, providing the essential detail of precise haplotype structures. We show complete assembly of haplotypes for various targeted loci, including the complex haplotypes of the KIR locus (~150 to 200 kb) and conserved extended haplotypes (CEHs) of the MHC region. This method is easily applicable to other regions of the human genome, as well as other genomes.
Fully phased allele-level sequencing of highly polymorphic HLA genes is greatly facilitated by SMRT Sequencing technology. In the present work, we have evaluated multiple DNA barcoding strategies for multiplexing several loci from multiple individuals, using three different tagging methods. Specifically MHC class I genes HLA-A, -B, and –C were indexed via DNA Barcodes by either tailed primers or barcoded SMRTbell adapters. Eight different 16-bp barcode sequences were used in symmetric & asymmetric pairing. Eight DNA barcoded adapters in symmetric pairing were independently ligated to a pool of HLA-A, -B and –C for eight different individuals, one at a time and pooled for sequencing on a single SMRT Cell. Amplicons generated from barcoded primers were pooled upfront for library generation. Eight symmetric barcoded primers were generated for HLA class I genes. These primers facilitated multiplexing of 8 samples and also allowed generation of unique asymmetric pairings for simultaneous amplification from 28 reference genomic DNA samples. The data generated from all 3 methods was analyzed using LAA protocol in SMRT analysis V2.3. Consensus sequences generated were typed using GenDx NGS engine HLA-typing software.
Second-generation sequencing has brought about tremendous insights into the genetic underpinnings of biology. However, there are many functionally important and medically relevant regions of genomes that are currently difficult or impossible to sequence, resulting in incomplete and fragmented views of genomes. Two main causes are (i) limitations to read DNA of extreme sequence content (GC-rich or AT-rich regions, low complexity sequence contexts) and (ii) insufficient read lengths which leave various forms of structural variation unresolved and result in mapping ambiguities.
As a cost-effective alternative to whole genome human sequencing, targeted sequencing of specific regions, such as exomes or panels of relevant genes, has become increasingly common. These methods typically include direct PCR amplification of the genomic DNA of interest, or the capture of these targets via probe-based hybridization. Commonly, these approaches are designed to amplify or capture exonic regions and thereby result in amplicons or fragments that are a few hundred base pairs in length, a length that is well-addressed with short-read sequencing technologies. These approaches typically provide very good coverage and can identify SNPs in the targeted region, but are unable to haplotype these variants. Here we describe a targeted sequencing workflow that combines Roche NimbleGen’s SeqCap EZ enrichment technology with Pacific Biosciences’ SMRT Sequencing to provide a more comprehensive view of variants and haplotype information over multi-kilobase regions. While the SeqCap EZ technology is typically used to capture 200 bp fragments, we demonstrate that 6 kb fragments can also be utilized to enrich for long fragments that extend beyond the targeted capture site and well into (and often across) the flanking intronic regions. When combined with the long reads of SMRT Sequencing, multi-kilobase regions of the human genome can be phased and variants detected in exons, introns and intergenic regions.
Multiplexing human HLA class I & II genotyping with DNA barcode adapters for high throughput research.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DP and -DQ, play a critical role in the immune system as major factors responsible for organ transplant rejection. The have a direct or linkage-based association with several diseases, including cancer and autoimmune diseases, and are important targets for clinical and drug sensitivity research. HLA genes are also highly polymorphic and their diversity originates from exonic combinations as well as recombination events. A large number of new alleles are expected to be encountered if these genes are sequenced through the UTRs. Thus allele-level resolution is strongly preferred when sequencing HLA genes. Pacific Biosciences has developed a method to sequence the HLA genes in their entirety within the span of a single read taking advantage of long read lengths (average >10 kb) facilitated by SMRT technology. A highly accurate consensus sequence (=99.999 or QV50 demonstrated) is generated for each allele in a de novo fashion by our SMRT Analysis software. In the present work, we have combined this imputation-free, fully phased, allele-specific consensus sequence generation workflow and a newly developed DNA-barcode-tagged SMRTbell sample preparation approach to multiplex 96 individual samples for sequencing all of the HLA class I and II genes. Commercially available NGS-go reagents for full-length HLA class I and relevant exons of class II genes were amplified for hi-resolution HLA sequencing. The 96 samples included 72 that are part of UCLA reference panel and had pre-typing information available for 2 fields, based on gold standard SBT methods. SMRTbell adapters with 16 bp barcode tags were ligated to long amplicons in symmetric pairing. PacBio sequencing was highly effective in generating accurate, phased sequences of full-length alleles of HLA genes. In this work we demonstrate scalability of HLA sequencing using off the shelf assays for research applications to find biological significance in full-length sequencing.
Whole genome sequencing can provide comprehensive information important for determining the biochemical and genetic nature of all elements inside a genome. The high-quality genome references produced from past genome projects and advances in short-read sequencing technologies have enabled quick and cheap analysis for simple variants. However even with the focus on genome-wide resequencing for SNPs, the heritability of more than 50% of human diseases remains elusive. For non-human organisms, high-contiguity references are deficient, limiting the analysis of genomic features. The long and unbiased reads from single molecule, real-time (SMRT) Sequencing and new de novo assembly approaches have demonstrated the ability to detect more complicated variants and chromosome-level phasing. Moreover, with the recent advance of bioinformatics algorithms and tools, the computation tasks for completing high-quality de novo assembly of large genomes becomes feasible with commodity hardware. Ongoing development in sequencing technologies and bioinformatics will likely lead to routine generation of high-quality reference assemblies in the future. We discuss the current state of art and the challenges in bioinformatics toward such a goal. More specifically, explicit examples of pragmatic computational requirements for assembling mammalian-size genomes and algorithms suitable for processing diploid genomes are discussed.
Complete resequencing of extended genomic regions using fosmid target capture and single molecule real-time (SMRT) long read sequencing technology.
A longstanding goal of genomic analysis is the identification of causal genetic factors contributing to disease. While the common disease/common variant hypothesis has been tested in many genome-wide association studies, few advancements in identifying causal variation have been realized, and instead recent findings point away from common variants towards aggregate rare variants as causal. A challenge is obtaining complete phased genomic sequences over extended genomic regions from sufficient numbers of cases and controls to identify all potential variation causal of a disease. To address this, we modified methods for targeted DNA isolation using fosmid technology and single-molecule, long-sequence-read generaton that combine for complete, haplotype-resolved resequencing across extended genomic subregions. As proof of principal, we validated the approach by resequencing four 800 kbp segments that span a major histocompatibility complex (MHC) common extended haplotype (CEH) associated with disease. The data revealed the extent of conservation exposing a near identity among four DR4 CEHs over conserved regions, detailing rare variation and measuring sequence accuracy. In a second test, we sequenced the complete KIR haplotypes from 8 individuals within a specific timeframe and cost. Single molecule long-read sequencing technology generated contiguous full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with very little data processing. All of the sequences produced from these projects were contiguous, phased, with accuracy above 99.99%. The results demonstrated that cost-effective scale-up is possible to generate scores to hundreds of phased chromosomal sequences of extended lengths that can encompass genomic regions associated with disease.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enables high quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid aware de novo assembly of Craig Venter’s well-studied genome.
Access full spectrum of polymorphisms in HLA class I & II genes, without imputation for disease association and evolutionary research.
MHC class I and II genes are critically monitored by high-resolution sequencing for organ transplant decisions due to their role in GVHD. Their direct or linkage-based causal association, have increased their prominence as targets for drug sensitivity, autoimmune, cancer and infectious disease research. Monitoring HLA genes can however be tricky due to their highly polymorphic nature. Allele-level resolution is thus strongly preferred. However, most studies were historically focused on peptide binding domains of the HLA genes, due to technological challenges. As a result knowledge about the functional role of polymorphisms outside of exons 2 and 3 of HLA genes was rather limited. There are also relatively few full-length gene references currently available in the IMGT HLA database. This made it difficult to quickly adopt high-throughput reference-reliant methods for allele-level HLA sequencing. Increasing awareness regarding role of regulatory region polymorphisms of HLA genes in disease association1, nonetheless have brought about a revolution in full-length HLA gene sequencing. Researchers are now exploring ways to obtain complete information for HLA genes and integrate it with the current HLA database so it can be interpreted used by clinical researchers. We have explored advantages of SMRT Sequencing to obtain fully phased, allele-specific sequences of HLA class I and II genes for 96 samples using completely De novo consensus generation approach for imputation-free 4-field typing. With long read lengths (average >10 kb) and consensus accuracy exceeding 99.999% (Q50), a comprehensive snapshot of variants in exons, introns and UTRs could be obtained for spectrum of polymorphisms in phase across SNP-poor regions. Such information can provide invaluable insights in future causality association and population diversity research.
Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
The human reference sequence has provided a foundation for studies of genome structure, human variation, evolutionary biology, and disease. At the time the reference was originally completed there were some loci recalcitrant to closure; however, the degree to which structural variation and diversity affected our ability to produce a representative genome sequence at these loci was still unknown. Many of these regions in the genome are associated with large, repetitive sequences and exhibit complex allelic diversity such producing a single, haploid representation is not possible. To overcome this challenge, we have sequenced DNA from two hydatidiform moles (CHM1 and CHM13), which are essentially haploid. CHM13 was sequenced with the latest PacBio technology (P6-C5) to 52X genome coverage and assembled using Daligner and Falcon v0.2 (GCA_000983455.1, CHM13_1.1). Compared to the first mole (CHM1) PacBio assembly (GCA_001007805.1, 54X) contig N50 of 4.5Mb, the contig N50 of CHM13_1.1 is almost 13Mb, and there is a 13-fold reduction in the number of contigs. This demonstrates the improved contiguity of sequence generated with the new chemistry. We annotated 50,188 RefSeq transcripts of which only 0.63% were split transcripts, and the repetitive and segmental duplication content was within the expected range. These data all indicate an extremely high quality assembly. Additionally, we sequenced CHM13 DNA using Illumina SBS technology to 60X coverage, aligned these reads to the GRCh37, GRCh38, and CHM13_1.1 assemblies and performed variant calling using the SpeedSeq pipeline. The number of single nucleotide variants (SNV) and indels was comparable between GRCh37 and GRCh38. Regions that showed increased SNV density in GRCh38 compared to GRCh37 could be attributed to the addition of centromeric alpha satellite sequence to the reference assembly. Alternatively, regions of decreased SNV density in GRCh38 were concentrated in regions that were improved from BAC based sequencing of CHM1 such as 1p12 and 1q21 containing the SRGAP2 gene family. The alignment of PacBio reads to GRCh37 and GRCh38 assemblies allowed us to resolve complex loci such as the MHC region where the best alignment was to the DBB (A2-B57-DR7) haplotype. Finally, we will discuss how combining the two high quality mole assemblies can be used for benchmarking and novel bioinformatics tool development.
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LDs). This level of complexity and inherent biases of short-read sequencing make it challenging for extracting immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS based genome and exome sequencing and SNP arrays have become a routine for population studies, numerous efforts are being made for developing software to extract and or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases, has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of genes comprised within them. All the genotype information is detected at allele- level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. De novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.