Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods that provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA sequencing with short, high-accuracy reads (SMRT circular consensus sequencing (CCS) reads or second-generation reads) to generate long, highly accurate reads that are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which long SMRT sequencing reads (average read lengths >5,000 bases) are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run and data set. We have applied this method to numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori, achieving closed de novo genomes with accuracies exceeding QV50 (>99.999%). The kinetic information from the same SMRT sequencing reads is utilized to determine epigenomes. 
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. The process has been automated and requires less than 1 day from an unknown DNA sample to its complete de novo genome and epigenome.
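The relationship between the quoted quality value and accuracy (QV50 corresponding to 99.999%) follows the Phred convention, QV = -10 log10(error rate). The short sketch below only illustrates that arithmetic; the function names and the 5 Mb genome length are illustrative assumptions, not values or tools from the work described.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value (QV) to fractional accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert fractional accuracy (0 <= accuracy < 1) to a Phred-scaled QV."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 to give a finite QV")
    return -10.0 * math.log10(error)


def expected_errors(qv: float, genome_length: int) -> float:
    """Expected number of consensus errors in a genome of the given length."""
    return genome_length * 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # Hypothetical 5 Mb bacterial genome, used purely for illustration.
    for qv in (40, 50, 60):
        print(qv, qv_to_accuracy(qv), expected_errors(qv, 5_000_000))
```

Under this convention, each additional 10 QV units reduces the expected number of consensus errors tenfold.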
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed because of the coverage biases associated with short-read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obviate the need for assembly. Direct sequencing of DNA in real time has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). 
This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
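For readers unfamiliar with the contig N50 statistic quoted above, the following minimal sketch shows the usual definition: the length of the contig at which the cumulative sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and the toy contig lengths are hypothetical and are not data from the assemblies described.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # N50 is the first contig at which the running total covers
        # at least half of all assembled bases.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # hypothetical lengths, arbitrary units
    print(contig_n50(toy_contigs))
```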
Fully phased allele-level sequencing of highly polymorphic HLA genes is greatly facilitated by SMRT Sequencing technology. In the present work, we have evaluated multiple DNA barcoding strategies for multiplexing several loci from multiple individuals, using three different tagging methods. Specifically, MHC class I genes HLA-A, -B, and -C were indexed with DNA barcodes using either tailed primers or barcoded SMRTbell adapters. Eight different 16-bp barcode sequences were used in symmetric and asymmetric pairings. Eight DNA-barcoded adapters in symmetric pairing were independently ligated to a pool of HLA-A, -B and -C amplicons for eight different individuals, one at a time, and then pooled for sequencing on a single SMRT Cell. Amplicons generated from barcoded primers were pooled upfront for library generation. Eight symmetric barcoded primers were generated for HLA class I genes. These primers facilitated multiplexing of 8 samples and also allowed generation of unique asymmetric pairings for simultaneous amplification from 28 reference genomic DNA samples. The data generated from all three methods were analyzed using the LAA protocol in SMRT Analysis v2.3. The consensus sequences generated were typed using GenDx NGS engine HLA-typing software.
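The sample counts above are consistent with simple combinatorics: eight barcodes give 8 symmetric pairings (same barcode on both ends) and, if asymmetric pairings are counted as unordered pairs of distinct barcodes, 8 choose 2 = 28 pairings. The abstract does not state how the pairings were enumerated, so the sketch below is an assumption that merely reproduces those counts; the barcode labels are placeholders, not the actual 16-bp sequences.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def barcode_pairings(barcodes: Sequence[str]) -> Tuple[List[Pair], List[Pair]]:
    """Enumerate symmetric and (unordered) asymmetric barcode pairings."""
    # Symmetric: the same barcode on both ends of the amplicon.
    symmetric = [(b, b) for b in barcodes]
    # Asymmetric: two different barcodes, orientation ignored (assumption).
    asymmetric = list(combinations(barcodes, 2))
    return symmetric, asymmetric


# Placeholder labels standing in for the eight 16-bp barcode sequences.
barcodes = [f"bc{i}" for i in range(1, 9)]
symmetric, asymmetric = barcode_pairings(barcodes)

if __name__ == "__main__":
    print(len(symmetric), len(asymmetric))
```

If orientation were treated as distinguishable, the number of asymmetric pairings would double, so the unordered interpretation is the one that matches the 28 samples mentioned.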
Multiplexing human HLA class I & II genotyping with DNA barcode adapters for high throughput research.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DP and -DQ, play a critical role in the immune system as major factors responsible for organ transplant rejection. They have a direct or linkage-based association with several diseases, including cancer and autoimmune diseases, and are important targets for clinical and drug sensitivity research. HLA genes are also highly polymorphic and their diversity originates from exonic combinations as well as recombination events. A large number of new alleles are expected to be encountered if these genes are sequenced through the UTRs. Thus allele-level resolution is strongly preferred when sequencing HLA genes. Pacific Biosciences has developed a method to sequence the HLA genes in their entirety within the span of a single read, taking advantage of the long read lengths (average >10 kb) facilitated by SMRT technology. A highly accurate consensus sequence (≥99.999%, or QV50, demonstrated) is generated for each allele in a de novo fashion by our SMRT Analysis software. In the present work, we have combined this imputation-free, fully phased, allele-specific consensus sequence generation workflow and a newly developed DNA-barcode-tagged SMRTbell sample preparation approach to multiplex 96 individual samples for sequencing all of the HLA class I and II genes. Full-length HLA class I genes and relevant exons of class II genes were amplified with commercially available NGS-go reagents for high-resolution HLA sequencing. The 96 samples included 72 that are part of the UCLA reference panel and had pre-typing information available for 2 fields, based on gold-standard SBT methods. SMRTbell adapters with 16-bp barcode tags were ligated to long amplicons in symmetric pairing. PacBio sequencing was highly effective in generating accurate, phased sequences of full-length alleles of HLA genes. 
In this work we demonstrate the scalability of HLA sequencing using off-the-shelf assays for research applications to find biological significance in full-length sequencing.
Aim: In contrast to exon-based HLA-typing approaches, whole gene genotyping crucially depends on full-length sequences submitted to the IMGT/HLA Database. Currently, full-length sequences are provided for only 7 out of 520 HLA-DPB1 alleles. Therefore, we developed a fully phased whole-gene sequencing approach for DPB1, to facilitate further exploration of the allelic structure at this locus. Methods: Primers were developed flanking the UTR-regions of DPB1 resulting in a 12 kb amplicon. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene-specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e. annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48 amplicons were sequenced full length in single runs on a Pacific Biosciences RSII instrument. Demultiplexing was performed using the SMRT portal. Sequence analysis was performed using the NGSengine software (GenDx). Results: We analyzed a set of 48 randomly picked samples. With 3 exceptions due to PCR failure, all genotype assignments conformed to standard genotyping results based on exons 2 and 3. Allelic proportions for heterozygous positions were evenly distributed (range 0.4 – 0.6) for all samples, suggesting unbiased amplifications. Despite the high per-read raw error rates typical for SMRT sequencing (~15%) the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. We describe novel intronic sequence variation of the 7 so far genomically defined alleles, as well as 7 whole-length DPB1 alleles with hitherto unknown intronic regions. One of these alleles (HLA-DPB1*131:01) is classified as rare. 
Conclusion: Here we present a whole gene amplification and sequencing workflow for DPB1 alleles utilizing single molecule real-time (SMRT) sequencing from Pacific Biosciences. Validation of consensus sequences against known exonic sequences highlights the reliability of this technology. This workflow will facilitate amending the IMGT/HLA Database for DPB1.
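The allelic-balance criterion reported in the Results (allelic proportions at heterozygous positions between 0.4 and 0.6) can be expressed as a simple check on per-allele read counts. This is a minimal sketch under assumed inputs: the function names and the read counts in the usage example are hypothetical, and the abstract does not describe how the proportions were actually computed.

```python
def allele_fraction(first_allele_reads: int, second_allele_reads: int) -> float:
    """Fraction of reads supporting the first allele at a heterozygous position."""
    total = first_allele_reads + second_allele_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return first_allele_reads / total


def is_balanced(fraction: float, low: float = 0.4, high: float = 0.6) -> bool:
    """True if the allelic proportion lies within the range quoted above."""
    return low <= fraction <= high


if __name__ == "__main__":
    # Hypothetical read counts for two heterozygous positions.
    for counts in [(52, 48), (80, 20)]:
        fraction = allele_fraction(*counts)
        print(counts, fraction, is_balanced(fraction))
```

A proportion far outside this window at a heterozygous position would point to preferential amplification of one allele.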
Access full spectrum of polymorphisms in HLA class I & II genes, without imputation for disease association and evolutionary research.
MHC class I and II genes are critically monitored by high-resolution sequencing for organ transplant decisions due to their role in GVHD. Their direct or linkage-based causal associations have increased their prominence as targets for drug sensitivity, autoimmune, cancer and infectious disease research. Monitoring HLA genes can, however, be difficult because of their highly polymorphic nature. Allele-level resolution is thus strongly preferred. However, most studies have historically focused on the peptide-binding domains of the HLA genes because of technological challenges. As a result, knowledge about the functional role of polymorphisms outside of exons 2 and 3 of HLA genes has been rather limited. There are also relatively few full-length gene references currently available in the IMGT/HLA Database. This has made it difficult to quickly adopt high-throughput, reference-reliant methods for allele-level HLA sequencing. Increasing awareness of the role of regulatory-region polymorphisms of HLA genes in disease association [1] has nonetheless brought about a revolution in full-length HLA gene sequencing. Researchers are now exploring ways to obtain complete information for HLA genes and integrate it with the current HLA database so it can be interpreted and used by clinical researchers. We have explored the advantages of SMRT Sequencing to obtain fully phased, allele-specific sequences of HLA class I and II genes for 96 samples using a completely de novo consensus generation approach for imputation-free 4-field typing. With long read lengths (average >10 kb) and consensus accuracy exceeding 99.999% (QV50), a comprehensive snapshot of variants in exons, introns and UTRs could be obtained, giving the spectrum of polymorphisms in phase across SNP-poor regions. Such information can provide invaluable insights in future causality association and population diversity research.
The Human Leukocyte Antigen (HLA) genes located on chromosome 6 are responsible for regulating immune function via antigen presentation and are one of the determining factors for stem cell and organ transplantation compatibility. Additionally, various alleles within this region have been implicated in autoimmune disorders, cancer, vaccine response and both non-infectious and infectious disease risk. The HLA region is highly variable, containing repetitive regions and co-dominantly expressed genes. This complicates short-read mapping and means that assessing the effect of variation within a gene requires full phase information to resolve haplotypes. One solution to the problem of HLA identification is the use of statistical inference to suggest the most likely diploid alleles given the genotypes observed. This approach assumes the availability of an extensive reference panel. Whilst there exists good population genetics data for imputing European populations, there remains a paucity of information about variation in African populations. Filling this gap is one of the aims of the Genome Diversity in Africa Project, and as a first step we are performing a pilot study to identify the optimal method for determining HLA type information for large numbers of samples from African populations. To that end we have obtained samples from 125 consented African participants selected from 5 populations across Africa (Moroccan, Ashanti, Igbo, Kalenjin, and Zulu). The methods included in our pilot study are Sanger sequencing (ABI); NGS on the HiSeq X Ten platform (Illumina); long-range PCR combined with single molecule real-time (SMRT) sequencing (PacBio); and, for a subset of samples, library preparation on the GemCode Platform (10x Genomics), which delivers valuable long-range contextual information, combined with Illumina NGS sequencing. Results from capillary sequencing suggest the presence of a minimum of two novel alleles. 
Long-range PCR was performed initially on a subset of samples using both primers sourced from GenDx and primers designed as described in Shiina et al. (2012). Initial results from both primer sets were promising on Promega DNA test samples, but only the GenDx primers proved effective on the African samples, consistently producing PCR products of the expected size in the Igbo, Ashanti, Moroccan and Zulu samples. We will present early results from our evaluation of the different sequencing technologies.
With the increasing availability of whole-genome sequencing, haplotype reconstruction of individual genomes, or haplotype assembly, remains unsolved. Like the de novo genome assembly problem, haplotype assembly is greatly simplified by having more long-range information. The Targeted Locus Amplification (TLA) technology from Cergentis has the unique capability of targeting a specific region of the genome using a single primer pair and yielding ~2 kb DNA circles that are comprised of ~500 bp fragments. Fragments from the same circle come from the same haplotype and follow an exponential decay in distance from the target region, with a span that reaches the multi-megabase range. Here, we apply TLA to the BRCA1 gene on NA12878 and then sequence the resulting 2 kb circles on a PacBio RS II. The multiple fragments per circle were iteratively mapped to hg19 and then haplotype assembled using HAPCUT. We show that the 80 kb length of BRCA1 is represented by a single haplotype block, which was validated against GIAB data. We then explored chromosomal-scale haplotype assembly by combining these data with whole genome shotgun PacBio long reads, and demonstrate haplotype blocks approaching the length of chromosome 17 on which BRCA1 lies. Finally, by performing TLA without the amplification step and size selecting for reads >5 kb to maximize the number of fragments per read, we target whole genome haplotype assembly across all chromosomes.
The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
The major histocompatibility complex (MHC), or human leukocyte antigen (HLA) in humans, is a highly diverse gene family with a key role in immune response to disease, and has been implicated in auto-immune disease, cancer, infectious disease susceptibility, and vaccine response. It has clinical importance in the field of solid organ and bone marrow transplantation, where matching of donor and recipient HLA types is key to transplanted organ outcomes. The Sanger-based typing (SBT) methods currently used in clinical practice do not capture the full diversity across this region, and require specific reference sequences to deconvolute ambiguity in HLA types. However, reference databases are based largely on European populations, and the full extent of diversity in Africa remains poorly understood. Here, we present the first systematic characterisation of HLA diversity within Africa in the pilot phase of the MHC Diversity in Africa Project, together with an evaluation of methods to carry out scalable, cost-effective and reliable typing of this region in African populations. To sample a geographically representative panel of African populations we obtained 125 samples, 25 each from the Zulu (South Africa), Igbo (Nigeria), Kalenjin (Kenya), Moroccan and Ashanti (Ghana) groups. For methods validation we included two controls from the International Histocompatibility Working Group (IHWG) collection with known typing information. Sanger typing and Illumina HiSeq X sequencing of these samples indicated potentially novel Class I and Class II alleles; however, we found poor correlation between HiSeq X sequencing and SBT for both classes. Long-range PCR and high-resolution PacBio RS II typing of 4 of these samples identified 7 novel Class II alleles, highlighting the high levels of diversity in these populations, and the need for long-read sequencing approaches to characterise this comprehensively. 
We have now expanded this approach to the entire pilot set of 125 samples. We present these confirmed types and discuss a workflow for scaling this to 5000 individuals across Africa. The large number of new alleles identified in our pilot points to a high level of African HLA diversity and to the utility of high-resolution methods. The MDAP project will provide a framework for accurate HLA typing, in addition to providing an invaluable resource for imputation in GWAS, boosting power to identify and resolve HLA disease associations.
Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements
Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes, including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering >30 Gb per SMRT Cell 1M. By integrating long- and short-read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.
Structural variant detection with long read sequencing reveals driver and passenger mutations in a melanoma cell line
Past large-scale cancer genome sequencing efforts, including The Cancer Genome Atlas and the International Cancer Genome Consortium, have utilized short-read sequencing, which is well-suited for detecting single nucleotide variants (SNVs) but far less reliable for detecting variants larger than 20 base pairs, including insertions, deletions, duplications, inversions and translocations. Recent same-sample comparisons of short- and long-read human reference genome data have revealed that short-read resequencing typically uncovers only ~4,000 structural variants (SVs, ≥50 bp) per genome and is biased towards deletions, whereas sequencing with PacBio long reads consistently finds ~20,000 SVs, evenly balanced between insertions and deletions. This discovery has important implications for cancer research, as it is clear that SVs are both common and biologically important in many cancer subtypes, including colorectal, breast and ovarian cancer. Without confident and comprehensive detection of structural variants, it is unlikely we have a sufficiently complete picture of all the genomic changes that impact cancer development, disease progression, treatment response, drug resistance, and relapse. To begin to address this unmet need, we have sequenced the COLO829 tumor and matched normal lymphoblastoid cell lines to 49- and 51-fold coverage, respectively, with PacBio SMRT Sequencing, with the goal of developing a high-confidence structural variant call set that can be used to empirically evaluate cost-effective experimental designs for larger-scale studies and develop structural variation calling software suitable for cancer genomics. Structural variant calling revealed over 21,000 deletions and 19,500 insertions larger than 20 bp, nearly four times the number of events detected with short-read sequencing. The vast majority of events are shared between the tumor and normal, with about 100 putative somatic deletions and 400 insertions, primarily in microsatellites. 
A further 40 rearrangements were detected, nearly exclusively in the tumor. One rearrangement, t(5;X), is shared between the tumor and normal; it disrupts the mismatch repair gene MSH3 and is likely a driver mutation. Generating high-confidence call sets that cover the entire size spectrum of somatic variants from a range of cancer model systems is the first step in determining the best approach for addressing an ongoing blind spot in our current understanding of cancer genomes. Here the application of PacBio sequencing to a melanoma cancer cell line revealed thousands of previously overlooked variants, including a mutation likely involved in tumorigenesis.
NGS is commonly used for amplicon sequencing in clinical applications to study genetic disorders and detect disease-causing mutations. This approach can be plagued by limited ability to phase sequence variants and makes interpretation of sequence data difficult when pseudogenes are present. Long-read highly accurate amplicon sequencing can provide very accurate, efficient, high throughput (through multiplexing) sequences from single molecules, with read lengths largely limited by PCR. Data is easy to interpret; phased variants and breakpoints are present within high fidelity individual reads. Here we show SMRT Sequencing of the PMS2 and OPN1 (MW and LW) genes using the Sequel System. Homologous regions make NGS and MLPA results very difficult to interpret.
Background: The sequencing and haplotype phasing of entire gene sequences improves the understanding of the genetic basis of disease and drug response. One example is cystic fibrosis (CF). Cystic fibrosis transmembrane conductance regulator (CFTR) modulator therapies have revolutionized CF treatment, but only in a minority of CF subjects. Observed heterogeneity in CFTR modulator efficacy is related to the range of CFTR mutations; revertant mutations can modify the response to CFTR modulators, and other intronic variations in the ~200 kb CFTR gene have been linked to disease severity. Heterogeneity in the CFTR gene may also be linked to differential responses to CFTR modulators. The Targeted Locus Amplification (TLA) technology from Cergentis can be used to selectively amplify, sequence and phase the entire CFTR gene. With PacBio long-read SMRT Sequencing, TLA amplicons are sequenced intact and long-range phasing information of all fragments in entire amplicons is retrieved. Experimental Design and Methods: The TLA process produces amplicons consisting of 5-10 proximity ligated DNA fragments. TLA was performed on cell line and genomic DNA from Coriell GM12878, which has few heterozygous SNVs in CFTR, and the IB3 cell line, with known haplotypes but heterozygous for the delta508 mutation. All sample types were prepared with high and low density TLA primer sets, targeting coverage of >100 kb of the CFTR gene. Conclusion: We have demonstrated the power and utility of TLA with long-read SMRT Sequencing as a valuable research tool in sequencing and phasing across very long regions of the human genome. This process can be done in an efficient manner, multiplexing multiple genes and samples per SMRT Cell in a process amenable to high-throughput sequencing.
ASHG Virtual Poster: The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
In this ASHG 2016 poster video, Martin Pollard from the Wellcome Trust Sanger Institute and the University of Cambridge describes an ambitious project to better represent natural variation in the…
In this AGBT 2017 talk, PacBio CSO Jonas Korlach provided a technology roadmap for the Sequel System, including plans to continue performance and throughput increases through early 2019. Per SMRT…