Increased sequencing throughput creates a need for multiplexing in several applications. Here we detail different barcoding strategies for microbial sequencing, targeted sequencing, Iso-Seq full-length isoform sequencing, and Roche NimbleGen’s target enrichment method.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enable high-quality de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variation at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important for understanding the genetic basis of human disease and difficult to access with other technologies. To demonstrate this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
Whole gene sequencing of KIR-3DL1 with SMRT Sequencing and the distribution of allelic variants in different ethnic groups
The killer-cell immunoglobulin-like receptor (KIR) gene family is involved in immune modulation during viral infection, autoimmune disease and allogeneic stem cell transplantation. Most studies of KIR gene diversity and its impact on transplant outcome are performed using gene presence/absence assays. However, it is well known that KIR gene allelic variations have biological significance. Allele-level typing of KIR genes has been very challenging until recently because of the homologous nature of these genes and their very long intronic sequences. SMRT (Single Molecule, Real-Time) Sequencing generates long reads averaging 10 to 15 kb and allows us to obtain long, in-phase sequence reads. We have developed a PCR assay for SMRT Sequencing on the PacBio RS II platform in our lab for 3DL1 whole gene sequencing. This approach allows us to obtain allele-level typing for 3DL1 genes and could serve as a model for typing other KIR genes at the allelic level.
Collection of major HLA allele sequences in Japanese population toward the precise NGS based HLA DNA typing at the field 4 level
We previously reported on the use of the Ion PGM next generation sequencing (NGS) platform to genotype HLA class I and class II genes by a super-high resolution, single-molecule, sequence-based typing (SS-SBT) method (Shiina et al. 2012). However, HLA alleles could not be assigned at the field 4 level at some HLA loci such as DQA1, DPA1 and DPB1 because the SNP and indel densities were too low to identify and separate both of the phases. In this regard, we have now added the single molecule, real-time (SMRT) DNA sequencer PacBio RS II method to our analysis in order to test whether it might determine the HLA allele sequences in some of the loci with which we previously had difficulties. In this study, we report on sequence-based genotyping of entire HLA gene sequences from the promoter-enhancer region to 3’UTR of the major HLA loci (A, B, C, DRB1, DRB345, DQA1, DQB1, DPA1 and DPB1) using 46 Japanese reference subjects who represented a distribution of more than 99.5% of the HLA alleles at each of the HLA loci and the PacBio RS II and Ion PGM systems.
A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual risk variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430). Long-read sequencing together with hybrid-capture targeting technologies provides a powerful combination to target candidate genes/transcripts of interest. Shearing the genomic DNA to ~5 kb fragments and then capturing with probes that span the whole gene(s) of interest can provide uniform coverage across the entire region, identifying variants and allowing for phasing into two haplotypes. Furthermore, capturing full-length cDNA from the same sample using the same capture probes can also provide an understanding of the isoforms that are generated and allow them to be assigned to their corresponding haplotype. Here we present a method for capturing genomic DNA and cDNA from an AD sample using a panel of probes targeting approximately 20 late-onset AD candidate genes, including CLU, ABCA7, CD33, TREM2, TOMM40, PSEN2, APH1 and BIN1. By combining xGen® Lockdown® probes with SMRT Sequencing, we provide completely sequenced candidate genes as well as their corresponding transcripts. In addition, we are also able to evaluate structural variants that, due to their size, repetitive nature, or low sequence complexity, have been un-sequenceable using short-read technologies.
Early detection of colorectal cancer (CRC) and its precursor lesions (adenomas) is crucial to reduce mortality rates. The fecal immunochemical test (FIT) is a non-invasive CRC screening test that detects the blood-derived protein hemoglobin. However, FIT sensitivity is suboptimal especially in detection of CRC precursor lesions. As adenoma-to-carcinoma progression is accompanied by alternative splicing, tumor-specific proteins derived from alternatively spliced RNA transcripts might serve as candidate biomarkers for CRC detection.
Screening and characterization of causative structural variants for bipolar disorder in a significantly linked chromosomal region on Xq24-q27 in an extended pedigree from a genetic isolate
Bipolar disorder (BD) is a phenotypically and genetically complex and debilitating neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin and adoption studies supporting the involvement of a genetic predisposition in BD, with estimated heritability up to ~80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BD susceptibility; however, no disease genes have been identified to date.
Over the past decades, neurological disorders have been extensively studied, producing a large number of candidate genomic regions and candidate genes. The SNPs identified in these studies rarely represent the true disease-related functional variants. More recently, however, a shift in focus from SNPs to larger structural variants has yielded breakthroughs in our understanding of neurological disorders. Here we have developed candidate gene screening methods that combine enrichment of long DNA fragments with long-read sequencing optimized for structural variation discovery. We have also developed a novel, amplification-free enrichment technique using the CRISPR/Cas9 system to target genomic regions. We sequenced gDNA and full-length cDNA extracted from the temporal lobe of two Alzheimer’s patients for 35 GWAS candidate genes. The multi-kilobase long reads allowed for phasing across the genes and detection of a broad range of genomic variants, from SNPs to multi-kilobase insertions, deletions and inversions. In the full-length cDNA data we detected differential allelic isoform complexity, novel exons, and novel transcript isoforms. Combining the gDNA data with full-length isoform characterization allows us to build a more comprehensive view of the underlying biological disease mechanisms in Alzheimer’s disease. Using the novel PCR-free CRISPR/Cas9 enrichment method, we screened several genes, including C9ORF72, whose hexanucleotide repeat expansion is associated with 40% of familial ALS cases. This method excludes PCR bias and errors in an otherwise hard-to-amplify region and preserves base modifications at the single-molecule level, which makes it possible to capture mosaicism present in the sample.
The expression of androgen receptor (AR) variants is a frequent, yet poorly understood mechanism of clinical resistance to AR-targeted therapy for castration-resistant prostate cancer (CRPC). Among the multiple AR variants expressed in CRPC, AR-V7 is considered the most clinically relevant due to its broad expression in CRPC, correlations of AR-V7 expression with clinical resistance, and growth inhibition when AR-V7 is knocked down in CRPC models. Therefore, efforts are under way to develop strategies for monitoring and inhibiting AR-V7 in CRPC. The aim of this study was to understand whether other AR variants are co-expressed with AR-V7 and promote resistance to AR-targeted therapies. To address this, we utilized RNA-seq to characterize AR expression in CRPC models. RNA-seq revealed frequent co-expression of AR-V9 and AR-V7 in multiple CRPC models and metastases. Furthermore, long-read single-molecule real-time (SMRT) sequencing of AR isoforms revealed that AR-V7 and AR-V9 shared a common 3’ terminal cryptic exon. To test this, we knocked down AR-V7 in prostate cancer cell lines and confirmed that AR-V9 mRNA and protein expression were also impacted. In reporter assays with AR-responsive promoters, AR-V9 functioned as a constitutive activator of androgen/AR signaling. Similarly, infection of LNCaP cells with an AR-V9 lentiviral construct induced androgen-independent cell proliferation. In conclusion, these data implicate co-expression of AR-V9 with AR-V7 as an important component of constitutive AR signaling and therapeutic resistance in CRPC.
Library prep and bioinformatics improvements for full-length transcript sequencing on the PacBio Sequel System
The PacBio Iso-Seq method produces high-quality, full-length transcripts of 10 kb and longer and has been used to annotate many important plant and animal genomes. Here we describe an improved, simplified library workflow and analysis pipeline that reduces library preparation time, RNA input, and cost. The Iso-Seq V2 Express workflow is a one-day protocol that requires only ~300 ng of total RNA input while also reducing the reverse transcription and amplification steps to single reactions. Compared with the previous workflow, the Iso-Seq V2 Express workflow increases the percentage of full-length (FL) reads while achieving a higher average transcript length. At the same time, the Iso-Seq 3 analysis recently released in the SMRT Link 6.0 software is a major improvement over previous versions. Iso-Seq 3 is highly accurate at detecting and removing library artifacts (TSO and RT artifacts) as well as differentiating barcodes in multiplexed samples. Iso-Seq 3 matches previous versions in the output of high-quality transcript sequences while dramatically reducing runtime and memory usage.
Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements
Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes, including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering > 30 Gb per SMRT Cell 1M. By integrating long- and short-read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.
Single cell isoform sequencing (scIso-Seq) identifies novel full-length mRNAs and cell type-specific expression
Single cell RNA-seq (scRNA-seq) is an emerging field for characterizing cell heterogeneity in complex tissues. However, most scRNA-seq methodologies are limited to gene count information due to short read lengths. Here, we combine the microfluidics scRNA-seq technique, Drop-Seq, with PacBio Single Molecule, Real-Time (SMRT) Sequencing to generate full-length transcript isoforms that can be confidently assigned to individual cells. We generated single cell Iso-Seq (scIso-Seq) libraries for chimp and human cerebral organoid samples on the Dolomite Nadia platform and sequenced each library with two SMRT Cells 8M on the PacBio Sequel II System. We developed a bioinformatics pipeline to identify, classify, and filter full-length isoforms at the single-cell level. We show that scIso-Seq reveals full-length isoform information not accessible using short reads that can reveal differences between cell types and amongst different species.
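As a rough illustration of what assigning full-length isoforms to individual cells can look like computationally, the following Python sketch tallies isoforms per cell from already-parsed reads. It is not the authors' pipeline: the function name, the input tuple layout, the toy barcodes, and the use of UMIs to collapse duplicate molecules are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def isoforms_per_cell(reads):
    """Count distinct molecules per isoform for each cell.

    `reads` is an iterable of (cell_barcode, umi, isoform_id) tuples.
    This layout is hypothetical; a real pipeline would first extract the
    barcode and UMI from each read and classify the read's isoform.
    """
    seen = set()                   # (cell, umi, isoform) combinations already counted
    counts = defaultdict(Counter)  # cell barcode -> Counter of isoform ids
    for cell, umi, isoform in reads:
        key = (cell, umi, isoform)
        if key in seen:
            continue               # same molecule observed again; skip the duplicate
        seen.add(key)
        counts[cell][isoform] += 1
    return counts


# Toy, made-up input: two cells, one duplicated molecule in the first cell.
toy_reads = [
    ("CELL_A", "UMI1", "geneX.iso1"),
    ("CELL_A", "UMI1", "geneX.iso1"),  # duplicate of the read above
    ("CELL_A", "UMI2", "geneX.iso2"),
    ("CELL_B", "UMI3", "geneX.iso1"),
]
per_cell = isoforms_per_cell(toy_reads)
for cell in sorted(per_cell):
    print(cell, dict(per_cell[cell]))
```

Keying the counts by cell barcode first keeps the per-cell isoform profile together, which is the view needed to compare isoform usage between cell types.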
Structural variant in the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene found to be linked to bipolar disorder
Bipolar disorder (BD) is a phenotypically and genetically complex neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin and adoption studies supporting the involvement of a genetic predisposition, with estimated heritability up to ~80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BD susceptibility; however, no disease genes have yet been identified. Here, we have fully characterized a ~12 Mb significantly linked (LOD score = 3.54) genomic region on chromosome Xq24-q27 in an extended family from a genetic isolate using long-read single molecule, real-time (SMRT) sequencing. The family segregates BD in at least 4 generations with 16 individuals out of 61 affected. Thus, this family portrays a highly elevated recurrence risk compared to the general population. Genetic complexity is expected to be reduced in isolated populations, even for genetically complex disorders such as BD, as in the case of this extended family. We selected 16 key individuals from the X-chromosomally linked family to be sequenced. These selected individuals either carried the disease haplotype, were non-carriers of the disease haplotype, or served as married-in controls. We designed a NimbleGen capture array enriching for 5-9 kb fragments spanning the entire 12 Mb region, which were then sequenced using long-read SMRT sequencing to screen for causative structural variants (SVs) explaining the increased risk for BD in this extended family. Altogether, 192 SVs were detected in the critically linked region; however, most of these represented common variants that could be seen across many of the family members regardless of disease status. One SV stood out, showing perfect segregation among all affected individuals who were carriers of the disease haplotype. This was a 330 bp Alu deletion in intron 4 of the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene, which has previously been shown to play a central role in brain development and function. Moreover, Alu elements in general have previously been associated with at least 37 neurological and neurodegenerative disorders. To validate the finding and the functionality of the identified SV, further studies such as isoform characterization are warranted.
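The segregation criterion described above (an SV present in every affected carrier of the disease haplotype but not shared with the other sequenced individuals) can be sketched as a simple set filter. The Python below is only a minimal illustration under stated assumptions: the function name, the sample labels, and the SV identifiers are hypothetical, and the study's actual filtering procedure is not described in the abstract.

```python
def perfectly_segregating(sv_carriers, affected_carriers, non_carriers):
    """Return ids of SVs found in every affected carrier and in no non-carrier.

    `sv_carriers` maps an SV id to the set of samples in which it was called.
    All names and data here are illustrative, not taken from the study.
    """
    hits = []
    for sv_id, samples in sv_carriers.items():
        in_all_affected = affected_carriers <= samples  # subset test
        in_no_control = not (non_carriers & samples)    # empty intersection
        if in_all_affected and in_no_control:
            hits.append(sv_id)
    return sorted(hits)


# Toy, made-up call set with three affected carriers and two controls.
affected = {"A1", "A2", "A3"}
controls = {"C1", "C2"}
calls = {
    "sv_common": {"A1", "A2", "A3", "C1", "C2"},  # seen regardless of status
    "sv_partial": {"A1", "A2"},                   # missing in one affected carrier
    "sv_candidate": {"A1", "A2", "A3"},           # segregates with the haplotype
}
candidates = perfectly_segregating(calls, affected, controls)
print(candidates)
```

Common variants drop out because they also appear in controls, which mirrors the observation that most of the detected SVs were shared across family members regardless of disease status.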
ASHG PacBio Workshop: SMRT Sequencing as a translational research tool to investigate germline, somatic and infectious diseases
Melissa Laird Smith discussed how the Icahn School of Medicine at Mount Sinai uses long-read sequencing for translational research. She gave several examples of targeted sequencing projects run on the…