While advances in RNA sequencing methods have accelerated our understanding of the human transcriptome, isoform discovery remains a challenge because short read lengths require complicated assembly algorithms to infer the contiguity of full-length transcripts. With PacBio’s long reads, one can now sequence full-length transcript isoforms up to 10 kb. The PacBio Iso-Seq protocol produces reads that originate from independent observations of single molecules, meaning no assembly is needed. Here, we sequenced the transcriptome of the human MCF-7 breast cancer cell line using the Clontech SMARTer® cDNA preparation kit and the PacBio RS II. Using PacBio Iso-Seq bioinformatics software, we obtained 55,770 unique, full-length, high-quality transcript sequences that were subsequently mapped back to the human genome with ≥99% accuracy. In addition, we identified both known and novel fusion transcripts. To assess our results, we compared the predicted ORFs from the PacBio data against a published mass spectrometry dataset from the same cell line. 84% of the proteins identified with the UniProt protein database were recovered by the PacBio predictions. Notably, 251 peptides matched solely to the PacBio-generated ORFs and were entirely novel, including abundant cases of single amino acid polymorphisms, cassette exon splicing and potential alternative protein coding frames.
Recent advances in next-generation sequencing have led to the increased use of formalin-fixed and paraffin-embedded (FFPE) tissues for medical samples in disease and scientific research. Single Molecule, Real-Time (SMRT) Sequencing offers a unique advantage in that it allows direct analysis of FFPE samples without amplification. However, obtaining ample long-read information from FFPE samples has been a challenge due to the quality and quantity of the extracted DNA. DNA samples extracted from FFPE often contain damaged sites, including breaks in the backbone and missing or altered nucleotide bases, which directly impact sequencing and amplification. Additionally, the quality and quantity of the recovered DNA vary depending on the extraction methods used. We have evaluated the Adaptive Focused Acoustics (AFA™) system by Covaris as a method for obtaining high molecular weight DNA suitable for SMRTbell template preparation and subsequent single molecule sequencing. Using this method, genomic DNA was extracted from normal kidney FFPE scrolls acquired from the Cooperative Human Tissue Network (CHTN), University of Pennsylvania. Damaged sites present in the extracted DNA were repaired using a DNA Damage Repair step, and the treated DNA was constructed into SMRTbell libraries suitable for sequencing on the PacBio RS II System. Using the same repaired DNA, we also tested PCR efficiency of target gene regions of up to 5 kb. The resulting amplicons were constructed into SMRTbell templates for full-length sequencing on the PacBio RS II System. We found the Adaptive Focused Acoustics (AFA) system combined with truXTRAC™ by Covaris to be effective and efficient. This system is simple to use, and the resulting DNA is compatible with SMRTbell library preparation for targeted and whole genome SMRT Sequencing. The data presented here demonstrate single molecule sequencing of DNA samples extracted from tissues embedded in FFPE.
Highly sensitive and cost-effective detection of somatic cancer variants using single-molecule, real-time sequencing
Next-Generation Sequencing (NGS) technologies allow for molecular profiling of cancer samples with high sensitivity and speed at reduced cost. For efficient profiling of cancer samples, it is important that the NGS methods used are not only robust, but capable of accurately detecting low-frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays for studying cancer variants. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standards from Horizon Discovery, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. First, we sequenced a reference standard containing precise allelic frequencies from 1% to 24.5% for major oncology targets verified using digital PCR. This reference material recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot MASTR Plus assay that targets 252 amplicons (121-254 bp) from 26 relevant cancer genes, which includes all 11 variants in the control sample. Next, we sequenced control samples prepared by SeraCare Life Sciences, which contained a defined mutation at allelic frequencies from 10% down to 0.1%. The wild type and mutant amplicons were serially diluted, sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high-accuracy CCS reads make it possible to accurately detect low-frequency somatic variants.
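As a rough illustration of why a QV40 consensus matters for low-frequency variant calling, the following Python sketch converts a Phred-scaled quality value into a per-base error probability and estimates how likely random errors alone would produce a given number of variant-supporting reads at one position. The function names, the assumption that errors are independent and spread evenly over the three alternative bases, and the binomial model itself are illustrative assumptions, not the analysis used in the abstract above.

```python
import math


def phred_to_error_prob(qv: float) -> float:
    """Convert a Phred quality value to a per-base error probability.

    By definition QV = -10 * log10(p), so p = 10 ** (-QV / 10).
    """
    return 10 ** (-qv / 10)


def log_binom_pmf(n: int, i: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
    )


def prob_at_least_k(n_reads: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n_reads, p), summed in log space."""
    total = sum(math.exp(log_binom_pmf(n_reads, i, p)) for i in range(k, n_reads + 1))
    return min(1.0, total)


def false_call_prob(n_reads: int, allele_fraction: float, qv: float) -> float:
    """Chance that random errors alone mimic a variant at `allele_fraction`.

    Assumption (hypothetical model): each read independently shows one
    specific wrong base with probability error / 3.
    """
    p_specific = phred_to_error_prob(qv) / 3
    k = max(1, math.ceil(allele_fraction * n_reads))
    return prob_at_least_k(n_reads, k, p_specific)


if __name__ == "__main__":
    # Compare a QV20 read with a QV40 consensus read at a 0.1% threshold.
    for qv in (20, 40):
        print(qv, false_call_prob(n_reads=10_000, allele_fraction=0.001, qv=qv))
```

Under this simplified model, the probability of random errors reaching a fixed allele-fraction threshold drops steeply as the quality value rises, which is the intuition behind pairing a random error profile with high-accuracy CCS reads.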
SMRT Sequencing of full-length androgen receptor isoforms in prostate cancer reveals previously hidden drug resistant variants
Prostate cancer is the most frequently diagnosed male cancer. For prostate cancer that has progressed to an advanced or metastatic stage, androgen deprivation therapy (ADT) is the standard of care. ADT inhibits activity of the androgen receptor (AR), a master regulator transcription factor in normal and cancerous prostate cells. The major limitation of ADT is the development of castration-resistant prostate cancer (CRPC), which is almost invariably due to transcriptional re-activation of the AR. One mechanism of AR transcriptional re-activation is expression of AR-V7, a truncated, constitutively active AR variant (AR-V) arising from alternative AR pre-mRNA splicing. Notably, AR-V7 is being developed as a predictive biomarker of primary resistance to AR-targeted therapies in CRPC. Multiple additional AR-V species are expressed in clinical CRPC, but the extent to which these may be co-expressed with AR-V7 or predict resistance is not known.
Detection of somatic mutations, especially in heterogeneous tumor samples where variants may be present at a low level, is challenging. Single Molecule, Real-Time (SMRT) Sequencing is ideal for minor variant detection because of its ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach.
Tremendous flexibility is maintained in the human proteome via alternative splicing, and cancer genomes often subvert this flexibility to promote survival. Identification and annotation of cancer-specific mRNA isoforms is critical to understanding how mutations in the genome affect the biology of cancer cells. While microarrays and other NGS-based methods have become useful for studying transcriptomes, these technologies yield short, fragmented transcripts that remain a challenge for accurate, complete reconstruction of splice variants. In cancer proteomics studies, the identification of biomarkers from mass spectrometry data is often limited by incomplete gene isoform expression information to support protein-to-transcript mapping. The Iso-Seq protocol developed at PacBio offers the only solution for direct sequencing of full-length, single-molecule cDNA sequences needed to discover biomarkers for early detection and cancer stratification, to fully characterize gene fusion events, and to elucidate drug resistance mechanisms. Knowledge of the complete isoform repertoire is also key for accurate quantification of isoform abundance. As most transcripts range from 1–10 kb, fully intact RNA molecules can be sequenced using SMRT® Sequencing without requiring fragmentation or post-sequencing assembly. However, some cancer research applications have presented a challenge for the Iso-Seq protocol, due to the combination of limited sample input and the need to deeply sequence heterogeneous samples. Here we report the optimization of the Iso-Seq library preparation protocol for the PacBio Sequel platform and its application to cancer cell lines and tumor samples. We demonstrate how loading enhancements on the higher-throughput Sequel instrument have decreased the need for size fractionation steps, reducing sample input requirements while simultaneously simplifying the sample preparation workflow and increasing the number of full-length transcripts per SMRT Cell.
Scalability and reliability improvements to the Iso-Seq analysis pipeline enable higher throughput sequencing of full-length cancer transcripts
The characterization of gene expression profiles via transcriptome sequencing has proven to be an important tool for characterizing how genomic rearrangements in cancer affect the biological pathways involved in cancer progression and treatment response. More recently, better resolution of transcript isoforms has shown that this additional level of information may be useful in stratifying patients into cancer subtypes with different outcomes and responses to treatment.1 The Iso-Seq protocol developed at PacBio is uniquely able to deliver full-length, high-quality cDNA sequences, allowing the unambiguous determination of splice variants, identifying potential biomarkers and yielding new insights into gene fusion events. Recent improvements to the Iso-Seq bioinformatics pipeline increase the speed and scalability of data analysis while boosting the reliability of isoform detection and cross-platform usability. Here we report evaluation of Sequel Iso-Seq runs of human UHRR samples with spiked-in synthetic RNA controls and show that the new pipeline is more CPU efficient and recovers more human and synthetic isoforms while reducing the number of false positives. We also share the results of sequencing the well-characterized HCC-1954 breast cancer and normal breast cell lines, which will be made publicly available. Combined with the recent simplification of the Iso-Seq sample preparation2, the new analysis pipeline completes a streamlined workflow for revealing the most comprehensive picture of transcriptomes at the throughput needed to characterize cancer samples.
Allelic specificity of immunoglobulin heavy chain (IGH@) translocation in B-cell acute lymphoblastic leukemia (B-ALL) unveiled by long-read sequencing
Oncogenic fusion of IGH-DUX4 has recently been reported as a hallmark that defines a B-ALL subtype present in up to 7% of adolescent and young adult B-ALL. The translocation of DUX4 into IGH results in aberrant activation of DUX4 by hijacking the intronic IGH enhancer (Eµ). How IGH-DUX4 translocation interplays with IGH allelic exclusion has never been explored. We investigated this in the Nalm6 B-ALL cell line, using long-read (PacBio Iso-Seq method and 10X Chromium WGS), short-read (Illumina total stranded RNA and WGS), epigenome (H3K27ac ChIP-seq, ATAC-seq) and 3-D genome (Hi-C, H3K27ac HiChIP, Capture-C) data.
The expression of androgen receptor (AR) variants is a frequent, yet poorly understood mechanism of clinical resistance to AR-targeted therapy for castration-resistant prostate cancer (CRPC). Among the multiple AR variants expressed in CRPC, AR-V7 is considered the most clinically relevant AR variant due to broad expression in CRPC, correlations of AR-V7 expression with clinical resistance, and growth inhibition when AR-V7 is knocked down in CRPC models. Therefore, efforts are under way to develop strategies for monitoring and inhibiting AR-V7 in CRPC. The aim of this study was to understand whether other AR variants are co-expressed with AR-V7 and promote resistance to AR-targeted therapies. To test this, we utilized RNA-seq to characterize AR expression in CRPC models. RNA-seq revealed the frequent co-expression of AR-V9 and AR-V7 in multiple CRPC models and metastases. Furthermore, long-read single-molecule real-time (SMRT) sequencing of AR isoforms revealed that AR-V7 and AR-V9 shared a common 3′ terminal cryptic exon. To test this, we knocked down AR-V7 in prostate cancer cell lines and confirmed that AR-V9 mRNA and protein expression were also impacted. In reporter assays with AR-responsive promoters, AR-V9 functioned as a constitutive activator of androgen/AR signaling. Similarly, infection of LNCaP cells with an AR-V9 lentiviral construct induced androgen-independent cell proliferation. In conclusion, these data implicate co-expression of AR-V9 with AR-V7 as an important component of constitutive AR signaling and therapeutic resistance in CRPC.
Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements
Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes, including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering > 30 Gb per SMRT Cell 1M. By integrating long- and short-read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.
Structural variant detection with long read sequencing reveals driver and passenger mutations in a melanoma cell line
Past large-scale cancer genome sequencing efforts, including The Cancer Genome Atlas and the International Cancer Genome Consortium, have utilized short-read sequencing, which is well-suited for detecting single nucleotide variants (SNVs) but far less reliable for detecting variants larger than 20 base pairs, including insertions, deletions, duplications, inversions and translocations. Recent same-sample comparisons of short- and long-read human reference genome data have revealed that short-read resequencing typically uncovers only ~4,000 structural variants (SVs, ≥50 bp) per genome and is biased towards deletions, whereas sequencing with PacBio long reads consistently finds ~20,000 SVs, evenly balanced between insertions and deletions. This discovery has important implications for cancer research, as it is clear that SVs are both common and biologically important in many cancer subtypes, including colorectal, breast and ovarian cancer. Without confident and comprehensive detection of structural variants, it is unlikely we have a sufficiently complete picture of all the genomic changes that impact cancer development, disease progression, treatment response, drug resistance, and relapse. To begin to address this unmet need, we have sequenced the COLO829 tumor and matched normal lymphoblastoid cell lines to 49- and 51-fold coverage, respectively, with PacBio SMRT Sequencing, with the goal of developing a high-confidence structural variant call set that can be used to empirically evaluate cost-effective experimental designs for larger scale studies and develop structural variation calling software suitable for cancer genomics. Structural variant calling revealed over 21,000 deletions and 19,500 insertions larger than 20 bp, nearly four times the number of events detected with short-read sequencing. The vast majority of events are shared between the tumor and normal, with about 100 putative somatic deletions and 400 insertions, primarily in microsatellites.
A further 40 rearrangements were detected, nearly exclusively in the tumor. One rearrangement is shared between the tumor and normal, t(5;X), which disrupts the mismatch repair gene MSH3 and is likely a driver mutation. Generating high-confidence call sets that cover the entire size spectrum of somatic variants from a range of cancer model systems is the first step in determining what will be the best approach for addressing an ongoing blind spot in our current understanding of cancer genomes. Here the application of PacBio sequencing to a melanoma cancer cell line revealed thousands of previously overlooked variants, including a mutation likely involved in tumorigenesis.
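The tumor-versus-normal comparison described in this abstract can be sketched in Python as a simple filter that keeps tumor calls with no matching call in the matched normal. This is only a minimal sketch: the `SV` record, the `same_event` matching rule, and the distance and length-ratio tolerances are hypothetical choices made for illustration, not the structural variant calling software or thresholds used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SV:
    """A simplified structural variant call (hypothetical record layout)."""
    chrom: str
    pos: int
    svtype: str  # "DEL" or "INS"
    length: int  # event size in bp, assumed positive


def same_event(a: SV, b: SV, max_dist: int = 100, min_len_ratio: float = 0.8) -> bool:
    """Treat two calls as the same event if they agree on chromosome and type,
    lie within `max_dist` bp of each other, and have similar lengths.
    Both tolerances are illustrative assumptions."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    shorter, longer = sorted((a.length, b.length))
    if longer == 0:
        return True
    return shorter / longer >= min_len_ratio


def putative_somatic(tumor: List[SV], normal: List[SV], min_len: int = 20) -> List[SV]:
    """Return tumor calls larger than `min_len` bp with no match in the normal."""
    return [
        t for t in tumor
        if t.length > min_len and not any(same_event(t, n) for n in normal)
    ]


if __name__ == "__main__":
    tumor_calls = [
        SV("chr1", 1_000, "DEL", 300),  # also present in the normal
        SV("chr2", 5_000, "INS", 60),   # tumor only
        SV("chr3", 10, "DEL", 10),      # below the size cutoff
    ]
    normal_calls = [SV("chr1", 1_020, "DEL", 290)]
    for sv in putative_somatic(tumor_calls, normal_calls):
        print(sv)
```

A production workflow would also need to handle breakpoint uncertainty in repetitive sequence such as microsatellites, where most of the putative somatic events in this abstract were found, so a fixed-distance match like the one above should be read as a starting point only.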