June 1, 2021

Full-length cDNA sequencing of alternatively spliced isoforms provides insight into human diseases.

The majority of human genes are alternatively spliced, making it possible for most genes to generate multiple proteins. The process of alternative splicing is highly regulated in a developmental-stage and tissue-specific manner. Perturbations in the regulation of these events can lead to disease in humans. Alternative splicing has been shown to play a role in human cancer, muscular dystrophy, Alzheimer’s, and many other diseases. Understanding these diseases requires knowing the full complement of mRNA isoforms. Microarrays and high-throughput cDNA sequencing have become highly successful tools for studying transcriptomes; however, these technologies only provide small fragments of transcripts, and building complete transcript isoforms has been very challenging. We have developed the Iso-Seq technique, which is capable of sequencing full-length, single-molecule cDNA sequences. The method employs SMRT Sequencing to sequence individual molecules, with read lengths that average more than 10 kb and can reach 40 kb. Because most transcripts are 1 to 10 kb long, we can sequence through entire RNA molecules without fragmentation or post-sequencing assembly. Jointly with the sequencing method, we developed a computational pipeline that polishes these full-length transcript sequences into high-quality, non-redundant transcript consensus sequences. Iso-Seq sequencing enables unambiguous identification of alternative splicing events, alternative transcriptional start and poly-A sites, and transcripts from gene fusion events. Knowledge of the complete set of isoforms from a sample of interest is key for accurate quantification of isoform abundance when using any technology for transcriptome studies. Here we characterize the full-length transcriptome of normal human tissues, paired tumor/normal samples from breast cancer, and a brain sample from a patient with Alzheimer’s using deep Iso-Seq sequencing. We highlight numerous discoveries of novel alternatively spliced isoforms, gene fusion events, and previously unannotated genes that will improve our understanding of human diseases.


June 1, 2021

Full-length cDNA sequencing of alternatively spliced isoforms provides insight into human cancer

The majority of human genes are alternatively spliced, making it possible for most genes to generate multiple proteins. The process of alternative splicing is highly regulated in a developmental-stage and tissue-specific manner. Perturbations in the regulation of these events can lead to disease in humans (1). Alternative splicing has been shown to play a role in human cancer, muscular dystrophy, Alzheimer’s, and many other diseases. Understanding these diseases requires knowing the full complement of mRNA isoforms. Microarrays and high-throughput cDNA sequencing have become highly successful tools for studying transcriptomes; however, these technologies only provide small fragments of transcripts, and building complete transcript isoforms has been very challenging (2). We have developed a technique, called Iso-Seq sequencing, that is capable of sequencing full-length, single-molecule cDNA sequences. The method employs SMRT Sequencing from PacBio, which can sequence individual molecules with read lengths that average more than 10 kb and can reach 40 kb. Because most transcripts are 1 to 10 kb long, we can sequence through entire RNA molecules without fragmentation or post-sequencing assembly. Jointly with the sequencing method, we developed a computational pipeline that polishes these full-length transcript sequences into high-quality, non-redundant transcript consensus sequences. Iso-Seq sequencing enables unambiguous identification of alternative splicing events, alternative transcriptional start and poly-A sites, and transcripts from gene fusion events. Knowledge of the complete set of isoforms from a sample of interest is key for accurate quantification of isoform abundance when using any technology for transcriptome studies (3). Here we characterize the full-length transcriptome of paired tumor/normal samples from breast cancer using deep Iso-Seq sequencing.
We highlight numerous discoveries of novel alternatively spliced isoforms, gene fusion events, and previously unannotated genes that will improve our understanding of human cancer.

(1) Faustino NA, Cooper TA. Genes Dev. 2003;17:419-437.
(2) Steijger T, et al. Nat Methods. 2013 Dec;10(12):1177-84.
(3) Au KF, et al. Proc Natl Acad Sci U S A. 2013 Dec 10;110(50):E4821-30.


June 1, 2021

Comprehensive genome and transcriptome structural analysis of a breast cancer cell line using PacBio long read sequencing

Genomic instability is one of the hallmarks of cancer, leading to widespread copy number variations, chromosomal fusions, and other structural variations. The breast cancer cell line SK-BR-3 is an important model for HER2+ breast cancers, which are among the most aggressive forms of the disease and affect one in five cases. Through short-read sequencing, copy number arrays, and other technologies, the genome of SK-BR-3 is known to be highly rearranged with many copy number variations, including an approximately twenty-fold amplification of the HER2 oncogene. However, these technologies cannot precisely characterize the nature and context of the identified genomic events, and other important mutations may be missed altogether because of repeats, multi-mapping reads, and the failure to reliably anchor alignments to both sides of a variation. To address these challenges, we have sequenced SK-BR-3 using PacBio long-read technology. Using the new P6-C4 chemistry, we generated more than 70X coverage of the genome with average read lengths of 9-13 kb (max: 71 kb). Using Lumpy for split-read alignment analysis, as well as our novel assembly-based algorithms for finding complex variants, we have developed a detailed map of structural variations in this cell line. Taking advantage of the newly identified breakpoints and combining these with copy number assignments, we have developed an algorithm to reconstruct the mutational history of this cancer genome. From this we have discovered a complex series of nested duplications and translocations between chr17 and chr8, two of the most frequent translocation partners in primary breast cancers, resulting in amplification of HER2. We have also carried out full-length transcriptome sequencing using PacBio’s Iso-Seq technology, which has revealed a number of previously unrecognized gene fusions and isoforms.
Combining long-read genome and transcriptome sequencing technologies enables an in-depth analysis of how changes in the genome affect the transcriptome, including how gene fusions are created across multiple chromosomes. This analysis has established the most complete cancer reference genome available to date, and is already opening the door to applying long-read sequencing to patient samples with complex genome structures.


June 1, 2021

Cogent: Reconstructing the coding genome from full-length transcriptome sequences

For highly complex and large genomes, producing a well-annotated genome can be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually relies on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding the computational challenges that short-read transcript assembly brings. The use of single molecule, real-time sequencing from Pacific Biosciences to sequence transcriptomes (the Iso-Seq method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to a cuttlefish Iso-Seq dataset, for which there is a highly fragmented, Illumina-based draft genome assembly and little annotation. We show that Cogent successfully discovers gene families and can reconstruct the coding region of gene loci. The reconstructed contigs can then be used to visualize alternative splicing events, identify minor variants, and even improve genome assemblies.
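The partitioning step described above can be illustrated with a minimal sketch. This is not Cogent's implementation: the Jaccard similarity on k-mer sets, the union-find clustering, the function names, the values of `k` and `threshold`, and the toy sequences are all assumptions chosen for illustration.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def partition_into_families(transcripts, k=8, threshold=0.3):
    """Group transcripts whose k-mer sets are similar (hypothetical helper).

    Transcripts are linked when their Jaccard similarity reaches the
    threshold; linked transcripts are merged with a simple union-find.
    """
    names = list(transcripts)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    profiles = {name: kmer_set(transcripts[name], k) for name in names}
    for a, b in combinations(names, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())


# Toy data (invented): two isoforms of one gene, one with exon 2 skipped,
# plus an unrelated transcript.
EXON1 = "ATGGCCATTGTAATGGGCCGC"
EXON2 = "TGTTCAAGGACCTGAAGTCC"
EXON3 = "GGATCCGTTAACCTGAGTAA"
transcripts = {
    "geneA_iso1": EXON1 + EXON2 + EXON3,
    "geneA_iso2": EXON1 + EXON3,
    "geneB_iso1": "CTCTAGAGTTTGGGAAACCCTTTAAACGCGATATCG",
}
families = partition_into_families(transcripts)
print(families)
```

In this toy case the exon-skipping isoform shares most of its k-mers with the full isoform, so the two fall into one family while the unrelated transcript stays on its own. An all-pairs comparison like this would not scale to a real transcriptome; it is only meant to show the idea.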


June 1, 2021

Reconstruction of the spinach coding genome using full-length transcriptome without a reference genome

For highly complex and large genomes, producing a well-annotated genome can be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually relies on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding the computational challenges that short-read transcript assembly brings. The use of single molecule, real-time sequencing from PacBio to sequence transcriptomes (the Iso-Seq method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a high-quality reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to the Iso-Seq data for spinach, Spinacia oleracea, for which there is also a PacBio-based draft genome to validate the reconstruction. The Iso-Seq dataset consists of 68,263 full-length, Quiver-polished transcript sequences ranging from 528 bp to 6 kbp long (mean: 2.1 kbp). Using the genome mapping as ground truth, we found that 95% (8045/8446) of the Cogent gene families corresponded to a single genomic locus. Families that contained multiple loci often consisted of homologous genes that would be categorized as belonging to the same gene family. Coding genome reconstruction was then performed individually for each gene family.
A total of 86% (7283/8446) of the gene families were resolved to a single contig by Cogent and were validated to also be a single contig in the genome. In 59 cases, Cogent reconstructed a single contig that corresponded to 2 or more loci in the genome, suggesting possible scaffolding opportunities. In 24 cases, the transcripts had no hits to the genome, though Pfam and BLAST searches of the transcripts showed that they were indeed coding, suggesting that the genome is missing certain coding portions. Given the high quality of the spinach genome, we were not surprised to find that Cogent only marginally improved the genome space. However, the ability of Cogent to accurately identify gene families and reconstruct the coding genome in a de novo fashion shows that it will be extremely powerful when applied to datasets for which there is no reference genome or only a low-quality one.
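The percentages quoted above follow directly from the reported counts. The short check below uses only the numbers given in the abstract; the variable names are illustrative.

```python
# Counts as reported in the abstract.
total_families = 8446
single_locus_families = 8045    # families mapping to a single genomic locus
single_contig_families = 7283   # families resolved to one validated contig


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


single_locus_pct = percent(single_locus_families, total_families)
single_contig_pct = percent(single_contig_families, total_families)

print(f"single locus:  {single_locus_pct:.1f}%")
print(f"single contig: {single_contig_pct:.1f}%")
```

Both values round to the figures stated in the text (95% and 86%).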


June 1, 2021

Haplotyping using full-length transcript sequencing reveals allele-specific expression

An important need in analyzing complex genomes is the ability to separate and phase haplotypes. While whole genome assembly can deliver this information, it cannot reveal whether there is allele-specific gene or isoform expression. The PacBio Iso-Seq method, which can produce high-quality transcript sequences of 10 kb and longer, has been used to annotate many important plant and animal genomes. We present an algorithm called IsoPhase that post-processes Iso-Seq data for transcript-based haplotyping. We applied IsoPhase to a maize Iso-Seq dataset consisting of two homozygous parents and two F1 cross hybrids. We validated the majority of the SNPs called with IsoPhase against matching short read data and identified cases of allele-specific, gene-level and isoform-level expression.
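The general idea of transcript-based haplotyping can be sketched as follows. This is a toy illustration and not the IsoPhase algorithm, whose details are not given here: the assumption that reads are already aligned and of equal length, the minor-allele fraction cutoff, the function names, and the example reads are all invented for illustration.

```python
from collections import Counter


def call_variant_positions(reads, min_minor_fraction=0.2):
    """Return positions where the second most common base is frequent enough.

    Assumes all reads are pre-aligned strings of equal length (a
    simplification); rare mismatches are treated as sequencing errors.
    """
    positions = []
    for i in range(len(reads[0])):
        counts = Counter(read[i] for read in reads)
        if len(counts) > 1:
            minor_count = sorted(counts.values())[-2]
            if minor_count / len(reads) >= min_minor_fraction:
                positions.append(i)
    return positions


def count_haplotypes(reads, positions):
    """Count reads supporting each combination of alleles at the positions."""
    return Counter("".join(read[i] for i in positions) for read in reads)


# Toy reads (invented): two haplotypes differing at positions 2 and 7,
# plus one read carrying an isolated error at position 4.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACCTACGAAC",
    "ACCTACGAAC",
    "ACGTTCGTAC",
]
positions = call_variant_positions(reads)
haplotype_counts = count_haplotypes(reads, positions)
print(positions, dict(haplotype_counts))
```

Because each full-length read spans both variant positions, the alleles are phased directly by the read itself, and the per-haplotype read counts give a rough view of allele-specific support for that isoform.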


June 1, 2021

Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements

Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes, including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation, and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in sample preparation and sequencing chemistry for the PacBio Iso-Seq method have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering >30 Gb per SMRT Cell 1M. By integrating long- and short-read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.

