Quality Statement

Pacific Biosciences is committed to providing high-quality products that meet customer expectations and comply with regulations. We will achieve these goals by adhering to and maintaining an effective quality-management system designed to ensure product quality, performance, and safety.


Image Use Agreement

By downloading, copying, or making any use of the images located on this website (“Site”) you acknowledge that you have read and understand, and agree to, the terms of this Image Usage Agreement, as well as the terms provided on the Legal Notices webpage, which together govern your use of the images as provided below. If you do not agree to such terms, do not download, copy or use the images in any way, unless you have written permission signed by an authorized Pacific Biosciences representative.

Subject to the terms of this Agreement and the terms provided on the Legal Notices webpage (to the extent they do not conflict with the terms of this Agreement), you may use the images on the Site solely for (a) editorial use by press and/or industry analysts, (b) in connection with a normal, peer-reviewed, scientific publication, book or presentation, or the like. You may not alter or modify any image, in whole or in part, for any reason. You may not use any image in a manner that misrepresents the associated Pacific Biosciences product, service or technology or any associated characteristics, data, or properties thereof. You also may not use any image in a manner that denotes some representation or warranty (express, implied or statutory) from Pacific Biosciences of the product, service or technology. The rights granted by this Agreement are personal to you and are not transferable by you to another party.

You, and not Pacific Biosciences, are responsible for your use of the images. You acknowledge and agree that any misuse of the images or breach of this Agreement will cause Pacific Biosciences irreparable harm. Pacific Biosciences is either an owner or licensee of the image, and not an agent for the owner. You agree to give Pacific Biosciences a credit line as follows: "Courtesy of Pacific Biosciences of California, Inc., Menlo Park, CA, USA" and also include any other credits or acknowledgments noted by Pacific Biosciences. You must include any copyright notice originally included with the images on all copies.


You agree that Pacific Biosciences may terminate your access to and use of the images located on the PacificBiosciences.com website at any time and without prior notice, if it considers you to have violated any of the terms of this Image Use Agreement. You agree to indemnify, defend and hold harmless Pacific Biosciences, its officers, directors, employees, agents, licensors, suppliers and any third party information providers to the Site from and against all losses, expenses, damages and costs, including reasonable attorneys' fees, resulting from any violation by you of the terms of this Image Use Agreement or Pacific Biosciences' termination of your access to or use of the Site. Termination will not affect Pacific Biosciences' rights or your obligations which accrued before the termination.

I have read and understand, and agree to, the Image Usage Agreement.

I disagree and would like to return to the Pacific Biosciences home page.

Pacific Biosciences

Updated! Data Release: Human MCF-7 Transcriptome

Thursday, June 11, 2015


Our R&D team has added a new dataset for the MCF-7 human breast cancer transcriptome, originally released in 2013. The new results were produced using 28 SMRT® Cells with 4-hour movies and P5-C3 chemistry. Sizing was performed with the SageELF™ platform (fractions collected: 1-2 kb, 2-3 kb, 3-5 kb, and 5-10 kb). Sequencing of the larger fractions with our newer sequencing chemistry that generates longer reads added longer transcripts (up to 10 kb) to the MCF-7 dataset, which previously had only transcripts up to 4 kb.

New FASTA and GFF files are available, representing the new combined dataset. Raw data for both the 2013 and 2015 sequencing is also available.

ORIGINAL POST (December 11, 2013):

Understanding the biology of a genome requires knowing the full complement of mRNA isoforms. In recent years, microarrays, high-throughput cDNA sequencing, and RNA-seq have become very useful tools for studying transcriptomes. High-throughput cDNA sequencing is accurate but laborious, while the inherently complex nature of the transcriptome makes transcript assembly computationally intractable. Recently, Steijger et al. (1) showed that complete isoform reconstruction from RNA-seq short-read data remains challenging even when all constituent exons are identified.

A number of recent publications have demonstrated the utility of full-length transcript sequencing by taking advantage of the long read lengths of SMRT® Sequencing technology (2)–(4). SMRT Sequencing produces reads that originate from independent observations of single molecules; no assembly is needed if a read spans the entire length of the transcript. To demonstrate the capabilities of PacBio® Isoform Sequencing (Iso-Seq) technology and show a glimpse of the complexity of eukaryotic transcriptomes, we generated a deep dataset of full-length cDNA sequencing of RNA from MCF-7, a human breast cancer cell line. The sequencing data was collected from several internal training sessions where different library preparation techniques were tested. We are releasing the underlying data in an effort to aid the design of future PacBio Iso-Seq experiments and to spur advances in the development of bioinformatics tools for analyzing full-length transcripts. 

In our final dataset, we obtained 44,531 non-redundant transcript-length consensus sequences ranging from 400 bp – 4,900 bp, with an average length of 1,929 bp (Fig. 1a). The total percentage of consensus bases that disagreed with the hg19 genome is 0.27%, out of which 0.16% are due to substitutions and thus could likely be true SNPs (Fig. 1b). About half of the transcribed loci have one observed isoform, while the rest have mostly 2-5 isoforms (Fig. 2). We compared our predicted full-length transcripts against the known annotations and found that we were able to recover full-length alternative splice forms (Fig. 3), alternative polyadenylation, novel transcripts, and known fusion genes (Fig. 4). We encourage interested researchers to explore the dataset.
Materials & Methods

Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on Sample Net. SMRTbell libraries were sequenced using the P4-C2 sequencing chemistry with 2-hour movies.

After sequencing, we computationally determined the completeness of the sequences using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, we developed a de novo, isoform-level clustering algorithm that uses only PacBio data. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using Quiver and filtered to create the final polished, full-length, non-redundant dataset. Details of the clustering algorithm will be described in two upcoming webinars on Wednesday, January 22 at 8 AM PST and 5 PM PST.  

Some statistics from the sequencing and results are listed below:

  • Number of SMRT Cells: 119
  • no-size selection: 12
  • 1-2 kb: 37
  • 2-3 kb: 37
  • > 3 kb: 33
  • Total number of post-filtered bases: 14,062,161,755

 Figure 1.  (a) Length distribution of polished, non-redundant transcript sequences. Each transcript sequence represents a unique isoform. (b) Breakdown of differences to hg19. Consensus sequences were mapped to hg19 using GMAP (version 2013-07-20) with default parameters. Different error categories were aggregated over all 44,531 transcript sequences. Some errors are likely to be due to real biological differences from the reference sequence.

 Figure 2. Number of isoforms per loci. Transcripts that overlap on the genomic coordinate by 1 bp are grouped together to form non-overlapping transcribed loci. Total number of loci: 14,385. The majority (61%) of transcribed loci have only 1 or 2 transcripts while 0.6% of them have 20 or more isoforms. This is consistent with other studies of full-length cDNA sequencing of a single sample type [5].

Figure 3. UCSC browser screenshot of the CREM gene region. PacBio transcripts (top, red) capture multiple isoforms of the CREM gene, including alternatively spliced exons and alternative poly adenylation sites.

Figure 4.  Known cancer fusion gene BCAS4/BCAS3 identified. PacBio transcripts (top, red) show three different fusion variants of the BCAS4/BCAS3 genes. All three variants contain a portion of the 5’ region of the BCAS4 gene (chr20q13) and a portion of the 3’ region of the BCAS3 gene (chr17q23).

We welcome researchers to download and use the dataset for their research. For citation of the dataset, please use:

MCF-7 transcriptome sequence data was generated by Pacific Biosciences, Menlo Park, California, and additional information about the sequencing and assembly is provided at https://www.pacb.com/blog/2013/12/data-release-human-mcf-7-transcriptome.html and http://datasets.pacb.com.s3.amazonaws.com/2013/IsoSeqHumanMCF7Transcriptome/list.html. The data used in the present study was retrieved from PacBio’s online database at http://datasets.pacb.com.s3.amazonaws.com/2013/IsoSeqHumanMCF7Transcriptome/list.html/ (date of retrieval).


  1. T. Steijger, J. F. Abril, P. G. Engström, et. al., “Assessment of transcript reconstruction methods for RNA-seq,” Nat. Methods, vol. 10, no. 12, pp. 1177–1184, Nov. 2013.
  2. D. Sharon, H. Tilgner, F. Grubert, and M. Snyder, “A single-molecule long-read survey of the human transcriptome,” Nat. Biotechnol., vol. 31, no. 11, pp. 1009–1014, Nov. 2013.
  3. W. Zhang, P. Ciclitira, and J. Messing, “PacBio sequencing of gene families-a case study with wheat gluten genes,” Gene, 2013.
  4. K. F. Au, V. Sebastiano, P. T. Afshar, J. D. Durruthy, L. Lee, B. A. Williams, H. van Bakel, E. E. Schadt, R. A. Reijo-Pera, J. G. Underwood, and W. H. Wong, “Characterization of the human ESC transcriptome by hybrid sequencing,” Proc. Natl. Acad. Sci. U. S. A., Nov. 2013.

    Subscribe for blog updates: