Stanford Team Finds Novel Transcripts Using Long-Read Isoform Sequencing
Thursday, October 17, 2013
An advance online publication in Nature Biotechnology from Michael Snyder’s lab at Stanford University demonstrates the utility of long-read sequencing for assessing transcribed regions across the human genome. Long PacBio reads were able to cover full-length RNA molecules from end to end, characterizing genetic regions that had not previously been annotated.
The paper, entitled “A single-molecule long-read survey of the human transcriptome,” applies Single Molecule, Real-Time (SMRT®) Sequencing to the study of RNA and compares the results with those from libraries sequenced on a 454® instrument. The scientists sequenced cDNA synthesized from pooled RNA gathered from 20 human organs and tissues in order to identify as many transcript isoforms as possible.
The purpose of this effort was to find a better alternative to existing RNA-seq approaches, which have so far relied on short-read sequencers. “It is difficult to identify full-length transcript isoforms using short reads,” the authors write. “Thus, a complete understanding of all spliced RNAs within a transcriptome is not yet possible and can be inferred only from a patchwork of short fragments.” The amplification and cDNA fragmentation required by these short-read technologies may also contribute to the difficulty of representing full transcripts, the authors note.
Sharon, Tilgner et al. set out to evaluate the PacBio® sequencing platform to see if its industry-leading read lengths and lack of amplification could be used to sequence full isoforms. “Given sufficient material, amplification-free sequencing of full-length cDNA molecules provides a more direct view of RNA molecules,” they write.
In this paper, the scientists report sequencing the transcriptome using circular consensus sequencing (CCS) reads, noting that these reads were of high quality and often covered the full cDNA template at least twice to include both forward and reverse DNA strands. Testing to determine how much of each isoform was represented by these reads found that “the majority of CCS reads represent all introns of the original transcript, including most of the 5′ exons,” according to the publication. After examining how closely the sequencing data captured transcript start and end sites, the authors found that the median distance to the 3′ end was just 6 bp, and the median distance to the 5′ end was 47 bp.
The scientists compared these results to sequences generated on the 454 platform, noting that this sequencer’s reads average 522 bp and “usually do not cover entire RNA molecules.” Using the 454 platform, the median distance to the 3′ end was 281 bp, while the median distance to the 5′ end was 626 bp. They add, “CCS reads exhibited more constant quality values along the read than 454 reads.”
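To make the end-distance figures concrete, here is a minimal Python sketch of how a median distance between read ends and annotated transcript ends could be computed. It is an illustration only, not the authors’ pipeline: the function names, the dictionary layout, and the toy coordinates are all assumptions made up for this example, and the paper may define the distances differently.

```python
from statistics import median

def end_distances(read, transcript):
    """Return (dist_5p, dist_3p) in bp between a read alignment and the
    annotated transcript it was assigned to. Coordinates are genomic;
    strand decides which genomic end is the 5' end. (Hypothetical layout.)"""
    left = abs(read["start"] - transcript["start"])   # gap at the left genomic end
    right = abs(transcript["end"] - read["end"])      # gap at the right genomic end
    if transcript["strand"] == "+":
        return left, right   # 5' end is on the left for plus-strand transcripts
    return right, left       # 5' end is on the right for minus-strand transcripts

def median_end_distances(pairs):
    """Median 5' and 3' distances over (read, transcript) pairs."""
    d5, d3 = zip(*(end_distances(r, t) for r, t in pairs))
    return median(d5), median(d3)

# Toy data invented for illustration; not taken from the paper.
TOY_PAIRS = [
    ({"start": 150, "end": 1995}, {"start": 100, "end": 2000, "strand": "+"}),
    ({"start": 5010, "end": 6960}, {"start": 5000, "end": 7000, "strand": "-"}),
    ({"start": 320, "end": 900}, {"start": 300, "end": 908, "strand": "+"}),
]

if __name__ == "__main__":
    med5, med3 = median_end_distances(TOY_PAIRS)
    print(f"median 5' distance: {med5} bp, median 3' distance: {med3} bp")
```

A smaller median means reads tend to stop closer to the annotated end, which is the sense in which 6 bp and 47 bp for CCS reads compare favorably with 281 bp and 626 bp for 454 reads.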
Ultimately, the most important measure of success lies in transcript coverage. “Comparison with the high-quality GENCODE 15 annotation of the human transcriptome revealed many unannotated transcripts and isoform structures within the CCS data set and provided a more comprehensive assessment of the true complexity of the transcriptome,” they write.
The authors identified some 14,000 spliced GENCODE genes and estimated that 10 percent of the alignments “represent unannotated transcripts,” many of which included features associated with long noncoding RNAs. They found that PacBio CCS reads routinely cover cDNAs up to 1.5 kb and in many cases cover cDNAs as large as 2.5 kb. “Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level,” the authors write.
Previous studies have also reported success using PacBio single-pass long reads instead of CCS to sequence full-length transcripts ranging from 500 bp to over 6,000 bp. These include a study presented by Elizabeth Tseng and Jason Underwood at this year’s ABRF conference, which describes a complete pipeline for characterizing full-length transcript isoforms from total RNA. Their method included two approaches for capturing full-length transcripts, followed by size-selected library prep and sequencing. The study also describes an informatics pipeline for identifying full-length transcripts from single-pass reads, followed by reference mapping to identify novel gene isoforms.
With even longer read lengths announced with newer chemistries, the feasibility of using single-molecule SMRT Sequencing for isoform sequencing will continue to increase. In a Nature Reviews Genetics article from 2011, Jeff Martin and Zhong Wang from JGI noted that “PacBio sequencers are capable of sequencing a single transcript to its full length in a single read,” and that as the technology continues to improve, “the future of transcriptome assembly will be ‘no assembly required.’” Perhaps the future is now….