In recent interactions with the scientific community, we’ve seen a growing number of questions around scaffolding genome assemblies. We thought it might be useful to review the concepts behind contigs and scaffolds, as well as the circumstances in which one might want to scaffold a high-quality PacBio genome assembly.
Contigs vs. Scaffolds
Contigs are continuous stretches of sequence containing only A, C, G, or T bases without gaps. SMRT Sequencing has all of the necessary performance characteristics – long reads, lack of sequence-context bias, and high consensus accuracy – to generate contiguous genome assemblies with megabase-sized contigs. Ultra-long contigs provide complete and uninterrupted sequence information across full genes, and more recently even allow separation of the two chromosomes for diploid organisms.¹ The unprecedented quality of PacBio de novo genome assemblies has been described in many publications, such as the gorilla genome assembly with a contig N50 of 9.5 Mb recently featured on the cover of Science.²
Scaffolds are created by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome. Contigs in a scaffold are separated by gaps, which are designated by a variable number of ‘N’ letters. Scaffolding is often used for short-read assemblies to make sense of fragmented genome assemblies containing short contigs. However, scaffolds have three principal deficiencies:
- Scaffolds miss critical information. Gaps represent missing genomic information and, in many cases, these gaps can coincide with important genomic loci. Many promoters and first exons are GC-rich in sequence, often resulting in missing or low-quality sequence reads from short-read or Sanger sequencing. Thus, genes are incompletely resolved, and their regulation cannot be understood. Another reason for gaps in scaffolded assemblies is large, repetitive elements which short-read sequencing methods struggle to bridge. Thus, duplicated genes, genes vs. pseudogenes, short tandem repeats, variable number tandem repeats, microsatellites, and many other structural genomic features are often unresolved in scaffolds.
- The length of a scaffold gap often has no relation to the true gap size. In several reference genomes, gaps are arbitrarily set to certain fixed lengths. For example, most gaps in the zebra finch reference are set to 100 Ns, while in the version 3 maize reference they are set to 1,000 Ns. This means that in most cases, the true length of sequence represented by the gap differs from the set gap size, and is sometimes off by thousands of bases. The uncertainty in scaffold gap sizes makes it impossible to understand the true spatial relationships of functional elements in genomes, and the nominal gap sizes can underestimate the actual extent of missing information.
- Gap-flanking scaffold sequence can be low-quality, and is sometimes completely wrong. The sequences surrounding gaps often fall into areas where short-read technologies have deficiencies due to GC-bias or read-length limitations. This can result in sequence that is of lower quality and, in some cases, completely erroneous. For example, because of complex repeat structures in the human IGH locus, the right edge of a 50,000 N gap in the short-read assembly contains 1,836 bases of flanking sequence that has no support in the hg19 human genome reference or the PacBio assembly. In some ways, having incorrect flanking sequence in scaffolds is worse than having ‘N’ gaps, since that erroneous sequence is considered and included for downstream analyses.
Illustration of the difference between contigs and scaffolds in genome assemblies
The information missed by gapped scaffold assemblies complicates and may preclude downstream analysis and understanding related to functional and comparative genomics. Scaffolded short-read assemblies get nowhere near the quality of PacBio genome assemblies in terms of contiguity and completeness, and they often require labor-intensive follow-up work to close gaps, adding time and cost to projects.
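To make the contig/scaffold distinction concrete, here is a minimal Python sketch that splits a scaffold back into its contigs at runs of ‘N’ and computes an N50 value. The function names (`split_scaffold`, `n50`) and the toy sequence are our own illustrations, not part of any PacBio tool, and the sketch treats every run of N as a gap regardless of whether its length reflects the true gap size.

```python
import re


def split_scaffold(scaffold):
    """Split one scaffold into its gap-free contigs and report gap lengths.

    Every run of 'N' is treated as a gap (hypothetical helper for illustration).
    """
    seq = scaffold.upper()
    contigs = [c for c in re.split(r"N+", seq) if c]
    gap_lengths = [len(g) for g in re.findall(r"N+", seq)]
    return contigs, gap_lengths


def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy scaffold: three contigs joined by two fixed-size 100 N gaps.
toy_scaffold = "ACGTACGTAC" + "N" * 100 + "GGCCTT" + "N" * 100 + "TTAACCGG"
contigs, gap_lengths = split_scaffold(toy_scaffold)

print("contig lengths:", [len(c) for c in contigs])
print("gap lengths:", gap_lengths)
print("contig N50:", n50([len(c) for c in contigs]))
print("scaffold N50:", n50([len(toy_scaffold)]))
```

In this toy case the scaffold length is dominated by the placeholder Ns, which is why contig N50 and scaffold N50 can tell very different stories about the same assembly.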
Scaffolding PacBio assemblies for chromosome-scale genome representations
For even longer-range genomic connectivity, e.g. to bridge the largest segmental duplications and repeat regions, researchers can go a step further by adding scaffolding information to a PacBio assembly, often resulting in telomere-to-telomere, chromosome-scale genome representations. Several methods have been demonstrated to work very well for this purpose, including optical mapping and crosslinking approaches. Check out examples of maize and human genome sequencing to see how chromosome-level scaffolding enables more comprehensive insights.
1. Chin, C.S. et al. (2016) Phased diploid genome assembly with Single Molecule Real-Time Sequencing. bioRxiv.
2. Gordon, D. et al. (2016) Long-read sequence assembly of the gorilla genome. Science. 352 (6281), aae0344.
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that affects ~44 million people worldwide, making it the most common form of dementia. Pathologically it is defined by severe neuronal loss, aggregation of amyloid β (Aβ) in extracellular senile plaques in the brain, and formation of intraneuronal neurofibrillary tangles consisting of hyperphosphorylated tau protein. Studies looking into disease mechanism have shown that changes in gene expression due to alternative splicing likely contribute to the initiation and progression of AD. Hence, efforts have been made to better understand the gene expression changes in the AD brain by sequencing the transcriptome of affected brain regions.
Most transcriptome studies conducted to date have used short-read sequencing technologies, which provide the abundance of transcript reads needed for evaluating expression profiles. However, the ability to accurately identify alternative splicing and the associated expression patterns for different splice isoforms is limited by the short read lengths. Given that the average human gene transcript is several kilobases long, a 150 bp to 300 bp read will fail to span the entire transcript, and assembly will therefore be required. In most cases this process can be very difficult, if not impossible, given the high similarity between expressed gene isoforms.
Recent studies using the PacBio isoform sequencing (Iso-Seq) method demonstrated the advantages of obtaining high-quality, full-length transcript sequences for improving genome annotation [1], [2], identifying fusion cancer genes [3], and discovering novel alternative splicing patterns [4]. Here we apply the Iso-Seq method to an Alzheimer’s disease brain RNA sample. The purpose of releasing the dataset to the public is to provide researchers with a full-length transcriptome reference from which they can develop bioinformatics tools and validate their own findings.
In our final “confident” dataset, we obtained 21,742 high-quality, full-length isoforms, ranging from 352 bp to 9,457 bp with an average length of 3,400 bp, covering 9,313 non-overlapping loci (Fig. 1). The consensus bases that disagreed with the hg38 genome amounted to 0.036% substitutions, 0.08% insertions, and 0.08% deletions, bringing the overall concordance with hg38 to 99.8%. More than half of the transcribed loci have one observed isoform, while most of the rest have about two to five isoforms (Fig. 2). When compared with the reference transcript annotation Gencode v25, more than a third of the isoforms match a reference transcript completely, while the majority of isoforms are possible novel splice forms of a known gene. In addition to the stringent “confident” dataset, we are also releasing a larger, less stringent “promiscuous” dataset. Details on the difference between the two versions can be found in the download section.
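For readers who want to check the summary numbers, the short Python sketch below redoes the arithmetic using only the figures quoted in this post. The variable names are ours, and the calculation assumes the three error classes simply add up to the total discordance.

```python
# Percent of consensus bases disagreeing with hg38, as quoted in the post.
substitution_pct = 0.036
insertion_pct = 0.08
deletion_pct = 0.08

# Assumption: the three error classes are disjoint, so they sum.
discordance_pct = substitution_pct + insertion_pct + deletion_pct
concordance_pct = 100.0 - discordance_pct

# Isoform and locus counts for the "confident" dataset.
isoform_count = 21742
locus_count = 9313
mean_isoforms_per_locus = isoform_count / locus_count

print(f"discordance: {discordance_pct:.3f}%")
print(f"concordance: {concordance_pct:.1f}%")
print(f"mean isoforms per locus: {mean_isoforms_per_locus:.1f}")
```

Rounded to one decimal place, the results agree with the 99.8% concordance quoted above and the 2.3 isoforms per locus reported with Figure 2.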
Library Preparation and Sequencing
An Alzheimer’s Disease Brain total RNA sample was purchased from BioChain. A first-strand cDNA library was generated using the Clontech SMARTer cDNA synthesis kit, followed by size selection using the SageELF™ device by Sage Science, with lanes combined to create five size libraries that roughly correspond to 1-2 kb, 2-3 kb, 3-5 kb, 5-7 kb, and > 7 kb. Sequencing was done using P6-C4 chemistry, with 3-hr movies for the 1-2 kb fraction and 4-hr movies for the remaining fractions. Sequencing was completed in 2015. Download details on the sample preparation procedure.
The standard Iso-Seq pipeline (ToFU version 2.2.3 or equivalent to SMRT Analysis 3.1; for detailed methods see [5]) was used to process the data. Iso-Seq classify generated 1,107,889 full-length non-chimeric (FLNC) reads and 1,929,319 non-full-length (nFL) reads. The reads were then used to generate high-quality, full-length isoforms using ICE followed by Quiver polishing (HQ Quiver isoform consensus). By definition, an HQ Quiver consensus sequence must have at least two supporting full-length reads and a predicted accuracy of ≥ 99%. The HQ Quiver consensus sequences were then aligned to human reference genome hg38 to create a final “confident” dataset of unique isoforms. To create the larger “promiscuous” dataset, additional consensus results that contained only one supporting full-length read were added. For details on the bioinformatics analysis, please see the README file on the Download Page.
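The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the ToFU or SMRT Analysis code: the `ConsensusIsoform` record, its field names, and the example values are hypothetical, the hg38 alignment and collapsing into unique isoforms are left out, and we assume single-read consensus sequences enter the “promiscuous” set without an accuracy cutoff because the post does not state one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsensusIsoform:
    """Hypothetical record for one polished consensus sequence."""
    name: str
    full_length_reads: int      # number of supporting full-length reads
    predicted_accuracy: float   # fraction between 0 and 1


def is_hq(iso, min_full_length_reads=2, min_accuracy=0.99):
    """HQ rule from the post: >= 2 full-length reads and >= 99% accuracy."""
    return (iso.full_length_reads >= min_full_length_reads
            and iso.predicted_accuracy >= min_accuracy)


def build_datasets(isoforms):
    """Return (confident, promiscuous) lists before any genome alignment."""
    confident = [iso for iso in isoforms if is_hq(iso)]
    # Assumption: single-read consensus results are added with no accuracy filter.
    single_read = [iso for iso in isoforms if iso.full_length_reads == 1]
    return confident, confident + single_read


# Made-up example records, for illustration only.
examples = [
    ConsensusIsoform("iso_a", 5, 0.998),
    ConsensusIsoform("iso_b", 2, 0.990),
    ConsensusIsoform("iso_c", 3, 0.970),  # enough reads, accuracy too low
    ConsensusIsoform("iso_d", 1, 0.995),  # single supporting read
]
confident, promiscuous = build_datasets(examples)
print("confident:", [iso.name for iso in confident])
print("promiscuous:", [iso.name for iso in promiscuous])
```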
Figure 1. Length distribution of final, unique, full-length isoforms.
Number of isoforms: 21,742
Min-max length: 352 bp – 9,457 bp
Average length: 3,400 bp
Figure 2. Number of isoforms per locus. 21,742 isoforms were grouped into 9,313 non-overlapping strand-specific loci. The average number of isoforms per locus was 2.3.
We welcome researchers to download and use the dataset for their research. For citation of the dataset, please use:
The Alzheimer brain Iso-Seq dataset was generated by Pacific Biosciences, Menlo Park, California, and additional information about the sequencing and analysis is provided at https://downloads.pacbcloud.com/public/dataset/Alzheimer_IsoSeq_2016/. The data used in the present study was retrieved from PacBio’s online database at https://downloads.pacbcloud.com/public/dataset/Alzheimer_IsoSeq_2016/ (date of retrieval).
[1] B. Wang, E. Tseng, M. Regulski, T. A. Clark, T. Hon, Y. Jiao, Z. Lu, A. Olson, J. C. Stein, and D. Ware, “Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing,” Nat Comms, vol. 7, p. 11708, Jun. 2016.
[2] S. E. Abdel-Ghany, M. Hamilton, J. L. Jacobi, P. Ngam, N. Devitt, F. Schilkey, A. Ben-Hur, and A. S. N. Reddy, “A survey of the sorghum transcriptome using single-molecule long reads,” Nat Comms, vol. 7, p. 11706, Jun. 2016.
[3] J. L. Weirather, P. T. Afshar, T. A. Clark, E. Tseng, L. S. Powers, J. G. Underwood, J. Zabner, J. Korlach, W. H. Wong, and K. F. Au, “Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing,” Nucleic Acids Research, vol. 43, no. 18, pp. e116–e116, Oct. 2015.
[4] D. I. Pretto, J. S. Eid, C. M. Yrigollen, H.-T. Tang, E. W. Loomis, C. Raske, B. Durbin-Johnson, P. J. Hagerman, and F. Tassone, “Differential increases of specific FMR1 mRNA isoforms in premutation carriers,” J Med Genet, vol. 52, no. 1, pp. 42–52, Dec. 2014.
[5] S. P. Gordon, E. Tseng, A. Salamov, J. Zhang, X. Meng, Z. Zhao, D. Kang, J. Underwood, I. V. Grigoriev, M. Figueroa, J. S. Schilling, F. Chen, and Z. Wang, “Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing,” PLoS ONE, vol. 10, no. 7, p. e0132628, Jul. 2015.
A recent cover story in New Zealand Geographic vividly details the efforts to sequence not just the kākāpō genome, but the genomes of every single living kākāpō.
If you missed our earlier blog about this bird, the kākāpō is a member of the parrot family known for its unique attributes: it’s heavy, flightless, and mostly active at night. As author Rebekah White reports in “Decoding Kākāpō,” the remaining members of this species — about 125 of them — live on islands near New Zealand.
White recounts how scientist Jason Howard, a member of Erich Jarvis’s lab at Duke, first became interested in this unusual bird, and pushed to have its genome sequenced as part of the B10K project. After hitting obstacles with other sequencing technologies, Howard found the PacBio sequencing platform, which was finally able to get through the kākāpō genome.
Meanwhile, New Zealand native David Iorns had gotten involved, launching a crowdfunding campaign to resequence every living kākāpō, relying on the PacBio reference assembly to streamline the process. The campaign was part of our Genome Galaxy Initiative and was successfully funded earlier this year. The wealth of genomic information will be used to help save the birds, which are so inbred due to a recent population bottleneck that they struggle to reproduce naturally.
According to White, delivery of the PacBio genome data to the conservation geneticist in charge of the final kākāpō assembly “was as though all his Christmases had come at once.” The genome data will be publicly available, allowing scientists around the world to use this information to better understand evolution, traits like vocal learning, and much more.
We’re lucky to have such a vibrant community of SMRT Sequencing users, and there’s nothing better than getting them together to share tips, exchange ideas, and develop new applications. These upcoming events will facilitate just that — and we hope you can join us!
Frances C. Arrillaga Alumni Center at Stanford University, Palo Alto, Calif.
Our annual West Coast event taking place next week will include the usual day-long user group meeting as well as half-day workshops on bioinformatics and sample preparation, which will take place on September 7th at PacBio headquarters in Menlo Park. In addition to PacBio speakers who will update attendees on the technology roadmap and new applications of SMRT Sequencing, confirmed customer presentations include:
- Yahya Anvar (Leiden University Medical Center)
- Christine Beck (Baylor College of Medicine)
- Dario Cantu (UC Davis)
- Alexandra Dainis (Stanford University)
- Timothy Smith (USDA-ARS)
- Tina Graves-Lindsay (McDonnell Genome Institute)
- Stephen Mondo (Joint Genome Institute)
- Amanda Larracuente (University of Rochester)
- Jason Underwood (PacBio/University of Washington)
Hilton Washington DC North/Gaithersburg
This one-day event will take place right before the Genome in a Bottle Workshop hosted at the NIST campus. With a focus on collaborative informatics for developing and improving data analysis tools for SMRT Sequencing, topics will range from de novo assembly and genome phasing to structural variation, Iso-Seq analysis, and much more. Confirmed speakers include:
- Gene Myers (Max Planck Institute)
- Kin Fai Au (University of Iowa)
- Ali Bashir (Mount Sinai School of Medicine)
- Brett Bowman (PacBio)
- Andrew Carroll (DNAnexus)
- Jason Chin (PacBio)
- Richard Hall (PacBio)
- Sergey Koren (NHGRI)
- Maria Nattestad (CSHL)
- Nik Putnam (Dovetail Genomics)
- Mike Schatz (Johns Hopkins University)
- Yuta Suzuki (University of Tokyo)
- Elizabeth Tseng (PacBio)
- Aleksey Zimin (University of Maryland)
EMEA User Group Meeting
Hesperia Tower Hotel & Convention Center in Barcelona
Our Annual EMEA User Group Meeting will be returning to beautiful Barcelona this winter. Through a combination of presentations and breakout sessions, the meeting will provide a unique opportunity for our users to present how they are applying SMRT Sequencing to their research needs and to share best practices.
On Thursday, December 1st, the event will kick off at 14:00 with presentations and workshops, followed by an evening reception. Presentations and workshops by both users and PacBio staff will resume the following day.
More information to come soon!
A case study produced by QRIScloud, an Australia-based cloud computing service, offers interesting insight into a recent project that is using SMRT Sequencing to generate a reference-quality de novo genome assembly for the grape used to make Chardonnay wine.
The sequencing effort was conducted by collaborating scientists at the Australian Wine Research Institute (AWRI) and the BC Genome Sciences Centre in Canada. This new assembly, which is still undergoing polishing and in-depth analysis, adds to very sparse genome resources for wine grapes. Until recently, the only genome assemblies available were draft-quality ones for the Pinot Noir varietal.
With PacBio long-read sequencing, scientists were able to create an assembly of dramatically higher quality, despite the complex, highly heterozygous genome. “Unraveling the underlying genetic complexities of grapevine and how genetic variation shapes wine quality is critical,” said Simon Schmidt, a project leader and senior research scientist at AWRI. “It can facilitate vine selection and enable the tailoring of wines by winemakers, allowing them to meet ever-changing consumer demands and access new markets.”
Schmidt added, “The combination of PacBio long-read sequencing, QRIScloud bioinformatics infrastructure and newly developed haplotype aware assembly software has already provided a high-quality draft genome.” The team is working on incorporating clone sequence data to improve quality even further.
Stanford’s Euan Ashley wrote a terrific review about the clinical use of genome sequencing for Nature Reviews Genetics. “Towards Precision Medicine” is well worth a read, covering topics from the ethnic background of the human reference genome to public interest in precision medicine. He also covers technical angles such as mapping of sequence reads for variant calling across challenging regions of the genome with known clinical significance.
Ashley’s premise is that many of the current standards in genomics — from sequencers to analysis tools and more — were developed for use in basic research, where the consequences of inaccurate information are less severe than they would be in a clinical setting. Throughout the review, he considers what challenges need to be overcome “to bring genomics up to clinical grade.”
What caught our attention was Ashley’s excellent description of the genomic elements that make the human genome so difficult to interpret accurately: repetitive sequence, structural variants, segmental duplications, and so on. “Much of this genomic complexity is only challenging because of the prevailing technology used to assess it: short-read sequencing,” he writes. “With extensive paralogy, originating in gene families, segmental duplication or pseudogenes, the genomic location of many short reads cannot be determined with confidence.” Repeat expansion disorders, such as Huntington disease, are marked by a long series of simple repeats that are much longer than a short read, making it all but impossible to reconstruct these regions accurately with short-read sequencers.
In another example, he cites regions like the famously polymorphic major histocompatibility complex (MHC) as stumbling blocks for short-read sequencers. “The MHC is challenging to resolve using only short-read approaches because of the lack of a comprehensive catalogue of haplotypes and the intrinsic lack of phase information — that is, knowledge of the parental chromosome of origin — in short reads,” he notes, adding that phasing data is important for a variety of clinical applications, including phasing of the HLA genes housed in this region, which are associated with more than 100 diseases and many drug reactions.
Ashley sees long-read sequencing as a potential solution to many of these problems. “Long-read sequencing facilitates de novo assembly that automatically provides phase information,” he writes. “Such sequencing provides a more complete picture of the genome.” Long reads can easily span structural variants and even long stretches of repeats, making it possible to fully reconstruct these clinically relevant regions. Ashley notes that these larger structural variants have much lower variant calling accuracy with short-read sequencing methods due to their size and issues related to mapping ambiguity. He also points out that “variants that are more disruptive of the open reading frame, such as structural variants (SVs), are generally more likely to cause disease,” and highlights over 25 clinical disorders that are caused by pathogenic structural variants as an example.
Ashley ends by providing a path forward for improved accuracy in clinical genomics through “Reducing reliance on reference sequences, making phasing routine, improving calling of indels and structural variants, characterizing complex areas of the genome through long-read sequencing and maximizing the cost effectiveness of genomic coverage.” He also reminds us of how far we’ve come and what the future holds when we get there. “Fueled by technological advancement, fundamental discovery of genetic elements related to health and disease has been the engine of human genetics for decades,” Ashley concludes. “Building on this foundation, precision medicine will use the knowledge gained to redefine disease, to realize new therapies and to provide hope for generations of patients to come.”
Scientists from Rutgers University and the University of California, Davis, used SMRT Sequencing to study structural variation in maize. They found that this approach delivered more complete information at lower cost than standard methods and generated new findings that could be important for crop breeding.
From lead author Jiaqiang Dong, senior author Jo Messing, and collaborators, “Analysis of tandem gene copies in maize chromosomal regions reconstructed from long sequence reads” was published in PNAS recently. They chose to evaluate SMRT Sequencing for copy number detection as an alternative to short-read sequencing, which doesn’t span long repeats, and BAC cloning, which is prohibitively expensive. “The single most critical parameter is the length of each sequence read to establish overlaps without the need of genomic clone libraries,” the authors write. “Therefore, we tested the new SMRT technology to determine whether we could assemble chromosomal regions from one shotgun DNA sequencing dataset that would comprise large tandem gene copies.”
They chose maize because of its high proportion of repetitive sequence — repeats make up a remarkable 85% of its genome — and focused on the alpha zein gene family. Spread across six chromosomes, the gene family is important because it “acts as a sink for reduced nitrogen in the seed,” the authors explain. In other maize strains, as many as 48 copies of these genes were found.
The team notes that the average read length generated by the PacBio System was “26 times longer than Illumina and 8 times longer than ABI3730, providing us with significantly more contiguous information for shotgun DNA sequence assemblies.” This long-read data enabled the comprehensive genomic picture that the scientists were hoping for: “Based on this high-quality single shotgun DNA sequencing dataset, we were able to use zein gene sequences as digital probes to assemble the entire collection of orthologous regions from [the W22 strain],” they report. A detailed analysis demonstrated that the self-corrected SMRT Sequencing data had an error rate of less than 0.1%.
The use of SMRT Sequencing proved useful “for resolution of large complex repeats or tandem/dispersed gene family clusters,” the scientists conclude. “Given the effectiveness of this approach in maize, we anticipate that it will be of general use with any complex genome including human and, in particular, cancer genomics, where structural changes can be dramatic.”
PacBio customer HistoGenetics was just awarded a major, multi-year contract to perform HLA typing on as many as thousands of samples per week using SMRT Sequencing. The company is a pioneer and global leader in high-resolution sequence-based HLA typing services.
As blog readers know, HLA typing involves analysis of the highly polymorphic human leukocyte antigen (HLA) genes located within the major histocompatibility complex (MHC) on chromosome 6. Accurate HLA typing is essential for research on donor-recipient tissue matching during transplantation, autoimmune disease-association studies, drug hypersensitivity research, and several other applications. But the complexity of the region, which contains thousands of possible alleles, has made it challenging to characterize with short-read sequencing technologies.
With its long reads and high consensus accuracy, SMRT Sequencing has been a natural fit for scientists trying to analyze the HLA region.
In an announcement about this news, HistoGenetics CEO Nezih Cereb said, “In HLA typing, there is no room for errors. The combination of PacBio’s high accuracy and long read lengths to accurately sequence and phase HLA genes is the new gold standard in the field.”
For more, check out this video of Dr. Cereb explaining the utility of SMRT Sequencing for HLA.
A recent publication from scientists at the University of Florida and the University of Missouri used SMRT Sequencing to analyze epigenomic changes that occur when free-living bacteria associate with a host and become symbiotic instead.
Published in Frontiers in Microbiology, “Integrating DNA Methylation and Gene Expression Data in the Development of the Soybean-Bradyrhizobium N2-Fixing Symbiosis” comes from a team of collaborators including lead author Austin Davis-Richardson and senior author Eric Triplett. The scientists aimed to assess the role of epigenetics in bacterial evolution from free-living to symbiont and chose SMRT Sequencing because it generates base-specific modification information as it sequences DNA.
In its symbiotic state, the nitrogen-fixing Bradyrhizobium diazoefficiens colonizes the roots of soybean plants. Previous studies had reported significant differences in gene expression as the bacterium underwent the transition from free-living to symbiont, leading this team to investigate how much of that could be explained by changes in methylation. They sequenced a free-living bacterium and an endosymbiont, generating a single finished 9 Mb contig representing the genome of the free-living organism. This was used as a reference genome for assembly of the symbiotic strain and made it possible for scientists to identify the 681 kb symbiosis island, which contains genes known to be involved in the infection process.
The methylation analysis found 3,276 changes in five DNA motifs between the strains; 768 of them were associated with differentiation from free-living to symbiotic state, representing more than 9% of all genes and more than 35% of genes with differential expression between the strains. Of the altered genes, 80 were located in the symbiosis island but currently have no known function in the differentiation process. Of the five DNA motifs found, four had increased methylation in the free-living bacterium, and intriguingly, the fifth — the 5’-CCTTGm6AG-3’ motif — was only seen in the symbiont.
“These associations between methylation and expression changes in many B. diazoefficiens genes suggest an important role of the epigenome in bacterial differentiation to the symbiotic state,” the scientists report, noting that follow-up studies with more replicates will be needed to fully test the hypothesis.
Scientists from Yale University and Memorial Sloan Kettering Cancer Center used SMRT Sequencing to determine whether antiretroviral therapies were triggering mitochondrial genome mutations in HIV patients. The results were recently published in HIV Medicine (“High frequency of mitochondrial DNA mutations in HIV-infected treatment-experienced individuals”).
The publication, from lead author Min Li, senior author Elijah Paintsil, and collaborators, reports results from an analysis of 71 people, including 47 HIV patients who had received antiretroviral therapy (about half had mitochondrial toxicity) and 24 healthy controls. DNA was isolated from peripheral blood mononuclear cells, and mitochondrial genome sequencing was performed on a PacBio System.
SMRT Sequencing was chosen for its long reads and accuracy, according to the authors. “PacBio technology overcomes some of the limitations of current next-generation sequencing platforms by providing significantly longer reads (> 1 kb), single molecule sequencing, and a single-pass error rate of < 15%,” Li et al. write. “Moreover, SMRT sequencing exceeds the consensus accuracy achieved by other sequencing methods because of the random nature of the errors. The SMRT sequencing achieves results with > 99.999% accuracy.”
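The point about random errors can be illustrated with a deliberately simplified model. The Python sketch below assumes each pass over a base is wrong independently with probability p and that the consensus is wrong only when more than half of the passes are wrong. This is our own toy majority-vote calculation, not how Quiver or any PacBio consensus algorithm works, and the function names are hypothetical.

```python
from math import comb


def majority_error(p, n):
    """Probability that more than half of n independent passes are wrong,
    given a per-pass error probability p (toy binomial model).
    For even n, an exact tie is counted as a correct call."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def passes_needed(p, target_error, max_passes=999):
    """Smallest odd number of passes whose majority-vote error is at or
    below target_error under the same toy model."""
    for n in range(1, max_passes + 1, 2):
        if majority_error(p, n) <= target_error:
            return n
    raise ValueError("target not reached within max_passes")


single_pass_error = 0.15  # upper bound quoted by Li et al.
for n in (1, 3, 5, 11, 21):
    print(n, f"{majority_error(single_pass_error, n):.2e}")

print("passes for <= 1e-5 error:", passes_needed(single_pass_error, 1e-5))
```

Because the model requires all the erroneous passes to outvote the correct ones, the error falls off rapidly as passes are added, which is the intuition behind high consensus accuracy from noisy single passes. Systematic, context-dependent errors would not average out this way.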
The team found that HIV-infected patients, regardless of whether they experienced mitochondrial toxicity or not, had a higher frequency of mitochondrial mutations — in particular, large-scale deletions — than their healthy counterparts. However, they did not find statistically significant differences in mutation load between HIV patients with and without mitochondrial toxicity. “We did not observe mtDNA depletion in HIV-infected treatment-experienced patients with toxicity as expected,” the scientists report. “To our surprise, HIV-infected treatment-experienced patients with toxicity had significantly higher mRNA expression of Pol-γ in comparison with HIV-infected treatment-experienced patients without toxicity (P < 0.05) and HIV-uninfected controls (P < 0.01). This contradicts the Pol-γ inhibition theory.”
The authors attribute some of these novel findings to sequencing and analyzing the whole mitochondrial genome, whereas previous studies have taken less comprehensive approaches. The work also paves the way for increased use of peripheral blood mononuclear cells to characterize the biology of HIV patients, compared to the more commonly used invasive tissue biopsies.
Additional studies will be needed to follow up on these discoveries, the scientists note. “There are currently two proposed mechanisms to explain the association between mtDNA mutations and [antiretroviral therapy (ART)],” they write, “ART directly or indirectly provides a permissive environment for sporadic mtDNA mutagenesis; and ART pressure leads to clonal expansion of existing mutations.”
A paper from scientists at Colorado State University and the National Center for Genome Resources provides an in-depth view of the transcriptome of sorghum, a crop that’s important for human and animal food and also shows potential as a biofuel. Through this project, the team produced a new isoform analysis pipeline for community use and identified novel genes, as well as far more alternative splicing than had been expected for this plant.
The publication, “A survey of the sorghum transcriptome using single-molecule long reads,” comes from lead author Salah Abdel-Ghany, senior author Anireddy Reddy, and collaborators. The researchers were particularly interested in alternative splicing and alternative polyadenylation, two mechanisms that increase transcript diversity and may help plants adapt to stress. “Despite the fact that several large-scale RNA-seq studies have been performed in plants to analyse [alternative splicing], currently it is not known how many distinct splice isoforms are produced,” the team writes. “This is primarily due to challenges associated with short-read sequencing in accurately reconstructing full-length splice variants.”
To directly observe full-length splice isoforms, they turned to long-read SMRT Sequencing to characterize the transcriptome of sorghum seedlings. The scientists also developed the Transcriptome Analysis Pipeline for Isoform Sequencing, or TAPIS, to identify alternative splicing events and evidence of alternative polyadenylation. “The analysis of sorghum Iso-Seq data uncovered over 7,000 novel [alternative splicing] events, ~11,000 novel splice isoforms, over 2,100 novel genes and several thousand transcripts that differ in 3′ untranslated regions due to [alternative polyadenylation],” the team reports, noting that many of the novel genes are putative long non-coding transcripts. The total number of unique transcripts was nearly 28,000, covering more than 14,000 genes.
The scientists discovered a significantly higher rate of alternative splicing than had been expected for sorghum. “Previously, it was reported that pre-mRNAs of ~1,500 genes undergo [alternative splicing] in sorghum,” they note. “In this work, we demonstrate that this number is much higher.” Indeed, they found more than 10,000 alternative splicing events, compared with fewer than 3,000 included in existing gene models. The scientists performed a number of validation studies to confirm these and other novel findings and have made their TAPIS tool available for use in other organisms.
“Pacific Biosciences single-molecule long reads obtained using the Iso-Seq protocol offer a considerable advantage in transcriptome-wide identification of full-length splice isoforms and other forms of post-transcriptional regulatory events such as [alternative polyadenylation],” the team writes.
A paper just out in Nature Communications reports the de novo genome assembly and transcriptome of a Chinese individual, generated from long-read SMRT Sequencing and other technologies. The effort revealed nearly 13 Mb of sequence not included in the GRCh38 reference genome as well as novel gene and alternative splicing content not annotated in GENCODE.
“Long-read sequencing and de novo assembly of a Chinese genome” comes from lead author Lingling Shi at Jinan University and senior author Kai Wang from the University of Southern California, as well as many other collaborators in China and the US. The team was particularly interested in finding population-specific variants, including structural variants, which required the use of long-read sequencing. Assemblies based on short-read sequence data “may have inherent technical limitations in characterizing repeat elements that span longer than the read length, yet repeats and segmental duplications are known to cover approximately half of the human genome,” the scientists write. Using SMRT Sequencing and mapping technology from BioNano Genomics, “we perform detailed characterization of the HX1 genome and demonstrate that long-read sequencing can detect functional elements in human genomes that are missed by short-read sequencing.”
For the genome assembly, the team sequenced DNA from an anonymous Chinese individual (HX1) to 103x coverage, producing a 2.93 Gb genome with a contig N50 of 8.3 Mb. Included in the results were 206 Mb of alternative haplotypes that “were constructed along with the primary contigs,” Shi et al. write. Consensus accuracy for the assembly was 99.73%, matching the accuracy of the well-known NA12878 genome assembly. In an analysis of structural variants, the team found about 20,000 insertions and deletions, with half of them classified as short tandem repeats or mobile elements. Nearly 50 exonic deletions or insertions were specific to the HX1 genome, including one previously characterized deletion that has only been seen in the Asian population.
The team also developed a new gap-filling method to make use of all this sequence data. They determined that nearly 30% of gaps in the GRCh38 reference genome could be addressed with data from HX1. “The total length of filled or shortened gaps amounts to 7.1 Mb,” they report. “We further evaluated the repeat contents within the gaps that can be closed by us, and found that simple repeats and satellite sequences were significantly enriched within the closed gaps compared with GRCh38.”
Using the Iso-Seq method, the scientists also analyzed the transcriptome of this individual and detected more than 58,000 isoforms, including “57 isoforms at 42 loci that do not overlap with any GENCODE transcript,” they write. Follow-up studies for some of the more complex data — such as “a novel transcribed element with at least five exons and six isoforms” — validated these predicted splicing events. They also found at least two genes that have never been identified with short-read data. The team looked for disease-causing variants, finding two that were classified in ClinVar as pathogenic. However, “manual review of the literature cited in the two ClinVar records indicated that both of them represented erroneous database records,” the scientists report. “This analysis highlights the need for extreme caution in interpreting ‘pathogenic’ variants documented in variant databases.”
“In summary, while short-read-based alignment and variant calling based on reference genome remain a common practice to assay personal genomes, de novo assembly by long-read sequencing may reveal novel and complementary biological insights,” Shi et al. conclude. “Furthermore, long-read RNA sequencing may identify novel transcripts that can be missed by short-read RNA sequencing.”
In a new publication from Cold Spring Harbor Laboratory, scientists produced a dataset for what the authors call “the single largest collection of [full-length] cDNAs available in maize” and significantly improved genome annotation. The effort relied on the Iso-Seq method with SMRT Sequencing, which allows scientists to generate ultra-long reads covering full transcripts.
The paper, “Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing,” comes from lead author Bo Wang and senior author Doreen Ware, who is also affiliated with the USDA Agricultural Research Service. It offers the first published results from using the Iso-Seq method on a maize plant and includes a number of advances, such as barcoding to permit cost-effective pooling of tissue samples.
The scientists embarked on the project to see what advantages long-read sequencing offers for transcriptome analysis in a complex plant. “Although data from short-read sequencing have accumulated over recent years, they do not provide full-length (FL) sequence for each RNA, limiting their utility for defining alternatively spliced forms,” Wang et al. write. “In some cases, short-read sequencing generates low-quality transcripts, leading to incorrect annotations.” Long-read sequencing, on the other hand, makes it possible to capture high-quality, full-length transcripts.
The team used PacBio sequencing for six tissue types from the maize line B73, size-selecting libraries with the SageELF system to increase average read length. Results were impressive, with more than 110,000 transcript isoform sequences associated with nearly 27,000 genes, representing 70% of previously annotated maize genes as well as novel isoforms and even some novel genes. Long noncoding RNAs were another highlight: they found nearly 900 novel lncRNAs, many of them significantly longer than previously identified lncRNAs. “Our analysis indicates that the new transcriptome data have enormous potential to improve the current maize annotation,” the authors write. “The 111,151 unique transcripts characterized here almost double the number of transcripts documented in the RefGen_v3 annotation.”
They also analyzed alternative splicing, finding more than twice as many isoforms per gene as exist in the previous maize annotation and contributing thousands of novel isoforms to the public resource. To learn more, they studied methylation patterns associated with alternative splicing and discovered that CHG methylation appears to suppress splicing while CG methylation apparently increases the rate of splicing.
In addition to demonstrating that SMRT Sequencing data could correct mis-annotated gene models, the team showed that long reads are even more important than expected for transcriptome studies. The average transcript length in this project — almost 3 kb — is much longer than that from the previous maize annotation. “These findings show that the prevalence of long transcripts, from both coding and non-coding genes, is higher than previously thought,” Wang et al. write. “Just as the availability of short-read technologies over the last decade heralded an era of tremendous gains in small RNA research, it is reasonable to expect that long-read technologies will prompt a new focus on heretofore poorly understood characteristics of exceptionally long RNAs.”
For more, check out this case study [PDF] of Doreen Ware and the maize project.
We are excited about the release of a new genome assembly for the mosquito Aedes aegypti, which we hope will aid scientists in studying vector-pathogen dynamics, including those of the rapidly spreading Zika virus. The Aedes aegypti Aag2 cell line genome sequence was generated by a joint effort between Raul Andino’s lab at The University of California, San Francisco and PacBio. This cell line was derived by Peleg in 1975 and adapted in 1991 by Lan and Fallon. It was selected for sequencing based on its susceptibility to infection by many arboviruses, including Dengue, Chikungunya, Zika, Sindbis, and Rift Valley Fever Virus, and for its use in molecular biology and virology experimentation. Genome sequencing was performed on genomic DNA purified from cells acquired from the Gamarnik lab at the Fundación Instituto Leloir. The Aag2 cell line is also persistently infected with Cell Fusing Agent Virus, an insect-specific Flavivirus.
The de novo assembly from ~58x shotgun SMRT Sequencing long-read coverage was performed using PacBio’s FALCON assembler, followed by polishing using the Quiver consensus caller. The assembly resulted in 3,752 contigs totaling 1.72 Gb, with a contig N50 of 1.42 Mb. The genome size of this cell line is significantly larger than that of the Liverpool assembly, likely due to the acquisition of a mini-chromosome.
The assembly has been publicly released through VectorBase. It is our hope that this new genome assembly will be of value to the scientific community in their efforts to understand the underlying mechanisms of arboviral infection and transmission in this critical disease-carrying vector.
Many thanks to the nearly 200 scientists who signed up for our East Coast User Group Meeting, and to the Institute for Genome Sciences for hosting us! The event was a hit with customers and PacBio staff alike, and discussions we heard in the hallways and during breaks told us there was some great knowledge exchange that should give labs inspiration for generating and analyzing their SMRT Sequencing data.
The day kicked off with a talk from our own Marty Badgett, senior product manager for the PacBio RS II System and Sequel System. He offered some historical perspective on throughput improvements over time for the SMRT Sequencing platforms (believe it or not, throughput has increased 100x in the past five years) and shared some data produced by the Sequel System. He spoke about the newly released SMRT Link software, which brings existing modules into an integrated workflow and adds components for data management. Badgett also alerted users to four upcoming enhancements: spin column size selection, on-chip additive cleanup, asymmetric SMRTbell templates, and active loading. With these and other changes, Badgett predicted that a 3 Gb genome could be assembled de novo with the Sequel System for $9,000 by next year and as little as $1,500 in 2018.
The event focused on our customers and the impressive results they’ve achieved with SMRT Sequencing. Jerry Jenkins from HudsonAlpha talked about high-quality de novo plant genomes — his team has released six FALCON-assembled plant genomes already, with more coming soon. The simple sequences and repetitive elements common to plant genomes are a major challenge, he said, noting that “it gives Illumina a fit and with PacBio it works well.” He said SMRT Sequencing has been “a game-changer” for his team when it comes to contiguity and completeness of assemblies. He gave several examples, including the bean Phaseolus vulgaris, for which a Sanger/454 assembly had a contig N50 of 39.5 kb while PacBio yielded 1.9 Mb. An analysis of cotton showed that a short-read assembly was missing nearly 500 Mb of sequence, which was recovered in the PacBio assembly.
Another plant presentation came from Victor Albert of the University at Buffalo, who intrigued attendees with his description of the carnivorous aquatic plant Utricularia gibba. Its genome was originally published in 2013 sporting 82 Mb in a fragmented assembly, but a new PacBio assembly shows the genome size is actually about 100 Mb and captures the data in fewer than 600 contigs. One challenge for this plant is its 3 Mb of rDNA repeats. “You can never assemble rDNA repeats,” Albert said of other sequencing platforms. “PacBio zips right through them.” He was delighted to find that the largest contig contained a full chromosome, and noted that the accuracy was excellent. “There’s no sense in doing any polishing with Illumina,” he added. With the new assembly, he’s able to explore telomeres, centromeres, and retrotransposons; he can also separate tandem duplication events from artifacts of previous whole genome duplications.
Rachael Workman from Johns Hopkins presented transcriptome data generated through the Most Interesting Genome SMRT Grant program for the ruby-throated hummingbird, Archilochus colubris. The team used the Iso-Seq method to analyze liver tissue to better understand this bird’s remarkable metabolism, finding 450,000 unique isoforms. While the full analysis is still underway, Workman shared some interesting examples, such as the complete coverage of a glucose transporter gene that shows quite a bit of divergence across avian species.
Continuing the avian theme, Duke’s Erich Jarvis spoke about the B10K project, an effort to sequence every species of bird on the planet. Jarvis’s area of interest is vocal learning, a trait shared by few organisms. His lab has found that PacBio sequencing produces genome assemblies of much higher quality than those from short-read data (in some cases increasing the contig N50 from as little as 30 kb to as much as 10 Mb, and reducing contig counts from more than 120,000 to about 1,000). In one example, he showed that a dopamine receptor that appeared to be lost by many bird species was actually conserved across them; it just wasn’t included in the short-read draft assemblies. Another example was of Egr1, with a promoter region that scientists had long wanted to study but wasn’t assembled in any existing bird genome. With PacBio sequencing, the region is fully assembled for the first time.
Tao Wu from Yale presented findings of A-base methylation in mouse embryonic stem cells (if you missed the seminal Nature paper on this previously unsuspected form of methylation in a mammal, check out our recap). Wu and team developed SMRT ChIP, a method for conducting ChIP-seq on a PacBio sequencer, and discovered N6-methyladenine throughout the genome. “Without PacBio, I think we could not get this done,” he said. The findings were confirmed by mass spec. Wu also tracked down the Alkbh1 demethylase and conducted functional studies that showed the methylation suppressed genes and transposons.
After lunch, the topic shifted to studies of primate and human genomes. Julie Karl from the University of Wisconsin analyzed the MHC locus in macaques, which have a more complex allele assortment than humans do. Using PacBio amplicon sequencing to span the full 1.1 kb locus, she was able to resolve full-length alleles unambiguously and phase haplotypes, finding many novel alleles to contribute to community databases. Karl is now working to expand her immunology investigation to KIR and FCGR, two other complex regions.
NIAID’s Brandon DeKosky spoke about sequencing antibody repertoires, which requires analyzing both heavy chain and light chain elements. Previous protocols involved analyzing three separate amplicons to cover these regions, but with the PacBio system his team has been able to sequence full-length amplicons covering both elements in B cells. “I certainly am in favor of moving everything over to the PacBio,” he said. “We can see this entire molecule at once.” DeKosky and his colleagues have studied cells from humans and rhesus macaques, and are now using transgenic mice to model antibody response to knocked-in precursor genes to determine whether this approach could be used towards developing a vaccine.
In a clinically oriented research presentation, Tetsuo Ashizawa from the Houston Methodist Research Institute reported the analysis of the ATXN10 gene responsible for spinocerebellar ataxia type 10. The gene harbors a pentanucleotide repeat in intron 9, characterized by an ATTCT repeat sometimes interrupted by other five-nucleotide repeats that lead to three broad disease phenotypes. Since the full region can’t be amplified, Ashizawa used CRISPR/Cas9 for target enrichment, making a double-strand break near the region and attaching a SMRTbell adapter for PacBio sequencing. This allowed the team to assess the motifs associated with different subtypes, in one example studying brothers with different phenotypes of the disorder and finding that the difference lay in repeat content.
Maria Nattestad from Cold Spring Harbor spoke about her genome and transcriptome analysis of SK-BR-3, one of the most widely studied breast cancer cell lines with a Her2 amplification. Long-read SMRT Sequencing was able to characterize the whole genome. Nattestad’s Assemblytics tool, which she used to call variants ranging from single-base changes to large structural changes, is now available as a web app. She discussed her reconstruction of how variants worked together to amplify the Her2 oncogene (an effort we profiled in-depth). Using the Iso-Seq method to produce full-length transcripts, she was also able to detect and explain complex gene fusions for which there was previously no DNA evidence.
Our CSO Jonas Korlach wrapped up the day with a talk highlighting some of the biggest advances we’ve seen from SMRT Sequencing, noting that there are now nearly 1,500 papers describing the use and value of this technology. Particularly productive topics include hospital-associated infections, base modifications in bacteria, and phylogenetic profiling of microbial communities. He also spoke about the recent wave of de novo human assemblies from PacBio data, noting that a shift toward diploid assemblies will be important as the community deepens its understanding of human genetic variation. Korlach also cautioned attendees about confusing contigs with scaffolds; while some assemblies boast impressive scaffold N50s, they may be missing a lot of important information if contigs are short and disconnected.
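To make the contig-versus-scaffold caution concrete, here is a minimal Python sketch of how the two N50 values can diverge for the same assembly. It is an illustration under stated assumptions, not a tool used by any of the speakers: the function names (n50, split_scaffold), the toy sequences, and the choice to treat any run of one or more ‘N’ characters as a gap are all hypothetical.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases. Returns 0 for no input."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def split_scaffold(seq, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap 'N' bases.
    The min_gap default of 1 is an assumption made for this illustration."""
    return [c for c in re.split(r"[Nn]{%d,}" % min_gap, seq) if c]


# Toy scaffolds (hypothetical data): sequence blocks joined by runs of N.
scaffolds = [
    "ACGT" * 5 + "N" * 10 + "ACGT" * 3 + "N" * 10 + "ACGT" * 2,
    "ACGT" * 4 + "N" * 5 + "ACGT" * 4,
]

# Scaffold N50 counts the gap characters as part of each scaffold's length.
scaffold_n50 = n50([len(s) for s in scaffolds])

# Contig N50 is computed only on the gap-free pieces.
contigs = [c for s in scaffolds for c in split_scaffold(s)]
contig_n50 = n50([len(c) for c in contigs])

print("scaffold N50:", scaffold_n50)
print("contig N50:", contig_n50)
```

In this toy case the scaffold N50 is 60 while the contig N50 is only 16, because the scaffold lengths include gaps and chain together pieces that are individually short. Reporting both numbers side by side is one simple way to see how much of an assembly's apparent contiguity comes from actual sequence.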
We’d like to thank our partners, who helped us put on this great event: Advanced Analytical Technologies, Covaris, Diagenode, DNAnexus, PerkinElmer, and Sage Science. With their support, we were able to host not only the full-day user group meeting but also half-day workshops on sample prep and bioinformatics.
We’re pleased to launch the latest opportunity to have your favorite microbe sequenced on a PacBio System. The SMRTest Microbe Grant Program, kicking off at this year’s American Society for Microbiology annual meeting in Boston, will give one scientist a free genome sequence for his or her chosen microbe. For those folks unable to make it to the meeting this year, don’t worry — you do not need to attend ASM Microbe to enter the program.
To enter, simply submit a short application describing your microbe or microbial community and how it would benefit from the long read lengths, high accuracy, and direct detection of epigenetic modifications generated by SMRT Sequencing. Entries are due by July 15, 2016.
One winner will be selected by a panel of scientists and will receive up to four library preparations and a sequencing run using up to eight SMRT Cells. Sequencing and bioinformatics analysis will be performed by our SMRT Grant co-sponsor, the Institute for Genome Sciences, a PacBio Certified Service Provider.
If you’ll be attending ASM, learn more about the SMRT Grant program by visiting the booths for PacBio (#304) or the Institute for Genome Sciences (#407). Good luck to all grant applicants!
An updated case study about the Genomics Resource Center (GRC) at the University of Maryland’s Institute for Genome Sciences (IGS) reports that SMRT Sequencing has become an integral tool for generating complete microbial genomes, improving plant and animal genome assemblies, and exploring human genome variation.
The GRC has a scientific pedigree and a sample-to-interpretation service commitment that place it in a league of its own. The team operates under a simple mantra: ‘If it can be sequenced, we can do it.’
Both the GRC and IGS were founded in 2007 when a high-powered team of investigators formerly at The Institute for Genomic Research (TIGR), led by Claire Fraser, joined the University of Maryland School of Medicine. “The team of faculty and staff that came here to start the institute was heavily focused on infectious disease research,” says Luke Tallon, scientific director and founding leader of the GRC. “Our primary goal in joining a medical school was to extend our pathogen genomics expertise into host-pathogen studies and direct clinical genomics applications.”
In addition to its infectious disease and genomics expertise, TIGR was also renowned for its bioinformatics talent — a trait that continues with the group at the GRC. Their team of 15 staff members is evenly split between wet lab and bioinformatics, and more than half of the institute’s 100-plus employees are bioinformaticians.
The GRC was formed both to serve the genomics institute and as a university core. “We serve investigators throughout the University of Maryland system as well as across the country and around the world,” says Lisa Sadzewicz, administrative director of the facility. “Our strength is not just our deep history and experience in sequencing and genomics, but our end-to-end service level from the initial project consultation through to publication, including all of the informatics.”
Over the past five years, the GRC has applied these strengths to the PacBio platform. As early adopters of the technology, they have dedicated significant resources to the development of both laboratory and data analysis processes to leverage SMRT Sequencing. Since its adoption of the original PacBio RS in 2011, the GRC has steadily increased its utilization of the platform. In the past year alone, they have constructed more than 400 libraries and sequenced more than 1,200 SMRT Cells. These have spanned projects ranging from whole genome sequencing to metagenomics, Iso-Seq transcriptomes, and custom amplicons.
The GRC’s newest PacBio sequencer arrived in February 2016, making them one of the first PacBio Certified Service Providers to take delivery of a Sequel System. “Given our history of early adoption and success with the PacBio RS, and the promise of increased and scalable throughput, we were excited to be among the first centers to acquire a Sequel instrument,” says Dr. Fraser. The Sequel System represents the newest generation of SMRT Sequencing, providing more scalability and lower sequencing project costs compared to the PacBio RS II.
Development of processes and applications for the Sequel System is well underway at the GRC. The team plans to use the increased throughput to expand their services for Iso-Seq transcriptome sequencing and amplicon projects. In collaboration with Dr. Jacques Ravel, associate director of IGS, they are also developing a full-length 16S sequencing pipeline to complement and expand their human microbiome and metagenomic research portfolios. As read lengths and throughput on the Sequel instrument improve, Tallon’s team will shift whole genome sequencing projects onto the platform.
For other teams considering whether SMRT Sequencing is the right choice for them, Tallon says: “If you value complete genome sequences, de novo transcript discovery, and are looking at epigenetics in addition to the genome sequence, there’s no better technology out there.”
Check out the full case study to learn more about the GRC’s strength in infectious disease, development of the FDA ARGOS database, and work on large genomes.
The PacBio team headed to New Orleans this past April to take in all the exciting new research presented at the American Association for Cancer Research Annual Meeting, show off our new Sequel instrument, and of course enjoy some crawfish and beignets!
On the first day of the conference, we had the pleasure of hearing a talk from last year’s AACR “What Will You Discover About Cancer?” SMRT Grant winners, Malgorzata Komor and Remond Fijneman from the Netherlands Cancer Institute. Malgorzata discussed her work to discover novel biomarkers for precursor lesions in colorectal cancer, which can be integrated into the fecal immunochemical test (FIT) used by the Dutch national health system. While the current FIT has useful sensitivity for cancer, it is only 27% sensitive for precancerous lesions. In order to uncover new biomarkers, they are mining mass spectrometry data from patient samples. However, ~50% of the spectra cannot be matched back to a known protein. To augment standard protein databases with peptides that may be present in precancerous cells, they have undertaken transcriptome sequencing of cell lines with disrupted splicing machinery. While they used both short-read and PacBio sequencing, they found that the Iso-Seq method particularly excelled at uncovering skipped and retained exon events. A number of the mis-splicing events found from RNA sequencing were confirmed to be translated by mass spectra. They plan to further investigate their biomarker candidates by screening additional patient samples.
Later in the session, we heard from Maria Nattestad, who discussed her work analyzing the genome and transcriptome of the breast cancer cell line SK-BR-3 with PacBio long-read sequencing. In addition to presenting a compelling assembly and overview of the whole transcriptome results, she delved into the details of the extensive rearrangements of the Her2 gene in this highly aneuploid metastatic cell line. She described the use of two new software tools, Sniffles and SplitThreader, to confirm rearrangement events by split-read analysis and to visualize and explore the connectivity of the cancer genome and understand the natural history of complex structural variants.
We also saw a range of posters featuring PacBio long-read sequencing throughout the week, including posters on long-read sequencing from FFPE samples, the methylation profiles of Helicobacter pylori strains linked to gastric cancer, B-cell receptor signaling in non-GCB DLBCL, and targeted capture of 6 kb fragments for cost-effective detection and haplotype resolution of somatic cancer variants.
Finally, in a much-anticipated address at the close of the conference, Vice President Joe Biden shared his vision for a cancer moonshot with AACR attendees. He pledged to “make a decade’s worth of progress in five years” and shared ideas for accelerating the pace of research that had coalesced over the past year spent speaking to a wide range of scientists and leaders in the cancer research community. Biden has picked up the call for a number of reformist ideas that have been gaining traction in recent years within the scientific community: incentivizing the sharing of data, getting publicly funded results out from behind pay walls, simplifying the grant application process, and funding reproducibility studies. With a goal of allocating $800 million in new funding for his initiative, including $600 million for the National Cancer Institute, Biden’s ideas for removing some of the inefficiencies of our science infrastructure should have some heft behind them. He also highlighted a number of research areas as priorities for new funding, including enhanced and early detection technology, cancer vaccine development, cancer immunotherapy and combination therapy, genomic analysis of tumor and surrounding cells, and pediatric cancer research.
Scientists from the UK published new work detailing important advances in protecting potatoes from the disease that caused the Irish potato famine in the 1800s. It’s not just of historical interest; the team points out that late blight disease is once again endangering the food supply, with global yields of potatoes shrinking in recent years.
“Accelerated cloning of a potato late blight–resistance gene using RenSeq and SMRT sequencing,” published in Nature Biotechnology, comes from lead authors Kamil Witek and Florian Jupe, senior author Jonathan Jones, and collaborators at The Sainsbury Laboratory and The Genome Analysis Centre.
In the paper, the scientists describe an ongoing search among wild potato species for genetic resistance to Phytophthora infestans, the pathogen responsible for late blight disease. For this work, they looked for NLR genes (encoding nucleotide-binding, leucine-rich repeat proteins that are essential for activating plant defense mechanisms) in Solanum americanum, a wild Mexican potato relative. The process of enriching for resistance genes, which they call RenSeq, originally began with short-read sequencing but encountered difficulties in de novo assembly of NLR genes due to highly repetitive sequence and gene copies.
“To accelerate [resistance] gene cloning, and to remove any need for construction of a bacterial artificial chromosome or fosmid libraries, we refined RenSeq to capture and sequence fragments of up to 3.2 kb, which is the average NLR gene length,” the authors report. They incorporated SMRT Sequencing and found that most molecules were covered multiple times in individual reads, leading to highly accurate results.
In a comparison of short-read and long-read NLR assemblies, the scientists found that the short-read versions were as fragmented as they suspected, with just 21% of contigs including a full-length NLR gene. They determined that more than 300 long-read sequences were not represented in short-read data and identified several chimeric sequences in the short-read contigs. “The central positions of matching, aligned CLC or SPAdes NLR assemblies to long-read contigs further confirmed the superiority of SMRT RenSeq by also capturing >1 kb of flanking promoter and terminator sequences,” Witek et al. write. “Capture and sequencing of long fragments can resolve any repetitive gene family or structural genome variation by spanning repeat-rich regions with long reads.”
The team says this approach could be very useful in general for rapid engineering of pathogen-resistant crops. “The SMRT RenSeq method has the potential for use in investigating genetic variation for other important traits likely to involve known multigene families such as metabolic pathways (e.g., cytochromes P450, terpene cyclases) or transcription factors, especially if combined with mutagenesis,” they conclude.
The PacBio team is looking forward to joining 3,000 other scientists in Barcelona May 21-24 for the European Human Genetics Conference, better known as ESHG. Organized by the European Society of Human Genetics, this is the 49th year of a high-quality meeting where the latest developments in human and medical genetics are discussed.
This year, we’ll be showcasing our new Sequel System at booth #260 in the exhibit hall. Come visit us and learn more about it! With higher throughput than our previous instrument, we think the Sequel System will be a great fit for the genomics community on projects such as multiplex targeted sequencing and RNA isoform sequencing.
To learn how scientists are already using PacBio sequencing to address unanswered questions in genomics, don’t miss the SMRT Sequencing workshop hosted by Roche Sequencing. The luncheon event will be held on Monday, May 23, from 11:15 a.m. to 12:45 p.m. in rooms 120 and 121. Christine Beck from Baylor College of Medicine will discuss the use of long fragment capture and sequencing techniques to reveal structural variation at clinically relevant loci. Robert Sebra from the Icahn School of Medicine at Mount Sinai will discuss how his lab uses long-read sequencing to gain a more comprehensive view of complex regions of the genome, including pharmacologically important sites, oncogenes, and structural variants linked to genetic disease.
There will also be ESHG talks and posters featuring SMRT Sequencing data in a wide range of applications, including whole genome assembly and haplotyping, immunology, repeat expansion disorders, and non-coding RNAs. Please join us at the following presentations:
Saturday, May 21, 2016, 10:30 a.m. – 12:00 p.m.
Talk Title: E01.1 Long-read sequencing of complex genomes
Speaker: Evan Eichler, University of Washington
Saturday, May 21, 2016, 6:30 p.m. – 8:00 p.m.
Talk Title: A distinct class of chromoanagenesis events characterized by focal copy number gains
Speaker: Matthew Hestand, Leuven, Belgium
Sunday, May 22, 2016, 1:00 p.m. – 2:30 p.m.
Talk Title: Enrichment of unamplified DNA and long-read SMRT Sequencing to unlock repeat expansion disorders
Speaker: Tyson Clark, Pacific Biosciences
Sunday, May 22, 2016, 1:00 p.m. – 2:30 p.m.
Talk Title: C07.6 – Detection of AGG interruptions in FMR1 premutation females by single-molecule sequencing
Speaker: S. Ardui, KU Leuven
Monday, May 23, 2016, 8:30 a.m. – 10:00 a.m.
Talk Title: S11.3 – Mapping Human Long Noncoding RNAs
Speaker: Rory Johnson, Barcelona, Spain
Tuesday, May 24, 2016, 11:00 a.m. – 12:30 p.m.
Talk Title: C23.6 – Identifying novel long non-coding RNAs in the Human genome.
Speaker: M. P. Hardy, Wellcome Trust Sanger Institute
Saturday, May 21, 12:00 p.m. – 2:00 p.m.
Monday, May 23, 10:15 a.m. – 11:15 a.m.
Presentation: P16.07C – Application Specific Barcoding Strategies for SMRT Sequencing
Saturday, May 21, 12:00 p.m. – 2:00 p.m.
Sunday, May 22, 4:45 p.m. – 5:45 p.m.
Presentation: P16.02B – Highly Contiguous de novo Human Genome Assembly and Long-Range Haplotype Phasing Using SMRT Sequencing
Sunday, May 22, 10:15 a.m. – 11:15 a.m.
Presentation: P07.17A – Resolving KIR genotypes and haplotypes simultaneously using Single-Molecule, Real-Time Sequencing
Sunday, May 22, 4:45 p.m. – 5:45 p.m.
Presentation: P15.14B – Full-length and phased CYP2D6 variant genotyping using the PacBio RS II