This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
We were delighted to have so many ASHG attendees join our workshop, titled “Discovering and Targeting Causative Variation Underlying Human Genetic Disease Using SMRT Sequencing.” If you missed it, check out the video recordings, or read our summary below.
The event featured three impressive customer presentations, beginning with Euan Ashley from Stanford University. In his presentation, titled “Towards Precision Medicine,” he started off by acknowledging that “genomic medicine is here” and described how genomes and exomes are now sequenced routinely, with impressive genetic discovery results. For patients with rare and undiagnosed disease, Ashley reported that current sequencing efforts now solve approximately 30% of patient cases, a real improvement over years past. Despite these gains, he said there is still a need for new approaches to make sense of the remaining 70%. “If you can’t see it, you can’t call it,” “the genome is complex,” and “repeat tracts cause disease.” These are a few of the reasons Ashley gave to explain why current short-read NGS methods sometimes fail to achieve the high accuracy levels established by Sanger sequencing for calling known causal pathogenic variants in Precision Medicine studies.
Ashley then told attendees about the unique attributes of PacBio SMRT Sequencing that make it very well suited to address these challenges. He described how longer read lengths expand both the size and types of variants that can be studied. Toward that end, he said that accurately calling structural variants (SVs) is a major need, since short-read technologies work well for single nucleotide variants (SNVs) but not for longer variants. “Long read approaches reveal previously unseen structural variation,” he said, noting that this information is critical for research into repeat expansion disorders and other diseases tied to such variants.
In a first-of-its-kind study, Ashley described the results of a new low-fold long-read WGS method using the PacBio Sequel System (the study was recently published as a bioRxiv pre-print). He reported sequencing a translational research sample from his clinic to an average depth of 8.6-fold coverage. Following mapping and genome-wide SV calling, Ashley said that SMRT Sequencing then allowed his team to identify six novel SVs occurring in OMIM genes in an individual with complex and varied symptoms. One gene was associated with Carney syndrome, which was a match for the person’s physiology and was later validated. He also called for the establishment of a massive SV resource, something like the ExAC repository, that would allow scientists to understand common and rare SVs and further facilitate discovery of causative pathogenic SVs in Precision Medicine studies. In separate work, his team used the Iso-Seq method with personalized haplotyping to determine how precision gene silencing could be used for people with hypertrophic cardiomyopathy.
Our next speaker, Melissa Laird Smith from the Icahn School of Medicine at Mount Sinai, spoke about “SMRT Sequencing as a Translational Research Tool to Investigate Germline, Somatic and Infectious Diseases.” In a fascinating and wide-ranging talk, she offered examples of how the Sequel System has been deployed at Mount Sinai for applications including pharmacogenomics, immune profiling, cancer profiling, and more. She cited Stuart Scott’s CYP2D6 work, which involves amplicon sequencing to understand an individual’s drug metabolism profile. Laird Smith said the team can now multiplex 384 samples on each SMRT Cell for 100-fold coverage on the Sequel System. She also presented work on Fabry’s disease spectrum, for which amplicon sequencing resolves phased mutations to make sense of the X-linked disease. In a personalized cancer therapy pipeline, low-coverage PacBio sequencing is used to validate somatic variants found in tumors. Finally, Laird Smith talked about immune profiling, where SMRT Sequencing of full-length single molecule VDJ sequences provides complete, accurate contigs of this highly variable and complex region.
In the final customer presentation, Michael Lutz from the Duke University Medical Center gave a talk entitled “Identification and Characterization of Informative Genetic Structural Variants for Neurodegenerative Diseases.” Focusing on Alzheimer’s disease, ALS, and Lewy body dementia, Lutz spoke about a recently published software tool that can be used in a pipeline with SMRT Sequencing data to find structural variant biomarkers. His team is particularly interested in short sequence repeats and short tandem repeats, which have already been implicated in neurodegenerative disease. In one example, they used SMRT Sequencing to characterize haplotypes of the low-complexity SNCA gene that could explain the differences between traditional Alzheimer’s and the Lewy body form of Alzheimer’s. In another project, Lutz used SMRT Sequencing to phase haplotypes across APOE alleles — something that wasn’t possible with short-read data — for insight into Alzheimer’s patterns of onset, severity, and more.
Lutz spoke in place of Allen Roses, who was originally scheduled to participate in the workshop but sadly passed away last month. Our CSO Jonas Korlach began the workshop with a tribute to the human genetics visionary, whose legacy in neurodegenerative diseases will be felt for decades to come.
Korlach also spoke about recent SMRT Sequencing updates, such as the Integrative Genomics Viewer update that is now optimized for PacBio data and the recently announced Sequel System chemistry (v 1.2.1) release. It offers a 50-fold reduction in DNA input requirements for 20 kb and 30 kb libraries, with SMRT Cell output ranging from 4 Gb to 8 Gb. He noted the recent data release of structural variation detected in the NA12878 genome, including many more insertions and deletions than short-read-based technologies were able to find. Korlach also congratulated the community of SMRT Sequencing users for their impressive publication rate. In 2016 alone there were more than 1,000 papers published citing the technology! Those include de novo human genomes, deep dives into structural variation and gene expression, plenty of novel findings in human genetic variation, and much more.
Thanks to everyone who participated in the event, and also to the great scientists everywhere who are applying SMRT Sequencing to any number of complex problems to reveal new information about our species and the world around us.
Looks like the sun will be shining on the annual Plant and Animal Genome (PAG) conference next week (despite the recent stormy weather in CA). We’re excited to be a part of the event, which is always a great forum for cutting-edge scientific projects, new ways to apply technology, and networking with leaders in the plant and animal realm.
The 25th annual PAG conference will take place January 14-18 in San Diego and SMRT Sequencing will be featured in a variety of activities throughout the event. Visit us at booth #418 to learn more about SMRT Sequencing, the Sequel System, and our latest chemistry release. Attendees can also sign up for daily ‘expert hours’ featuring educational presentations on the Iso-Seq method, sample preparation, and data analysis.
As usual, we’ll be hosting a workshop at the meeting. “SMRT Sequencing for Complete Genomes” will run from 12:50 pm to 3:00 pm on Monday, January 16th in the conference hotel’s San Diego Ballroom. Speakers include Rebecca Johnson from the Australian Museum Research Institute, Erich Jarvis and Ben Matthews from Rockefeller University, the University of Arizona’s Rod Wing, Richard Kuo from the Roslin Institute and our own Jonas Korlach. Be sure to reserve your seat ahead of time or sign up to get the workshop recording if you can’t attend.
In addition, PAG attendees and any other scientists are welcome to join our SMRT Informatics Developers Conference on Wednesday, January 18th, from 12:00 pm to 4:30 pm (reserve your seat). It will also take place in the Town and Country Hotel, in the Sheffield/Hampton Ballroom. This collaborative event will aim to develop and improve data analysis tools for SMRT Sequencing data, with a focus on de novo genome assembly and Iso-Seq full-length transcript sequencing. Past events have been huge hits and have facilitated great advances in the analysis community. Speakers include:
- Sergey Koren, National Institutes of Health
- Jason Chin, PacBio
- Ben Rosen, United States Department of Agriculture
- Shaun Jackman, University of British Columbia
- Roberto Lleras, PacBio
- Kin Fai Au, University of Iowa
- Richard Kuo, Roslin Institute, University of Edinburgh
- Rachael Workman, Johns Hopkins University
- Richard Hall, PacBio
Lastly, we’re pleased to announce our fourth annual Plant and Animal SMRT Grant Program, which encourages scientists to tell us about their ‘most interesting genome in the world’ for a chance to win sequencing on the Sequel System. Entries are due by Jan. 31.
We hope to see you in San Diego!
We are pleased to announce the launch of a new version of our chemistry, SMRT Cells, and software for the Sequel System. The V4 software, V2 chemistry, and SMRT Cells tuned for the new sequencing chemistry kits will be available on January 23rd.
These new releases allow the system to achieve mean read lengths of 10-18 kb, with half of the data in reads >20 kb, and throughput of 5-8 Gb. This enhancement improves results for important applications such as structural variant detection, targeted sequencing, metagenomics, minor variant detection, and isoform sequencing. The software release includes updates to the base calling algorithm that increase accuracy, as well as new features designed for clinical research applications. In addition to the performance improvements, the Sequel System is now capable of loading 80 kb sequencing libraries.
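To make the "half of the data in reads >20 kb" statistic concrete, here is a minimal Python sketch of how a read-length N50 can be computed: the length L such that reads of length L or longer hold at least half of all sequenced bases. The function name and the toy read lengths are illustrative assumptions, not PacBio software or real Sequel System data.

```python
def read_length_n50(lengths):
    """Return the N50: the largest length L such that reads of
    length >= L together hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy read lengths in bases (made up for illustration only).
toy_lengths = [25_000, 22_000, 21_000, 9_000, 6_000, 4_000, 3_000]
n50 = read_length_n50(toy_lengths)
mean_length = sum(toy_lengths) / len(toy_lengths)
print(f"mean = {mean_length:.0f} bases, N50 = {n50} bases")
```

In this toy set the mean read length is under 13 kb while the N50 is 22 kb, which shows how a run can have a moderate mean and still carry half of its data in reads longer than 20 kb.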
This release improves users’ ability to perform low-fold structural variant detection and key targeted sequencing applications. For structural variant detection, they can now accomplish the same or better quality of results for structural variant analysis using, on average, half the number of SMRT Cells compared to the previously available chemistry. Long reads provided by the new chemistry also enable the detection of larger-scale structural variants; in particular, there is a 3-fold increase in sensitivity of insertions over 5 kb. For targeted sequencing, the new chemistry and software give users more flexibility. For example, for minor variant detection, customers can either gain detection sensitivity or reduce cost per sample with increased sample multiplexing.
In a statement for the release, Kevin Corcoran, our Senior Vice President of Market Development, said, “This release is part of our continued commitment to increasing the performance of the Sequel System, and we are very pleased with the data we are seeing both internally and at our beta-test sites.”
We’re excited to be participating again in the Precision Medicine World Conference (PMWC), an independent conference series founded in 2009 and co-hosted with Stanford Health Care, UCSF, Intermountain Health, Duke University, and Duke Health. Considered to be the preeminent precision medicine conference, it attracts recognized leaders, top global researchers and medical professionals, and innovators across the healthcare and biotechnology sectors.
PMWC provides an exceptional forum for the exchange of information about the latest advances in technology (e.g. DNA sequencing technology), in clinical implementation (e.g. cancer and beyond), research, and all aspects related to the regulatory and reimbursement sectors.
From January 23rd to the 25th, some 1,300 attendees will descend on the Computer History Museum in Mountain View, Calif. The event will kick off the evening of January 22nd with a reception honoring Jennifer Doudna, who will receive the Luminary Award for her groundbreaking work on CRISPR-Cas9 genome editing technology, and James Allison, who will receive the Pioneer Award for his work on cancer immunotherapy through the discovery of immune checkpoint blockade.
PacBio founder Stephen Turner will be speaking at PMWC about the advancements made possible by SMRT Sequencing. If you’ll be attending the meeting, be sure to check out the session “Advancing the Clinic with Emerging NGS Technologies” on January 25th at 10:30 am.
Registration for the meeting is still open. Use the code PacBio by January 15th to receive a 5% discount. We hope to see you there!
A new article in Drug Discovery & Development from Stuart Scott and Yao Yang at the Icahn School of Medicine at Mount Sinai offers a compelling look at enhanced analysis of the CYP2D6 gene. The article, “Long-Read CYP2D6 Sequencing Enables Full Gene Characterization and Novel Allele Discovery,” describes the Mount Sinai team’s efforts to provide better resolution of this region with SMRT Sequencing.
The gene is important for drug development because the enzyme it encodes is involved in metabolizing nearly a quarter of all drugs frequently prescribed today. According to the article, “Variants within this gene can help predict how patients respond to medications ranging from painkillers to antipsychotics, which makes CYP2D6 an essential gene to consider when implementing pharmacogenomics into clinical care.”
The scientists turned to PacBio long-read sequencing to characterize this genomic region because the gene includes many deletions, duplications, and other structural variants that would be challenging to capture using other technologies. To date there are more than 100 known variant alleles of the gene, and new ones are constantly being discovered. Most methods for querying the gene are limited by the number of alleles they can recognize or by their inability to capture complex structural variations and DNA duplications. Implementing SMRT Sequencing, however, allowed the scientists to detect all CYP2D6 alleles with great accuracy as well as to phase alleles.
While validating the new SMRT Sequencing pipeline for CYP2D6 analysis, the scientists discovered three novel alleles. “In fact, ~20% of the samples used to evaluate CYP2D6 SMRT sequencing were revised to either a non-genotyped or novel star (*) allele,” Scott and Yang write, “which highlights how long-read sequencing can reveal previously unrecognized variation in well-studied genes and specimens that were previously tested by other technologies.”
The authors conclude that establishing this approach for CYP2D6 analysis could address some of the consistency problems that have been seen in previous pharmacogenomics studies, which likely have been caused by different allele frequencies among patient populations. They recommend that long-read CYP2D6 analysis should be used for diverse populations in clinical trials to improve the community’s understanding of the natural variation in this gene across ethnicities. This information will be essential for developing better prescription and dosing guidelines for all patients.
A new publication from scientists at the University of California, Davis, and the USDA Agricultural Research Service presents important findings about a fungus that threatens global grape production. As part of the project, the team used SMRT Sequencing to generate a new assembly of the fungal genome, resulting in a more complete assembly than a previous short-read attempt.
“Condition-dependent co-regulation of genomic clusters of virulence factors in the grapevine trunk pathogen Neofusicoccum parvum,” published in Molecular Plant Pathology, comes from lead author Mélanie Massonnet, senior author Dario Cantu, and collaborators. The team was eager to determine why the wood-infecting Neofusicoccum parvum has such pathogenicity and virulence.
The scientists had previously produced a genome assembly for the fungus using short-read data, but it was highly fragmented across more than 1,800 contigs. By contrast, the 43.7 Mb PacBio assembly they generated is represented in only 28 contigs, including one that fully covers the mitochondrial genome. More than half of the contigs had telomeric repeats at both ends, “suggesting that these contigs encompass complete chromosomes, telomere-to-telomere,” the authors write. An analysis found the assembly’s accuracy rate to be 99.99976%.
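For a sense of scale, an accuracy percentage like this can be converted to a Phred-style quality value, QV = -10 * log10(error rate). The sketch below is a minimal Python illustration that assumes the reported figure is a per-base consensus accuracy; the paper's exact evaluation method is not described here, and the function name is hypothetical.

```python
import math


def phred_qv(accuracy_percent):
    """Convert a percent accuracy into a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy_percent / 100.0
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 100%")
    return -10.0 * math.log10(error_rate)


accuracy_percent = 99.99976    # accuracy figure quoted above
assembly_size_bp = 43_700_000  # 43.7 Mb assembly quoted above

qv = phred_qv(accuracy_percent)
# Assumption: errors are spread uniformly over the assembly.
expected_errors = assembly_size_bp * (1.0 - accuracy_percent / 100.0)
print(f"QV ~ {qv:.1f}, expected errors ~ {expected_errors:.0f}")
```

Under those assumptions the figure works out to roughly QV 56, or on the order of one hundred erroneous bases across the whole 43.7 Mb assembly.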
To understand the differences between the new long-read assembly and the existing short-read one, the team used nucmer and Assemblytics. These analyses showed that repeat reconstruction had been a problem in the short-read assembly, where these regions were consistently reported as shorter than PacBio long-read sequencing revealed them to be. More than 180 sites (113 kb in total) were completely missing from the short-read assembly, and structural variation was less likely to be detected.
With this high-quality genome resource as a foundation, the scientists were able to delve into a detailed transcriptome analysis. “Co-expressed gene clusters were significantly enriched not only in genes associated with secondary metabolism, but also with cell wall degradation, suggesting that dynamic co-regulation of transcriptional networks contribute to multiple aspects of N. parvum virulence,” the scientists report. In the majority of these clusters, genes had common motifs in their promoter regions, suggesting that co-regulation is controlled by common transcription factors.
While these findings are important on their own, the scientists underscore the need for additional studies. “Understanding how functions that lead to colonization of certain cell types/tissues, and the corresponding fungal genes activated during subsequent degradation of such host tissues, may help us understand mechanism(s) of cultivar resistance and interactions within the trunk-pathogen community,” they conclude.
To hear more great research from plant and animal scientists using SMRT Sequencing, sign up to attend or receive the recording of our PAG 2017 workshop.
We’re pleased to announce the winner of this year’s SMRT Grant, which launched during the American Society for Microbiology annual meeting this summer. The grant program, co-sponsored by PacBio and the Institute for Genome Sciences (IGS), was very competitive, with over 100 submitted proposals. From this broad range of entries, our judges faced quite a task choosing just one recipient for the grant.
Congratulations to Jessica Sieber from the University of Minnesota Duluth, who impressed reviewers with her proposal, “Metagenomic analysis of the gut microbiota of the 13-lined ground squirrel, a model fat storing hibernator.”
Ground squirrels have been models for human health conditions from diabetes and obesity to longevity and hypothermia. These particular squirrels are scientifically interesting because they almost triple their weight before going into a six-month hibernation, during which they consume nothing. Sieber notes that the hibernation process involves reducing the squirrel’s body temperature to 4 degrees Celsius. While that should be a challenging environment for the animal’s gut microbes, in fact they appear to thrive and may be responsible for folate production to protect the squirrel’s brain during hibernation. A deeper understanding of the role these microbiota play in this process may have downstream implications for human health.
Sieber’s project involves using SMRT Sequencing to produce a high-resolution picture of these gut microbial communities, including how they withstand the cold hibernation temperature. We look forward to learning about the new insights she discovers as a result of this grant!
Thank you to all of the submitters who participated in the grant competition. We look forward to a number of exciting new projects in the coming months!
We recently co-sponsored a webinar with Springer Nature, and if you missed it live, you can now register to watch the recording. Moderated by Nature Publishing Group’s Jayshan Carpen, the webinar is entitled “Reveal hidden genetic variation by combining long-read target capture with SMRT Sequencing” and features several terrific speakers. We’d like to thank Tetsuo Ashizawa from Houston Methodist Research Institute, Melissa Laird Smith from the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai, and our own Meredith Ashby for taking the time to present fascinating data and answer audience questions.
The webinar kicks off with a talk from Ashby, a scientist at PacBio, who discusses the use of targeted capture approaches with long-read SMRT Sequencing. Noting that these read lengths are necessary for spanning large genomic elements such as insertions and deletions, she shares two case studies to illustrate the process and findings. In one, scientists in Australia used long-read sequencing with Roche NimbleGen capture probes to study transcripts of relaxin genes in samples from both cell lines and prostate cancer patients. They discovered new isoforms, including two fusion genes that could easily be missed with other methods. In the second project, scientists at Baylor College of Medicine developed a large-insert targeted sequencing protocol (called PacBio-LITS) to study Potocki-Lupski syndrome, a rare disorder caused by a duplication event on chromosome 17. Their work uncovered possible rearrangement mechanisms that suggest opportunity for even more long-read-powered discoveries in this area.
Next, Laird Smith, the Icahn Institute’s assistant director of technology development, presents data from long-read studies of the IGH locus, a remarkably complex region that encodes the VDJ gene segments which are recombined during the adaptive immune response. Nearly half of the IGH region falls in a segmental duplication, making it quite difficult to sequence. Her team is using long-read sequencing to generate fully resolved haplotypes of the IGH locus from ethnically diverse individuals. They have also developed an oligo-based enrichment and long-read sequencing approach that should make it more straightforward to interrogate this challenging genomic region and generate results that include large structural variants.
Finally, Ashizawa, who is Director of the Neuroscience Research Program at Houston Methodist, speaks about ATTCT repeat expansions in SCA10, which cause a form of spinocerebellar ataxia. He describes a novel enrichment method that uses CRISPR-Cas9 to target the repeat expansion, without the need for amplification. Paired with long-read sequencing, this approach has allowed his team to span extremely long repeat regions while identifying interruption sequence motifs associated with distinct, epilepsy-linked clinical phenotypes. Based on detailed work with family samples, Ashizawa was also able to trace the likely origin of the initial SCA10 mutation, which seems to have occurred first in Asia.
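As a rough illustration of why spanning reads matter for this kind of analysis, the Python sketch below scans a single read for the longest uninterrupted run of a repeat unit. It is a toy under stated assumptions: the function name, the made-up read, and the exact-match rule are all hypothetical, and it is not the method used by Ashizawa's team, which would need to handle sequencing errors and alignment.

```python
import re


def longest_repeat_run(sequence, unit="ATTCT"):
    """Return (start, copies) for the longest uninterrupted tandem
    run of `unit` in `sequence`, or (-1, 0) if the unit is absent."""
    best_start, best_copies = -1, 0
    pattern = f"(?:{re.escape(unit)})+"
    for match in re.finditer(pattern, sequence):
        copies = (match.end() - match.start()) // len(unit)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Made-up read: flank, 6 ATTCT units, one interrupting ATCCT unit,
# 3 more ATTCT units, flank.
toy_read = "GGCA" + "ATTCT" * 6 + "ATCCT" + "ATTCT" * 3 + "TTGA"
print(longest_repeat_run(toy_read))
```

Because the whole repeat tract sits inside one read, both the copy count and the position of the interrupting motif can be read off directly, which is not possible when reads are shorter than the expansion.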
The speakers also respond to audience questions about sequencing GC-rich regions, sample preparation details, read length statistics, and more. This Q&A session nicely expanded on the earlier examples to show how SMRT Sequencing can be combined with capture techniques for an economical means of querying specific regions in the human genome.
You may remember the firefly genome sequencing project, which was a finalist in our recent SMRT Grant competition and ultimately was crowdfunded through the Experiment site and our Genome Galaxy Initiative. We’re thoroughly enjoying the lab updates on this project, and couldn’t resist sharing this latest one from Team Firefly.
In a jubilant note, the scientists report to their funders: “Good News! PacBio long-read sequencing data received.” The update is written with lots of great explanation about the basics of sequencing, analysis, and more for science enthusiasts. Our favorite part is the visualization of read length: the team included the sequence from a full PacBio read, 16,933 bases long, and compared it to the full sequence of paired-end Illumina reads. It’s a clear and compelling look at the extra value long reads can provide.
The update also provides a detailed comparison of data quality, with graphics illustrating single-read errors and how stacking reads for a consensus call overcomes individual errors. “Therefore the ‘consensus accuracy’ of PacBio sequencing ends up being the best of any sequencing technology,” the firefly team writes. Later in the update, they show how PacBio long reads contain data that were missed by short-read sequencers, setting them up to achieve a high-quality reference genome from SMRT Sequencing data.
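The idea of stacking reads to overcome individual errors can be sketched as a per-position majority vote. The Python example below is a deliberately simplified toy: it assumes the reads are already aligned, equal in length, and contain only substitution errors, whereas real consensus callers work from alignments and must handle insertions and deletions. The function name and sequences are made up for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) yields one column (all bases at a position) at a time.
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


truth = "ACGTTGCA"
reads = [
    "ACGTTGCA",  # error-free
    "ACCTTGCA",  # substitution at position 2
    "ACGTTGAA",  # substitution at position 6
    "TCGTTGCA",  # substitution at position 0
    "ACGTAGCA",  # substitution at position 4
]
print(consensus(reads) == truth)
```

Four of the five toy reads carry an error, yet the errors fall at different positions, so each column is still outvoted by the correct base and the consensus matches the true sequence.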
Congratulations to the firefly team on this first look at their new data, and we wish them continued luck as they move ahead with the project!
Last month we attended the annual meeting of the American Society for Histocompatibility and Immunogenetics (ASHI), where we were impressed to see the great progress in scientific research around transplantation, immunogenetics, HLA, vaccines and much more.
There were an increasing number of presentations and posters showcasing new approaches to HLA sequencing. For the last few years, early protocols with NGS were focused purely on exon sequencing. Steady improvement in sequencing technologies has led to a new focus on full-length allele sequencing of all relevant MHC genes. It was great to see leading labs share their advice on the best methods for characterizing the ultra-complex HLA region and we look forward to seeing how a significantly expanded understanding of natural genetic variation will ultimately improve patient care.
We noticed impactful presentations from PacBio users generating high-quality, fully phased HLA alleles without imputation. HistoGenetics CEO Nezih Cereb reported using SMRT Sequencing to analyze 60,000 samples for the National Marrow Donor Program, which operates the Be the Match Registry. During his talk he also called upon the HLA community to make full-length, phased sequences a standard for all HLA typing. In other platform and poster presentations, scientists spoke about deploying SMRT Sequencing for KIR gene sequencing, validating SNP linkages with HLA types, generating full-length HLA gene references for submission to a public database, and novel or discrepant allele sequencing.
Many scientists who participated in the original Human Genome Project shared a grand vision that individual genomes would one day be part of routine medical care. Genomics veteran Richard Gibbs, founder and Director of the Genome Sequencing Center at Baylor College of Medicine, tells Mendelspod host Theral Timpson in a new podcast interview that “we are more than halfway [there].”
In the podcast, Gibbs shares his perspective on the complementary roles that genomics and genetics approaches have in driving our understanding of human biology. He notes that long before the Human Genome Project gained momentum, the discovery of human single gene defects in pediatric disease “had been long regarded as a very good way to advance biology and to advance knowledge.” The study of Mendelian disease is not only of tremendous benefit to patients and their families, but it is also a powerful empirical approach to advancing biological understanding. However, the Human Genome Project marked the beginning of efforts to develop a complementary, more systematic approach to genomics. He notes that “there really was this departure between human genetics and genomics for a decade and a half or more, really because of the demands of doing the genome project there was too much to do to stop and think about some of these more fundamental problems in genetics.”
Gibbs observes that we have now entered a new chapter, where we are closing the loop and marrying the tools and methods developed to sequence and study the human genome with the single-gene approach to unraveling complex systems. In recognition of the importance of single-gene diseases, six years ago the US government created a national program to centralize, streamline, and accelerate the study of Mendelian disease with the establishment of Centers for Mendelian Genomics at the Baylor College of Medicine and several other sites. Gibbs notes that at Baylor, they are combining the knowledge, resources, and tools developed during the Human Genome Project with their longstanding role as a leader in hospital and patient care services, not only helping families but hopefully generating insights into a wide range of common diseases which share phenotypes with single-gene disorders.
Gibbs explains that currently, from a cohort of unscreened, unselected patients suspected of having a disorder caused by a single gene defect, only 25-28% can be diagnosed with a known Mendelian disease. The mainstay diagnostic tool they use at the Baylor Center for Mendelian Disease is whole exome sequencing (WES), but given the low solve rate, one of their goals is to learn how they can supplement WES with new technologies and methods to arrive at what he calls the “best quality genome.” He states that long-read sequencing will have a role to play in this, noting that with PacBio sequencing systems, “we can resolve parts of the genome that are difficult otherwise,” and that for at least a small number of genomes, we need “the best amount of information we can possibly get.” One idea he has for a supplementary method is a SMRT Sequencing test to examine short tandem repeats, which can cause disease when expanded, such as in Fragile X syndrome and various ataxias. Such a test would “take all of the regions in the human genome which are known to cause a disease phenotype in an expanded state, and … test for them in a single test or method – enrich for flanking regions, then sequence through the long repeats.” They are also exploring questions such as, “Can we use PacBio [sequencing] to revisit cases where we don’t yet have an answer from short read methods?” and “Can we benefit from taking a small fraction of the samples and adding in long reads to improve haplotype descriptions, then use that information across the whole study?” The key to making progress, according to Gibbs, is to “keep your eye on the ball – the quality of the genome product – rather than focusing on any one method of getting there.”
Towards the end of the interview, Gibbs shares his assessment of our progress towards integrating genomics and medicine. From Gibbs’ perspective, the biggest obstacle to moving genomics fully into the clinic is the need for a more complete deciphering of the genome and a better understanding of all the allelic contributions to complex disease. “You can’t take your sequencing to the doctor’s office right now as an adult” and expect them to tell you about your future risks for common diseases, he says. “We just in general don’t have that information.” That said, Gibbs argues that “by many criteria, what we see in front of us today is just mind-blowing and represents an enormity of progress.” He is particularly heartened by the evolution in how medical research is now being conducted, which he attributes in part to the increasing incorporation of the science of genomics into medicine. “If you look at the state of medicine today, and the more scientific nature of many of the investigations in clinical science, they’re more digital, they’re more genomic, they’re more focused on comprehensive experiments,” Gibbs explains. “We are seeing an enormity of change in many arenas, culturally, knowledge-based, and now at last somewhat in therapeutics.” Overall, Gibbs concludes, “things are going pretty well in my view, and although we don’t have a pill for everything, we are really on the right trajectory.”
In a recent Mendelspod interview, host Theral Timpson talked with Valerie Schneider of the National Center for Biotechnology Information about the work of the Genome Reference Consortium (GRC) to bring more ethnic diversity to the latest human reference assembly (GRCh38).
Describing the reference genome as something like a Rosetta Stone for scientists working with genomic data, Schneider says it is “really the central piece of data upon which most genomics-based analyses are done, [serving as] the coordinate system for annotations ranging from genes to repeats to epigenomic markers.”
As the importance of increasing the representation of population diversity in this reference has become more appreciated, the GRC team has worked to bring many more ethnic populations into the reference. Currently this is being addressed by patches and the use of alternate loci scaffolds, Schneider says. Scientists working with population graphs are among the early adopters of these new alternate loci scaffolds.
The new ethnic genomes “are also intended to stand on their own as complements to the reference so users can examine variation for their own samples in the context of these different backgrounds, or they can try to get genome-wide views of different populations,” Schneider says.
She and her colleagues are looking forward to seeing the ways in which this information will be used by both the basic and the clinical research communities. One effort she is particularly excited about is underway at the McDonnell Genome Institute at Washington University, where scientists are generating a set of high-quality, de novo whole genomes from a wide variety of populations.
At the ASHG annual meeting earlier this month, the GRC hosted a workshop entitled “Getting the Most from the Reference Assembly and Reference Materials: Updates and Developments from the Genome Reference Consortium (GRC) and Genome in a Bottle (GIAB).”
We’re excited to announce that we’ll be working closely with two programs that are committing significant resources toward generating reference-quality genomes of thousands of vertebrate species. Both the Genome 10K (G10K) and Bird 10,000 Genomes (B10K) initiatives have invested in SMRT Sequencing to build high-quality de novo genome assemblies for the next phase of their programs. By sequencing large numbers of vertebrates, the groups hope to develop resources that will be useful for species conservation efforts in the future.
The G10K project was established in 2009 by a consortium of biologists and genome scientists, including Duke neurobiologist Erich Jarvis, Steve O’Brien of the Dobzhansky Center for Genome Bioinformatics, David Haussler and Beth Shapiro of the UC Santa Cruz Genome Institute, and Oliver Ryder of the San Diego Zoo Institute for Conservation Research. Together they determined to sequence the genomes of 10,000 vertebrate species by 2020. The B10K project, launched in 2015 and co-led by Jarvis along with Guojie Zhang of BGI and Thomas Gilbert of the University of Copenhagen, is an initiative to generate representative draft genome sequences for all 10,500 bird species, also within the next five years.
These groups have already contributed genomic resources to the conservation biology community. They collaborated for the first phase of the projects, yielding outcomes such as the Avian Phylogenomics Project, which involved more than 200 scientists and sequenced the genomes of more than 45 new bird species. At the start they used short-read technologies, but have since discovered that with long-read SMRT Sequencing they can produce de novo assemblies of complex genomes with much higher quality.
Jarvis recently sequenced two bird species with SMRT Sequencing, generating high-quality assemblies with long, gapless contigs, half of which were several megabases in length or larger. For example, for Anna’s hummingbird (Calypte anna), the project significantly increased the number of complete genes and reduced the number of contigs compared to a previous short-read assembly, from 124,000 contigs using short-read sequencing to 1,000 using SMRT Sequencing. In a separate sequencing project for zebra finch, PacBio Sequencing fully resolved gaps in the Sanger reference and detected errors in the previous reference genome. For additional details, check out our recap of Jarvis’s talk at this year’s East Coast user group meeting.
Now, the G10K and the B10K initiatives will include Sequel Systems for the next phases of their work. They intend to sequence the genomes of several thousand vertebrate species with PacBio technology for diploid-resolved, high-quality de novo genome assemblies, and perform subsequent chromosome-level scaffolding with complementary approaches, including BioNano Genomics’ optical genome mapping, Dovetail’s proximity in vitro genome mapping, and Phase Genomics’ Hi-C mapping. To that end, Jarvis, who is now at The Rockefeller University and affiliated with the New York Genome Center, has ordered two Sequel Systems and plans to bring on three additional units. Several other global leaders of the G10K and B10K consortia will also contribute use of their recently acquired Sequel Systems toward their goal of creating de novo assembled vertebrate genomes, including Harris Lewin at UC Davis in the USA, Richard Durbin at the Sanger Institute in the UK, Gene Myers at the Max Planck Institute of Molecular Cell Biology & Genetics in Germany, and Guojie Zhang with affiliations at BGI in China and Denmark.
Jarvis and other members of the G10K and B10K consortia recently submitted a proposal to the MacArthur Foundation’s new 100andchange competition, hoping to secure $100 million to create a Digital Noah’s Ark Genome Library of all 8,000 endangered vertebrate species on Earth. In addition, the G10K and B10K consortia decided that their goals and the MacArthur proposal will be stages of a larger, longer-term effort to populate the Digital Noah’s Ark Genome Library with high-quality blueprint genomes of all ~66,000 vertebrate species in the world through an umbrella program called the Vertebrate Genomes Project. It’s an audacious goal and we wish them luck in the competition!
Recent de novo assemblies of individual human genomes have uncovered thousands of structural variants, many of which are accessible only with PacBio long reads [1-3].
|Personal Genome||PacBio Coverage||Deletions ≥50 bp||Insertions ≥50 bp|
A similar increase in structural variant sensitivity relative to short-read methods has been demonstrated with low-fold coverage PacBio sequencing interpreted against the reference genome. To demonstrate and evaluate the low-fold coverage approach on the PacBio Sequel System, we generated approximately 10-fold coverage of the well-studied human sample NA12878.
Purified DNA for NA12878 was obtained from Coriell, sheared to an average size of 25 kb, converted to SMRTbell templates, and size selected to 15 kb on the BluePippin system (Sage Science). The resulting library was loaded on 10 SMRT Cells. Each SMRT Cell was run for 6 hours on the Sequel System with chemistry v1.2 (an older chemistry than was used for recently released Arabidopsis data, which uses the newer chemistry v1.2.1 and has a yield of about 5 Gb per SMRT Cell and read length N50 of 16.4 kb). In total, the runs generated 32.8 Gb of data contained in 3.4 million reads with half of the bases in reads longer than 11.8 kb.
|Run Time||60 hrs|
|Number of Bases||32.8 Gb|
|Number of Reads||3.4 M|
|Read Length N50||11,823 bp|
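For readers less familiar with these run statistics, here is a minimal Python sketch of how a read length N50 and an average fold coverage can be computed. It is an illustration only: the toy read lengths are made up, the function names are hypothetical, and the 3.1 Gb genome size is an assumed round figure that does not appear in this post.

```python
def read_length_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Toy read lengths (made up for illustration, not the NA12878 reads).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(read_length_n50(toy_lengths))  # 8: the two longest reads hold 16 of 32 bases

# Totals taken from the table above; the genome size is an ASSUMED round number.
TOTAL_BASES = 32.8e9
NUM_READS = 3.4e6
ASSUMED_GENOME_SIZE = 3.1e9
print(round(fold_coverage(TOTAL_BASES, ASSUMED_GENOME_SIZE), 1))  # 10.6
print(round(TOTAL_BASES / NUM_READS))  # mean read length in bp: 9647
```

Under that assumed genome size, the totals work out to roughly 10-fold coverage, in line with the figure quoted above. The mean read length is shorter than the N50 because the N50 weights reads by the bases they contain.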
Reads were mapped to the GRCh37 human reference genome with NGM-LR, and structural variants were called with PBHoney. A total of 7,386 deletions and 7,445 insertions of at least 50 bp were identified and comprise the “10-fold SV call set.”
Visualizing Structural Variants
Ongoing improvements to the IGV browser (available now in the development version) improve visualization for PacBio reads and structural variants. With these updates, IGV provides a clear representation of deletions, insertions, and trinucleotide repeats, and shows how long reads span structural variants.
Heterozygous 315 bp deletion at chrX:116,454,160-116,454,859
Homozygous 328 bp insertion at chr10:92,213,800-92,216,245
FMR1 trinucleotide repeat small expansion at chrX:146,993,200-146,993,950
Evaluation of 10-fold Call Set
To quantify sensitivity, the 10-fold SV call set was compared to a merged NA12878 “truth” set from the 1000 Genomes Project and Genome in a Bottle.
|Set||Platform||Deletions ≥50 bp||Insertions ≥50 bp|
|truth: 1000 Genomes + GIAB [8,9]||Illumina||3,021||1,090|
|10-fold SV call set||PacBio Sequel||7,386||7,445|
The 10-fold SV call set recalls 86% of truth set deletions and 81% of insertions. Moreover, it includes thousands of deletions and insertions that are not in the truth sets, most of which are directly confirmed by a FALCON-Unzip de novo assembly from 60-fold PacBio RS II coverage.
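The post does not say how calls were matched to truth-set variants, so the following Python sketch only illustrates the general idea of a recall calculation. The `SV` record, the `same_event` rule (same type and chromosome, position within 500 bp, sizes within a factor of two), and the toy variants are all assumptions made for this example, not the criteria used in the analysis above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # start coordinate of the variant
    size: int     # length in bp
    svtype: str   # "DEL" or "INS"


def same_event(a, b, max_dist=500, min_size_ratio=0.5):
    """ASSUMED matching rule: same type and chromosome, nearby position, similar size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio


def recall(truth, calls, svtype):
    """Fraction of truth-set variants of one type matched by at least one call."""
    truth_of_type = [t for t in truth if t.svtype == svtype]
    if not truth_of_type:
        raise ValueError(f"no truth variants of type {svtype}")
    found = sum(any(same_event(t, c) for c in calls) for t in truth_of_type)
    return found / len(truth_of_type)


# Toy data, invented for illustration only.
truth = [SV("chr1", 1000, 300, "DEL"), SV("chr2", 5000, 120, "DEL"), SV("chr3", 700, 328, "INS")]
calls = [SV("chr1", 1040, 310, "DEL"), SV("chr3", 650, 300, "INS"), SV("chr5", 100, 60, "DEL")]

print(recall(truth, calls, "DEL"))  # 0.5
print(recall(truth, calls, "INS"))  # 1.0
```

Taken at face value, the percentages above correspond to roughly 2,600 of the 3,021 truth-set deletions and roughly 880 of the 1,090 insertions; the exact matched counts are not given here.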
In summary, this 10-fold SV call set demonstrates that low-fold coverage sequencing on the PacBio Sequel System is an affordable, effective approach for identifying structural variants and provides much improved sensitivity compared to short-read approaches. We are excited to see how this approach will be extended and applied to study genetic variation in disease cohorts, in human populations, and in other organisms.
To illustrate the low-fold coverage structural variant calling workflow, the NA12878 Sequel data is available for analysis on DNAnexus.
Chaisson MJ, et al. (2015). Nature, 517(7536):608-11.
Shi L, et al. (2016). Nat Commun, 7:12065.
Seo JS, et al. (2016). Nature, 538(7624):243-7.
English AC, et al. (2014). BMC Bioinformatics, 15:180.
English AC, et al. (2015). BMC Genomics, 16:286.
Robinson JT, et al. (2011). Nat Biotechnol, 29(1):24-6.
Parikh H, et al. (2016). BMC Genomics, 17:64.
Sudmant PH, et al. (2015). Nature, 526(7571):75-81.
We’re here in rainy but beautiful Vancouver for the American Society of Human Genetics annual meeting. ASHG 2016 promises to be every bit as fascinating as always, with great speakers, excellent sessions, and thought-provoking posters.
The PacBio team will be based in booth #718, and we encourage you to stop by to see the Sequel System and learn more about how SMRT Sequencing has already made a genuine difference in our understanding of human genetics. We’re impressed by the wide variety of ASHG posters citing PacBio data this year and hope you get a chance to peruse them.
We’ll be hosting a luncheon workshop at ASHG called “Discovering and Targeting Causative Variation Underlying Human Genetic Disease Using SMRT Sequencing.” The event will be held on Thursday, October 20th, at 1:00 pm PDT in the Crystal Pavilion Ballroom at the Pan Pacific Hotel (connected to the convention center). We have a great speaker lineup:
Euan Ashley, Stanford University
Towards Precision Medicine
Melissa Laird Smith, Icahn School of Medicine at Mount Sinai
SMRT Sequencing as a Translational Research Tool to Investigate Germline, Somatic and Infectious Diseases
Michael Lutz, Duke University Medical Center
Identification and Characterization of Informative Genetic Structural Variants for Neurodegenerative Diseases
Jonas Korlach, PacBio
A Future of High-Quality Genomes, Transcriptomes & Epigenomes
We hope to see you at ASHG!
In a Nature Methods paper released today, scientists describe new bioinformatics tools for producing diploid genome assemblies from SMRT Sequencing reads. FALCON (Fast ALignment and CONsensus for assembly) and FALCON-Unzip were developed by PacBio scientists in collaboration with researchers at Johns Hopkins University, Cold Spring Harbor Laboratory, the Joint Genome Institute, and other institutions.
“Phased diploid genome assembly with single-molecule real-time sequencing” comes from lead authors Chen-Shan Chin and Paul Peluso, senior author Michael Schatz, and collaborators. In the publication, the team details how FALCON and FALCON-Unzip work and presents data from several validation studies of organisms including Arabidopsis, the Cabernet Sauvignon grape, and the diploid fungus Clavicorona pyxidata.
“Currently available genome assemblies rarely capture the heterozygosity present within a diploid or polyploid species,” Chin et al. write. “Most assemblers output a mosaic genome sequence that arbitrarily alternates between parental alleles.” That leads to a loss of important information about differences between homologous chromosomes. To address this issue, the team developed the diploid-aware FALCON assembler and FALCON-Unzip, a tool for resolving haplotypes. Both tools are open-source.
As the authors describe it, “The FALCON assembler follows the design of the hierarchical genome assembly process (HGAP) but uses more computationally optimized components.” FALCON builds a string graph with bubbles representing differences between paired chromosomes. “FALCON-Unzip identifies read haplotypes using phasing information from heterozygous positions that it identifies,” they add. The phased reads are used to construct contigs for both haplotypes as well as the unique sequence for each chromosome, resulting in a “final diploid assembly with phased single-nucleotide polymorphisms (SNPs) and structural variants (SVs).”
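To give a feel for what "identifying read haplotypes from heterozygous positions" means, here is a deliberately simplified Python sketch. It is not the FALCON-Unzip algorithm, which works on the assembly graph and on real read alignments. It assumes error-free reads, biallelic sites coded as 0 or 1, and a hypothetical `phase_reads` helper that greedily links each read to a growing haplotype through the sites they share.

```python
def phase_reads(reads):
    """Greedy toy phasing.

    reads: list of dicts mapping heterozygous position -> allele (0 or 1).
    Returns (labels, hap_a): a label "A", "B", or None for each read, plus
    the alleles inferred for haplotype A (haplotype B is its complement).
    """
    hap_a = {}
    labels = []
    for read in reads:
        shared = [pos for pos in read if pos in hap_a]
        if hap_a and not shared:
            # No heterozygous site in common: the read cannot be linked.
            labels.append(None)
            continue
        agree = sum(read[pos] == hap_a[pos] for pos in shared)
        on_a = 2 * agree >= len(shared)  # ties (and the first read) go to A
        labels.append("A" if on_a else "B")
        for pos, allele in read.items():
            # Extend haplotype A; a read on B contributes the opposite allele.
            hap_a.setdefault(pos, allele if on_a else 1 - allele)
    return labels, hap_a


# Toy reads, invented for illustration.
toy_reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 0},
    {100: 0, 300: 1},
    {300: 0, 400: 1},
    {900: 1},
]
labels, hap_a = phase_reads(toy_reads)
print(labels)  # ['A', 'B', 'A', 'B', None]
print(hap_a)   # {100: 0, 200: 1, 300: 1, 400: 0}
```

The sketch also shows why read length matters for phasing: a read can only be assigned when it spans at least one heterozygous site already placed on a haplotype, so longer reads link more sites into a single phased block.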
The team assembled a trio of Arabidopsis plants for validating the accuracy of the haplotype separation, then applied the tools to the fungus and wine grape genomes. “In all three genomes that we studied, the FALCON/FALCON-Unzip assembly was two- to three-fold more contiguous than alternative long-read assemblers and 30- to >100-fold more contiguous than state-of-the-art short-read assemblers,” they report. In Arabidopsis, for instance, they were able to resolve haplotype chromosomes for almost the entire genome. In the V. vinifera grape, the diploid assembly revealed high variation rates in homologous regions, and in C. pyxidata it showed long stretches of much lower heterozygosity than expected.
This new view of genomes could have major implications for characterizing methylation, gene expression, and regulatory elements. “More systematic study of phased diploid references will expose the detailed cis-regulatory mechanisms of differential expression in diploid genomes to improve our general understanding of the biology beyond haploid genomes,” the scientists write. “Looking forward, we expect many new opportunities for understanding diploid and polyploid genomic diversity and its impact on genome annotation, gene regulation, and evolution.”
In a paper published today in Nature, scientists from Seoul National University, Macrogen, and other institutions present the de novo genome assembly for a Korean individual. The effort used SMRT Sequencing and other technologies to generate the assembly, fully phase all chromosomes, and perform detailed analyses of structural variation and other elements. In the process, the team generated novel sequence data that helps fill gaps in the human reference genome and continues the trend of developing important new population-specific resources.
The work, reported in “De novo assembly and phasing of a Korean human genome,” was contributed by lead authors Jeong-Sun Seo, Arang Rhie, Junsoo Kim, and Sangjin Lee, senior author Changhoon Kim, and collaborators. The authors note that standard NGS approaches could not have accomplished the high-quality genomic resource they required. “Simple alignment of short reads to a reference genome cannot be used to investigate the full range of structural variation and phased diploid architecture, which are important for precision medicine,” they write. “By contrast, the single-molecule real-time (SMRT) sequencing platform produces long reads that can resolve repetitive structures effectively.”
For this effort, the scientists performed genome sequencing with PacBio technology and then integrated data from orthogonal platforms such as BioNano Genomics. SMRT Sequencing alone produced a highly accurate de novo assembly with 3,128 contigs and a contig N50 length of nearly 18 Mb. Combined with BioNano data and polished with Illumina sequence, the final assembly “is characterized by marked contiguity that has not been achieved by non-reference assemblies of the human diploid genome so far, and improves on the previous best N50 length by 18 Mb,” the scientists note. Ninety percent of the genome is covered in the largest 91 scaffolds.
That assembly was compared to the human reference, GRCh38, where it closed 105 of 190 remaining euchromatic gaps and extended into 72 more, adding about 1 Mb of novel sequence. “These locations, previously intractable using only short reads, commonly contained simple tandem repeats,” the authors report.
The contiguity of the assembly allowed scientists to delve deeply into structural variation, identifying more than 18,000 variants—nearly 12,000 of which had never been reported before. “Of the new SVs, 86% were highly enriched for clusters of mobile and tandem repeats,” the team writes. A look at insertions found that almost half had significant variability in frequency across populations, while nearly 10 percent of them were specific to people of Asian descent. (This follows the pattern seen with other population-specific assemblies, such as the recently published Chinese genome.)
Finally, the scientists constructed separate assemblies for each haplotype to more accurately represent the diploid genome. To assess the results, they examined the HLA complex, finding that phasing had been successful despite a large amount of structural variation. “Our approach also allowed a clinically important duplication of CYP2D6 to be detected and assigned to one phase,” the scientists report. “This result demonstrates that de novo assembly-based phasing has advantages in resolving challenging hypervariable regions, and could be used further for pharmacogenomics.”
The scientists note that this work produced “the most contiguous diploid human genome assembly so far,” supporting the idea that integrating technologies leads to optimal results for detecting structural variants and other elements that have been impossible to resolve with short reads. They also remind the community that many more population-specific resources will be important for realizing the potential of genomics. “Our findings demonstrate the important genomic differences of Asian ancestral group from the others, and highlight the need for further genomic studies focused on individuals outside of European ancestry to describe the full range of functionally important variations in humans,” they write.
More than 150 SMRT Sequencing users gathered at Stanford University for our annual West Coast User Meeting & workshops earlier this month. Many thanks to all the scientists who attended and shared their research. For anyone who couldn’t make it, we’ve included some highlights from each talk below (and links to download the full presentations when possible):
The event began with Marty Badgett, our senior product manager for the Sequel System and the PacBio RS II, discussing recent technology updates. He presented the most recent results from the Sequel System highlighting resequencing and small and large genome applications. Specifically, two metagenomic- and one immunology-targeted sequencing datasets demonstrated high single-molecule accuracy with over 225,000 reads each at >QV30. Next up were large insert libraries, showing a range of data from bacterial, plant, and animal projects. These featured the benefits of a nearly 7-fold increase in the number of reads coming from the larger Sequel SMRT Cell, with half of the data coming from reads >14,500 bp each. Finally, Marty showed the history of our development efforts on the PacBio RS platform and how we are applying those understandings to future developments on the Sequel System, including improvements to read length and reductions in input library amounts.
Kicking off the user presentations, Yahya Anvar from Leiden University Medical Center presented results from using SMRT Sequencing to study drug metabolism, specifically variants in the CYP2D6 gene. Anvar’s team has developed a CYP2D6 genotyping approach that enables his group to obtain high-quality, full-length, phased CYP2D6 sequences. According to Anvar, this leads to accurate variant calling and haplotyping of the entire gene locus, including exonic, intronic, and upstream and downstream regions. In addition to accurate characterization of variants within this locus, they can reliably describe copy-number changes, rearrangements, and gene conversions that have been missed by standard genotyping assays. He concluded that this method provides a powerful framework to infer drug response phenotype.
Christine Beck, a postdoctoral fellow at Baylor College of Medicine, discussed the use of target capture for complex genomic loci. The team uses targeted large-insert capture of human chromosome region 17p11.2 combined with long-read PacBio sequencing, which has allowed them to identify novel breakpoint junctional sequences in previously intractable repetitive DNA at this locus. She detailed the use of genomic approaches to characterize additional rearrangements of this structurally complex region and described mechanistic insights into genomic rearrangement formation that have been gleaned from these data.
Aaron Wenger, a senior staff research scientist at PacBio, spoke about improved support for long reads in the Integrative Genomics Viewer (IGV). New features include a quick consensus mode that suppresses random base-pair errors, quick phasing to group reads based on the nucleotide at a selected heterozygous variant, and labels for large insertions and deletions to reveal structural variants. He also presented examples of the extended IGV to explore haplotype phasing and structural variants in a human whole genome sequence. He noted that the development build of the viewer is available to download.
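As a rough illustration of the quick phasing idea, grouping reads by the nucleotide they carry at one selected heterozygous position, here is a small Python sketch. It is not IGV's implementation: the `(name, start, sequence)` read tuples, the `group_reads_by_base` helper, and the assumption of gap-free alignments are all simplifications made for this example.

```python
from collections import defaultdict


def group_reads_by_base(reads, position):
    """Group read names by the base observed at a reference position.

    reads: iterable of (name, start, sequence) tuples, where start is the
    reference coordinate of the first base and the alignment is ASSUMED to
    have no insertions or deletions. Reads not covering the position are skipped.
    """
    groups = defaultdict(list)
    for name, start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):
            groups[seq[offset]].append(name)
    return dict(groups)


# Toy alignments, invented for illustration.
toy_reads = [
    ("r1", 100, "ACGTAC"),
    ("r2", 102, "GTACGG"),
    ("r3", 101, "CGAACT"),
    ("r4", 200, "TTTT"),
]
print(group_reads_by_base(toy_reads, 103))  # {'T': ['r1', 'r2'], 'A': ['r3']}
```

Because each long read usually spans several nearby variants, splitting reads on one heterozygous site tends to separate the two haplotypes across the whole visible region.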
In Euan Ashley’s lab at Stanford, researchers are studying cardiac disease genes using SMRT Sequencing. Graduate student Alexandra Dainis presented their use of targeted Iso-Seq to phase cardiac disease genes. They were interested in using PacBio long-read sequencing because, unlike short-read sequencing, it can capture multiple SNPs or mutations on a single sequencing read and provide phased genetic information without the need for familial sequencing or inferential phasing from population data. Dainis discussed their work in hypertrophic cardiomyopathy, an autosomal genetic disorder that remains a leading cause of sudden death in young adults. Phasing disease-causing mutations may reveal disease-associated haplotypes that could be targets for new genetic therapies. The team has phased two sarcomeric genes (MYH7 and MYBPC3) in 10 left-ventricular heart RNA samples, from both controls and diseased hearts, and used this data to phase exonic disease-causing mutations and common SNPs into haplotypes for each sample. Their goal is to proceed to the development of new, haplotype-specific therapeutics.
Continuing the theme of human studies, Tina Graves-Lindsay from the McDonnell Genome Institute at Washington University School of Medicine spoke about plans to provide additional allelic diversity to the current human reference sequence by generating high-quality, highly contiguous human genome assemblies of individuals representing diverse populations. To date, they have sequenced seven diploid genomes. Their strategy involves generating deep coverage of PacBio sequence and scaffolding using optical mapping or cross-linking technologies to give even larger, chromosome-level information. This strategy also involves the use of large insert clone sequencing in targeted regions, which are typically not resolved in the whole genome assemblies.
Jason Underwood, who is both a principal scientist at PacBio and a senior fellow at the University of Washington, talked about the challenges posed by segmental duplications. An important source of genetic instability, they are associated with both rare and common diseases and can provide seeds for evolutionary innovation. UW used the Iso-Seq method to yield full-length transcript information and distinguish between gene copies with more than 99% sequence homology. Their approach uses complementary biotinylated oligonucleotide probes to enrich for duplicate genes from cDNA. They designed probes to 20 gene families that underwent duplications specifically on the human lineage since divergence from chimpanzee. Sequence analysis of captured cDNA from fetal and adult brain revealed mean transcript sizes ranging from 1,200 bp to 2,300 bp with transcripts up to 4 kb identified with high confidence. Among the human-specific duplications, they observed new isoforms, including novel sites of transcription initiation and polyadenylation, as well as previously unannotated open-reading frames, indicating that potentially novel human-specific brain mRNAs have previously been missed by short-read profiling.
The talks also included several studies of plants and animals. Stephen Mondo of the Joint Genome Institute focused on epigenetics, specifically N6-methyldeoxyadenine (6mA), which has only been found in four species: the alga Chlamydomonas reinhardtii, Drosophila melanogaster, C. elegans, and Mus musculus. Despite appearing at low levels, 6mA is critical for proper development, as it plays an important role in regulating gene expression. JGI scientists conducted the first kingdom-wide exploration of 6mA in fungi, where they found abundant utilization of 6mA in early diverging fungi, with up to 2.8% of all adenines methylated, vastly exceeding the levels observed in other organisms. Their results demonstrated the importance of 6mA as a broadly conserved epigenomic mark in eukaryotes and implicate 6mA as an epigenomic mark transmissible across nuclear division.
Amanda Larracuente from the University of Rochester talked about work in Drosophila looking at satellite DNA (satDNA), large blocks of tandem repeats that accumulate in heterochromatic genomic regions with low recombination, such as near centromeres and on Y chromosomes. Using SMRT Sequencing and multiple algorithms and parameter combinations to determine the optimal assembly approaches for heterochromatic regions rich in satDNA, they revealed the structure of complex satDNA loci with unprecedented resolution. These assemblies are providing a platform for evolutionary and functional genomic studies of satDNAs and other repeat-rich regions of the genome.
We also heard about work with wine grapes from Dario Cantu at the University of California, Davis. The genomes of the grapes and their microbial communities can shed light on beneficial organisms and how to avoid infestations that can kill these high-value crops. Deep sequencing of rRNA and metagenomes has allowed UC Davis to characterize the microbial communities in the vineyard, while whole-genome shotgun sequencing provided them with the references necessary to apply metatranscriptomics and profile gene expression of all interacting organisms simultaneously, including the grapevine host. The highly heterozygous genome of Cabernet Sauvignon was sequenced at 140x coverage with the PacBio RS II using a combination of 20 kb and 30 kb DNA libraries, producing an assembly with a contig N50 of 2.17 Mb. SMRT Sequencing was also used to sequence the genomes of some of the most common and economically important grape pathogens. For most fungal species, entire chromosomes were reconstructed into single-contig, telomere-to-telomere assemblies.
Tim Smith from the USDA Agricultural Research Service gave an update on work with the goat genome as a model for chromosome-scale assemblies. He pointed out that highly fragmented short-read assemblies impede downstream applications. That’s why their work for de novo assembly of the domestic goat (Capra hircus) is based on PacBio long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. The assembly represents a better than 250-fold improvement in contiguity compared to the previously published C. hircus assembly, and resolves many repetitive structures, including the most complete repeat family and immune gene complex representation ever produced for a ruminant species.
Our chief scientific officer, Jonas Korlach, capped off the day with a talk about how SMRT Sequencing is enabling a future of high-quality genomes, transcriptomes, and epigenomes. He said that scientific papers using SMRT Sequencing technology are being published at a rate of 25-30 per week, with more than 650 so far this year. He noted that SMRT Sequencing is now established as the gold standard for closing bacterial genomes and that there has also been an explosion in its use for methylation detection in bacteria, unleashing a new era in bacterial methylomes. He congratulated PacBio users who belong to the “1 MB Contig Club,” which now extends to characterizing transcriptomes. He also highlighted recent work on maize and sorghum as well as human genomes, where we’ve seen a number of high-quality assemblies from various ethnic populations. Korlach highlighted the differences between contigs and scaffolds, and how long strings of unknown bases in assemblies dramatically alter their utility.
We’d like to thank our host, Jodi Puglisi from Stanford, as well as the partners present at the event: Advanced Analytical Technologies, Computomics, Covaris, Diagenode, DNAnexus, PerkinElmer, and Sage Science.
Today we are pleased to release the first Arabidopsis thaliana (Ler-0) dataset and de novo genome assembly generated with the Sequel System, using two SMRT Cells and 12 hours of runtime. Only three years ago, we released our first genome assembly [1] for Arabidopsis produced on the PacBio RS II using P4-C2 chemistry, 85 SMRT Cells and 255 hours of runtime. Four months later, we released a second Arabidopsis dataset [1] using the improved P5-C3 chemistry, which reduced the number of SMRT Cells to 46 and runtime to 138 hours.
We produced this Sequel dataset using our latest chemistry enhancements, which significantly reduce the amount of DNA required. Prior to these chemistry improvements, the amount of DNA needed to run many large genome projects on the Sequel System was prohibitive. These modifications enable the use of loading concentrations equivalent to PacBio RS II levels.
Details of the Library Protocol, Data Generation, and Assembly Process
Purified Arabidopsis (Ler-0) genomic DNA was sheared to an average size of 32 kb and converted to SMRTbell templates, followed by a 20 kb size selection performed on a BluePippin system (Sage Science). Each SMRT Cell was loaded at an on-plate concentration of 144 pM of library and run for 6 hours on the Sequel System using the modified chemistry. Collectively, the two SMRT Cells produced 10.8 Gb of data, contained in 1.1 million reads, with half of the data in reads greater than 16,400 bp in length. The data were assembled with HGAP4 in SMRT Link.
Results of Sequel System Arabidopsis genome assembly
|Platform||PacBio RS II (P4-C2)||PacBio RS II (P5-C3)||Sequel System|
|Release date||Sept 2013||Jan 2014||Sept 2016|
|Number of SMRT Cells||85||46||2|
|Run Time (hrs)||255||138||12|
|Number of Bases (Gb)||11.0||15.9||10.8|
|Number of Reads (M)||4.25||2.30||1.14|
|Read Length N50 (bp)||7,700||11,900||16,400|
|Platform||PacBio RS II (P4-C2)||PacBio RS II (P5-C3)||Sequel System|
|Release date||Sept 2013||Jan 2014||Sept 2016|
|Assembly Size (Mb)||121.7||124.5||122.9|
|Contig N50 (Mb)||6.2||6.7||10.4|
|Max Contig Length (Mb)||13.0||13.2||15.0|
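One way to read the run statistics in the first table above is as yield per SMRT Cell and per hour of run time. The short Python sketch below does that arithmetic. The numbers are copied from the table, the chemistry labels follow the text of this post, and the `per_unit_yield` helper is a hypothetical name used only for this example.

```python
# (SMRT Cells, run time in hours, bases in Gb), copied from the table above.
RUNS = {
    "PacBio RS II (P4-C2)": (85, 255, 11.0),
    "PacBio RS II (P5-C3)": (46, 138, 15.9),
    "Sequel System": (2, 12, 10.8),
}


def per_unit_yield(runs):
    """Derive hours per SMRT Cell, Gb per SMRT Cell, and Gb per hour for each run."""
    summary = {}
    for name, (cells, hours, gigabases) in runs.items():
        summary[name] = {
            "hours_per_cell": hours / cells,
            "gb_per_cell": gigabases / cells,
            "gb_per_hour": gigabases / hours,
        }
    return summary


for name, stats in per_unit_yield(RUNS).items():
    print(
        f"{name}: {stats['hours_per_cell']:.0f} h/cell, "
        f"{stats['gb_per_cell']:.2f} Gb/cell, {stats['gb_per_hour']:.2f} Gb/hour"
    )
```

By this arithmetic the Sequel run yields about 5.4 Gb per SMRT Cell, compared with roughly 0.13 Gb and 0.35 Gb per SMRT Cell for the two PacBio RS II datasets.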
The raw and assembled data are publicly available for download.
De novo assembly of an Arabidopsis genome with SMRT Sequencing is not as groundbreaking as it was three years ago. However, this model organism data release demonstrates that, with these latest improvements, the Sequel System allows for the routine generation of high-quality assemblies of large, complex eukaryotic genomes. The modified chemistry is currently in testing and will be made available broadly once testing completes.
1. Kim, K. E. et al. (2014) Long-read, whole-genome shotgun sequence data for five model organisms. Scientific Data. 1, 140045.