The May issue of Genome Research is a special edition focusing on advances in sequencing technologies and genome assembly techniques. The research papers selected for this special issue cover reference-grade genome assemblies, structural variant detection, diploid assemblies, and other features enabled by new high-quality sequencing tools.
The issue kicks off with a perspective from NHGRI’s Adam Phillippy, who reflects on the history of sequencing and assembly. Dusting off publications from as early as 1979, he illustrates the remarkable pace of advances in this field for the past four decades. Phillippy has worked with just about every kind of sequence data, so his view of the current landscape is particularly instructive. “The biggest gains in contig lengths have come from single-molecule sequencing,” he writes. “Critically, 10-kb reads are longer than the most common repeats in both microbial and vertebrate genomes and can therefore generate highly continuous assemblies. In fact, the complete reconstruction of bacterial genomes—a process that used to require teams of people—is now automated and routine.” Phillippy also notes that long-read sequencing assemblies have spurred “a renewed interest in repetitive sequences, which can be properly analyzed for the first time” and are “even revealing new variation in the human genome.”
We are very pleased that more than half of the papers in this special issue feature our sequencing data and genome assemblies derived therefrom, underscoring PacBio’s leading role in long-read sequencing and de novo assembly. We congratulate all the authors for their exciting contributions to this special issue and encourage you to review these excellent publications:
- Discovery and genotyping of structural variation from long-read haploid genome sequence data: Scientists used SMRT Sequencing to scan human genomes for structural variants, finding that more than 89% of the variants detected had been missed in the 1000 Genomes Project.
- Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly: An exploration of the latest human reference assembly, which expands the number of alternate loci and for the first time includes sequence coverage of centromeres.
Plant and Animal Genomes
- Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data: This project used SMRT Sequencing data to generate genomes of three relatives of the model plant Arabidopsis thaliana, assembling all three genomes into only a few hundred contigs. Integration of optical mapping and chromosome conformation capture techniques yielded chromosome-scale assemblies of these repetitive plant genomes. The scaffolds even revealed some of the heterochromatic regions that are not present in gold standard reference sequences.
- Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster: Long-read PacBio sequencing allowed scientists to characterize complex satellite DNA regions, which have been challenging to resolve due to their repetitive nature.
- Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications: This analysis of Eurasian crow genomes found that assembling two high-quality genome references with SMRT Sequencing, combined with optical mapping, made it possible to recover missing regions and correct errors in a previous short-read-only assembly.
- An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations: Scientists use SMRT Sequencing of full-length cDNAs for genome annotation of a new wheat genome assembly, identifying protein-coding genes and noncoding RNA genes with high confidence.
New Tools for Long-Read Data
- Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm: Scientists present a new hybrid assembly algorithm to combine short-read and long-read data for optimal accuracy and contiguity.
- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation: Based on Celera Assembler, Canu was designed for long-read data and significantly reduces computational time for genome assembly.
- HINGE: long-read assembly achieves optimal repeat resolution: This assembler focuses on resolving challenging repeats.
- Fast and accurate de novo genome assembly from long uncorrected reads: For long-read assembly, scientists pair Racon with miniasm to rapidly generate high-quality consensus sequences without an error-correction step.
- HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies: This tool performs fast, high-resolution haplotype assembly from data produced by long-read sequencing, short-read sequencing, and other genome analysis technologies.
- HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies: This method calls structural variants from human genomes using short-read and long-read sequence data; tests showed it improved detection rates for several types of variants.
A publication in BMC Genomics upends some of the conventional wisdom about variants that may cause virulence in Mycobacterium tuberculosis. Scientists at San Diego State University used SMRT Sequencing to produce a complete assembly of the pathogen, finding that earlier assemblies encountered problems due to GC bias and repetitive DNA.
“SMRT genome assembly corrects reference errors, resolving the genetic basis of virulence in Mycobacterium tuberculosis” comes from Afif Elghraoui, Samuel Modlin, and Faramarz Valafar. The team used long-read PacBio sequencing on an attenuated strain of M. tuberculosis, which is often compared to a virulent strain to highlight sources of pathogenicity. The same strain was previously sequenced with Sanger technology and published in 2008.
The sequencing process required just two SMRT Cells to achieve an average of 217-fold coverage. Assembly resulted in a single contig. Later, the scientists went back to the data and found that the same sequence results were achieved using results from only one of the SMRT Cells. A comparison of the new assembly to the previous one, as well as to a reference assembly of the virulent M. tuberculosis strain, found that the Sanger assembly overstated the genetic differences between the two microbes.
“Our assembly reveals that the number of H37Ra-specific variants is less than half of what the Sanger-based H37Ra reference sequence indicates, undermining and, in some cases, invalidating the conclusions of several studies,” the authors report. Many of the previous sequencing errors were found in genes known to be repetitive and GC-rich. “Our results constrain the set of genomic differences possibly affecting virulence by more than half, which focuses laboratory investigation on pertinent targets and demonstrates the power of SMRT sequencing for producing high-quality reference genomes,” they add.
Elghraoui et al. note that SMRT Sequencing offers significant advantages in accuracy and read length. “The random error profile of this technology allows for consensus accuracy to increase as a function of sequencing depth,” they write, reporting a QV greater than 60 for their assembly. In addition, the long reads “allowed us to easily and unambiguously capture known structural variants in H37Ra, as well as two novel to the strain.”
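For readers less familiar with quality values, the QV figure quoted above can be unpacked with a short calculation. The sketch below assumes QV follows the standard Phred convention, in which QV = -10 × log10 of the per-base error probability. The function names are hypothetical, and the 4.4 Mb genome length in the usage note is a rough figure for M. tuberculosis used only for illustration, not a value taken from the paper.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability into a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def expected_errors(qv: float, genome_length: int) -> float:
    """Expected number of consensus errors across a genome at a given QV."""
    return qv_to_error_rate(qv) * genome_length
```

Under these assumptions, `qv_to_error_rate(60)` works out to about one error per million bases, so `expected_errors(60, 4_400_000)` comes to roughly four or five errors across a genome of that size.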
These results lead the authors to “advise caution when analyzing GC-rich and repetitive sequences among reference genomes, not to mention draft genomes,” they write. “As de novo assembly can be routinely performed for microbes using single-molecule sequencing, we strongly recommend this for mycobacteria.”
Microbiology fans can find the PacBio team at the upcoming ASM conference in booth #1328.
A new publication in BMC Genomics explores the use of RNA normalization and 5’ cap selection to enhance results from Iso-Seq studies using SMRT Sequencing. Scientists from the University of Edinburgh report that these modifications significantly boosted transcriptome coverage in a study of chicken.
“Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human” comes from lead author Richard Kuo, senior author David Burt, and collaborators. The team chose this project because existing chicken annotation resources have far fewer genes than expected, with very little evidence of alternative splicing. This situation was believed to have stemmed from prior technology limitations.
In earlier studies, “researchers had to choose between low-throughput, costly methods to generate accurate full-length transcript models, such as cDNA cloning or high-throughput, cheaper methods to generate imprecise transcript models, such as short read RNA sequencing,” Kuo et al. write. “The current status of chicken annotation represents a prime example of this trade off.” The annotation has just over 17,000 genes and fewer than 18,000 transcripts, far fewer on both counts than in other vertebrates.
RNA sequencing based on short-read data is particularly challenged in identifying essential transcription characteristics, the authors note: transcript start and termination sites, transcriptional noise, and exon chaining. These problems “are practically eliminated with long read sequencing where the full-length of a transcript may be sequenced in a single read,” they add.
For this project, scientists deployed SMRT Sequencing, tweaking the Iso-Seq protocol to incorporate RNA normalization as well as 5’ cap selection. They analyzed RNA from chicken brain and embryonic tissues, normalizing both libraries but using 5’ cap selection only for embryo samples. They also collected short-read data to compare results.
This approach yielded some 60,000 transcripts and 29,000 genes, including more than 20,000 novel lncRNA transcripts. The team also found nearly 15,000 unmapped reads from both libraries, likely representing “a significant number of genes that are not currently represented in the Chicken annotations due to gaps in the genome assembly,” they report. They compared their findings to results from Thomas et al., an earlier publication using SMRT Sequencing for chicken transcriptome analysis that did not include the modifications. Kuo et al. estimate that their normalization protocol “appears to have provided a transcriptome coverage efficiency of more than 5 times that of the previous study,” they write. “This means that for every SMRT cell used with the normalization method, 5 SMRT cells would be required without normalization to achieve the same amount of transcriptome coverage.”
The team’s new PacBio-based transcriptome “suggests a level of transcriptional complexity that is more consistent with expectations based on the well-characterised human genome,” the scientists conclude. “Using PacBio sequencing to create a high quality transcriptome annotation can correct [underrepresentation] issues that are common in many of the public annotations.”
Richard Kuo will be speaking about this research at the SMRT Leiden conference taking place this week in the Netherlands. Follow along at #SMRTLeiden!
Screening for pathogenic variants associated with polycystic kidney disease is now more accurate and affordable with SMRT Sequencing. A new paper in Human Mutation from scientists at Leiden University Medical Center and other institutes reports the evaluation of long-read PacBio sequencing as a potential replacement for costly, time-consuming Sanger pipelines.
“Detecting PKD1 variants in polycystic kidney disease patients by single-molecule long-read sequencing” comes from lead author Daniel Borràs, senior author Seyed Yahya Anvar, and collaborators. The team notes that previous efforts to get away from conventional tools by implementing short-read sequencing were never successful enough for clinical use. “A genetic diagnosis of autosomal dominant polycystic kidney disease (ADPKD) is challenging due to allelic heterogeneity, high GC-content, and homology of the PKD1 gene with six pseudogenes,” the scientists explain. In earlier studies, ambiguities from short-read sequencing “produced low true positive variant detection rates of 28% to 50% for the duplicated region of PKD1, and many false positives (10%) due to misalignments, low quality alignments and contamination by residual amplification of pseudogenes.”
The team predicted that long-read sequencing could address these issues, and evaluated the technology on 19 previously analyzed samples. They designed long-range PCR products to cover the coding regions of PKD1 and PKD2, and used PacBio’s Long Amplicon Analysis tool to reconstruct alleles from reads 3 kb or longer. Results were compared to those obtained previously from Sanger sequencing (requiring laborious long-range PCR, followed by many nested PCR reactions) and multiplex ligation-dependent probe amplification (MLPA). An initial examination of coverage found that “all PKD1 and PKD2 exons … from 19 ADPKD patients could be completely covered using long-reads.”
Variants detected with SMRT Sequencing and other approaches were compared; scientists determined that 17 high-confidence variants were detected by PacBio but not by Sanger. PacBio sequencing missed one pathogenic insertion, resulting in accurate calls for 18 of the 19 samples tested. “This provided a diagnosis for 94.7% of the patients, resulting in the correct detection of all PKD1 substitutions, single-nucleotide deletions, large deletions, one deletion-insertion, and 3 out of 4 insertions or duplications,” the scientists report.
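The 94.7% figure follows directly from the sample counts in the paragraph above. A minimal sketch of that arithmetic, using only the counts reported there (the function name is hypothetical):

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# 18 of the 19 previously analyzed samples received an accurate call.
diagnostic_yield = percent(18, 19)

# 3 of the 4 insertions or duplications were detected.
insertion_detection = percent(3, 4)

print(diagnostic_yield, insertion_detection)
```

The single missed pathogenic insertion accounts for both the one undiagnosed sample and the one undetected insertion or duplication.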
These results point to SMRT Sequencing as an excellent replacement for older technologies to scan PKD1 and other medically relevant genes for pathogenic variants. “On top of reducing the PCR amplification steps required and limiting the implicit PCR artifacts, single molecule sequencing improves sequence alignments and aids in discriminating between homologous or repeated sequences, such as PKD1 pseudogenes,” the scientists write. “This provides a cleaner dataset for variant calling.”
The scientists conclude, “This method is highly valuable for a diagnostic setting, as it increases the resolution power of clinically relevant but difficult to sequence or to resolve genomic regions.”
Senior author Anvar from Leiden University Medical Center will be presenting at the SMRT Leiden events taking place this week. Follow along at #SMRTLeiden!
It’s DNA Day, the annual celebration of the discovery of the double helix, the completion of the Human Genome Project, and all things genetic. We like to take the opportunity to look back at DNA-based advances from the past year, and progress has been truly stunning. Just when we think it couldn’t get more awe-inspiring, scientists generate new results that prove us wrong.
One of the most impressive feats in the past year has been the proliferation of population-specific, reference-grade human genomes. From the Chinese genome assembly that recovered nearly 13 Mb of sequence missed in GRCh38 and produced new insights around alternative splicing to the diploid Korean genome assembly that detected nearly 12,000 novel structural variants — including several specific to Asian populations — these new resources are showing us how much sequencing must be done to represent the universe of natural human genetic variation. Several other country or population genome projects have reported results or are in the works, and we’re eager to see how this data fills in the blanks to help us better understand the human genome. Structural variation in particular is being detected more comprehensively than ever, with even small amounts of long-read sequencing helping scientists to connect these elements to their likely function.
We’ve also seen compelling work from the plant and animal research community. Just in the past year, scientists have published new high-quality genome assemblies for quinoa and goat, shattering contiguity records even for challenging genomes. In maize, researchers reported new studies that produced accurate gene copy number counts and a more complex transcriptome than anticipated. Alternative splicing was also the focus of a sorghum study. And we were delighted to learn that the Genome 10K (G10K) and Bird 10,000 Genomes (B10K) initiatives announced plans to ramp up their efforts to generate high-quality de novo genome assemblies.
On the microbial front, we were especially fascinated by a new report detailing the epigenetic changes that occur as free-living bacteria morph into symbiotic bacteria associated with a host. There was also a project that investigated how drug-resistance plasmids are swapped across bacterial species by analyzing the entire “mobilome” of carbapenemase-producing Enterobacteriaceae. And since we’re suckers for extremophile research, we couldn’t resist this genome profile of a single-celled diatom living in the Antarctic Ocean.
All of these projects were accomplished with SMRT Sequencing. On DNA Day, we’d like to congratulate the entire research community working to improve our understanding of genomics.
Today is Earth Day, a great time to reflect on the growing trend of conservation genomics. We are proud that many scientists are using PacBio long-read sequencing to help rescue endangered species and preserve delicate ecosystems around the world.
One of the first examples we saw of this approach came from Oliver Ryder at the San Diego Zoo Institute for Conservation Research. Ryder and his team performed SMRT Sequencing for the ‘alalā, a Hawaiian crow, which no longer existed in the wild. In this video, he describes how having a high-quality genome assembly for this bird will have a significant impact on biologists’ ability to breed and reintroduce healthy crows back to their native environment. Ryder is also a founder of the Genome 10K (G10K) project, which aims to create high-quality assemblies for 10,000 vertebrate species as part of a large-scale conservation effort.
We’ve also been impressed by public support for a crowdfunded conservation genomics project — this one for the kākāpō bird, a critically endangered species found only in New Zealand. David Iorns, founder of the Genetic Rescue Foundation, is using SMRT Sequencing to build a reference-grade de novo genome assembly for the bird, followed by resequencing all 125 remaining kākāpōs. These members of the parrot family are facing fertility issues, a major population bottleneck, and other challenges that make a conservation effort necessary to prevent them from going extinct.
Recently, conservation expert Rebecca Johnson from the Australian Museum Research Institute gave a talk on the de novo genome assembly of a koala. This lovable marsupial species has been on the radar of conservation biologists who want to protect it in part because it has a number of unique and interesting features. Johnson used SMRT Sequencing to analyze the 3.6 Gb genome, yielding what she calls the best marsupial assembly to date.
This year, Earth Day is also marked by the first-ever March for Science, including more than 500 marches across the globe to support better research funding and pro-science policies. We’ll be cheering on all the scientists involved in conservation genomics and other important efforts to protect our planet and all the creatures that call it home!
Researchers from the Okinawa Institute of Advanced Sciences published a compelling review article describing several recent clinically relevant projects they have completed using SMRT Sequencing. Released in the journal Human Cell, “Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area” comes from lead author Kazuma Nakano, senior author Takashi Hirano, and collaborators.
The team adopted long-read PacBio sequencing as an alternative to short-read sequencers that missed too many important genomic elements. “PacBio RS II confers four major advantages compared to other sequencing technologies: long read lengths, high consensus accuracy, a low degree of bias, and simultaneous capability of epigenetic characterization,” they write. “These advantages surmount the obstacle of sequencing genomic regions such as high/low G+C, tandem repeat, and interspersed repeat regions.”
The scientists present several examples of how this technology has made a difference in their work. Many of these studies were previously unpublished. While we can’t cover them all, here are a few vignettes that caught our attention:
- The team fully sequenced the genome of the Kurono strain of Mycobacterium tuberculosis, yielding a single, circular contig. GC content was as high as 80% across the genome, which also featured “117 sets of >1000 bp identical sequence pairs.”
- They sequenced the genome of a multidrug-resistant isolate of Acinetobacter baumannii collected in a Nepalese hospital. The assembly, represented in two circular contigs for a chromosome and its plasmid, included several genes conferring drug resistance.
- SMRT Sequencing allowed the scientists to perform de novo assembly and methylation detection for several variants of Leptospira interrogans in a study designed to identify mechanisms underlying virulence in the zoonotic disease leptospirosis.
- A flu study relied on SMRT Sequencing for whole genome analysis of 48 influenza viruses isolated in Okinawa. The study included at least one sample from the H1N1 pandemic in 2009. “Our genomic data set contained temporal and spatial information about the seasonal and pandemic prevalence of flu in Okinawa,” the authors report. “Such insight gleaned will help elucidate the mechanism of acquired resistance to vaccines and drugs and thus inform future drug and vaccine development.”
- The team used long-read sequencing to explore why the incidence of gastric cancer is lower in Okinawa than anywhere else in Japan, despite consistent prevalence of Helicobacter pylori across the country. By conducting whole genome sequencing and methylation detection for eight H. pylori strains, the scientists spotted virulence factor-dependent motifs.
This review demonstrates several of the clinically relevant applications for SMRT Sequencing. PacBio “has significantly impacted basic science and biology and is reaching its influence into the clinical/medical atmosphere,” the scientists write. The technology is “ideal for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetics characterization.”
A preprint from scientists at the University of Florida, Centro de Investigaciones Principe Felipe, and other institutes describes a new analysis tool to help boost quality of transcriptome studies. “SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification” comes from lead author Manuel Tardaguila, senior author Ana Conesa, and collaborators.
The automated pipeline for Structural and Quality Annotation of Novel Transcript Isoforms (SQANTI) was developed as a quality-assessment tool for transcripts discovered with SMRT Sequencing. SQANTI “calculates up to 35 different descriptors of transcript quality and creates a wide range of summary graphs to aid in the interpretation of the sequencing output,” the authors report.
Development of this new pipeline was spurred by the realization that different transcript analysis tools yielded different results, even for the same data set. “As an example, sequencing the mouse neural transcriptome with PacBio long reads, we obtained ~ 80,000, 12,000 and 16,000 different transcripts when applying Tapis, IDP or the ToFU pipeline, respectively,” the scientists write. “Implementing a comprehensive, quality aware analysis of PacBio reads is fundamental at a time when long read transcriptome sequencing is becoming more popular and important conclusions on transcriptome diversity will be drawn from these data.”
SQANTI consists of tools to classify transcripts by comparison to a reference annotation, analyze data by more than 30 metrics, and generate graphs to report results. The team tested it using neural tissue from mice, performing extensive RT-PCR validation to measure transcript expression. PacBio sequencing of the tissue identified many novel transcripts, but “an important fraction of the novel sequences are presumably bioinformatics or retrotranscription artifacts that can be removed by using SQANTI descriptors,” the scientists report.
They also evaluated results against data from short-read sequencing. “A comparison of Iso-Seq over the classical RNA-seq approaches solely based on short-reads demonstrates that the PacBio transcriptome not only succeeds in capturing the most robustly expressed fraction of transcripts, but also avoids quantification errors caused by unaccounted 3’ end variability in the reference,” Tardaguila et al. write. “SQANTI allows the user to maximize the analytical outcome of long read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.”
A new publication in Genome Research shows how the use of SMRT Sequencing, in combination with other technologies, can reveal far more about repetitive DNA and structural variants than short-read sequencing alone. In this paper, scientists compared genome assemblies produced with short reads, long reads, and optical maps to understand the performance of each approach.
From Uppsala University, the University of Munich, and Bionano Genomics, the team studied the Eurasian crow for this project. The resulting paper, “Combination of short-read, long-read and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications,” comes from lead author Matthias Weissensteiner, senior author Jochen Wolf, and collaborators. They used an existing short-read assembly and generated a de novo PacBio long-read assembly and an optical map with Bionano Genomics, all from the same individual.
The PacBio assembly alone delivered a major improvement over the short-read assembly. Contiguity increased by almost 90-fold, with the long-read assembly featuring a contig N50 longer than 8.5 Mb. The SMRT Sequencing assembly also resolved more than 70 Mb of sequence missed in the short-read assembly, including nearly 16 Mb of repetitive elements.
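Contig N50, the contiguity metric cited above, is the contig length at which contigs of that length or longer contain at least half of all assembled bases. The sketch below shows one common way to compute it. The function name and the toy contig lengths in the usage note are hypothetical and are not taken from the crow assemblies.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")
```

For a toy assembly with contigs of 2, 3, 4, 5, and 6 Mb, `n50([2, 3, 4, 5, 6])` returns 5, because the 6 Mb and 5 Mb contigs together hold 11 of the 20 Mb. Fewer, longer contigs push this number up, which is why N50 is a convenient shorthand for contiguity.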
The various assemblies were then compared and joined to determine how each source of information contributed to a final, high-quality genome resource. This step allowed the team to spot mis-assemblies, which occurred more frequently in the short-read assembly. They detected 43 mis-joins in the short-read assembly, and fewer than half that number in the long-read assembly.
One of the motivating factors for this project was an interest in understanding the repetitive DNA associated with constitutive heterochromatin, which has an influence on recombination. To that end, the team analyzed large tandem repeat arrays in the crow genome and used population resequencing data to estimate effects on recombination rate. “We characterized 36 previously unidentified large repetitive regions in the proximity of sequence assembly breakpoints, the majority of which contained complex arrays of a 14-kb satellite repeat or its 1.2-kb subunit,” the scientists report. They determined that the recombination rate was “significantly reduced in these regions.”
“Our results demonstrate the potential of combining independent technologies to discover previously inaccessible genomic features,” Weissensteiner et al. write. “With an emerging picture of genome architecture affecting the distribution of genetic diversity across genomes, the integration of large tandem repeat arrays into genome assemblies constitutes an important improvement.”
We’re delighted to see the release of another high-quality avian genome, which will support ongoing efforts in the B10K and G10K projects to represent as many species as possible.
We are excited to announce our 2017 series of SMRT Community Events and User Group Meetings (UGMs), where you can learn first-hand how members of the scientific community are leveraging the latest capabilities of SMRT Sequencing for a growing number of applications.
Our vibrant community of users is enthusiastic about sharing tips, exchanging ideas, and developing new applications. These upcoming events will facilitate just that, and we hope you can join us!
We are now accepting registrations for SMRT Leiden, the Americas East Coast UGM, and the Asia Pacific UGM. Please save the date for our Americas West Coast UGM, SMRT Developers Meeting, and EMEA UGM, taking place later in the year.
- May 2 – 4, SMRT Leiden: SMRT Scientific Symposium & Informatics Developers Meeting, Leiden, The Netherlands
- May 31 – June 1, APAC User Group Meeting, Seoul, South Korea
- June 27 – 28, Americas East Coast User Group Meeting & Workshops, Baltimore, MD
- September 6 – 7, Americas West Coast User Group Meeting & Workshops, Palo Alto, CA
- Fall 2017, SMRT Informatics Developers Meeting, TBD, MD
- November 2 – 3, EMEA User Group Meeting, Barcelona, Spain
Call for Speakers
Our scientific advisory committee is currently reviewing speakers for the East Coast UGM & Workshops. If you are interested in sharing your latest research, please submit a proposal when you register. The deadline for consideration is Wednesday, May 17.
We look forward to seeing you at our upcoming SMRT Community Events & User Group Meetings!
A new preprint offers an enticing look at transcriptome results from analysis of a hummingbird using SMRT Sequencing. In this study, scientists found new clues to explain unique attributes of the bird’s metabolism. The work was made possible through full-length isoform sequencing, which allowed deep, assembly-free analysis even though no reference genome was available.
“Single molecule, full-length transcript sequencing provides insight into the extreme metabolism of ruby-throated hummingbird Archilochus colubris” is now available on bioRxiv. From Rachael Workman, Alexander Myrka, Elizabeth Tseng, William Wong, Kenneth Welch, and Winston Timp, the paper describes a project designed to better understand how hummingbirds switch metabolic gears to focus on sugars or lipids as needed. “This metabolic flexibility is remarkable both in that the birds can switch between exclusive use of each fuel type within minutes,” they write, “and in that de novo lipogenesis from dietary sugar precursors is the principle way in which fat stores are built, sometimes at exceptionally high rates, such as during the few days prior to a migratory flight.”
The team used the Iso-Seq method with long-read PacBio data to generate full-length isoform sequences, focusing on the liver of Archilochus colubris. According to the paper, this represents “the first high-coverage transcriptome of any single avian tissue.” They also aligned transcripts to Calypte anna, a recently completed hummingbird assembly that also made use of SMRT Sequencing.
Workman et al. report that the use of long-read PacBio data allowed for more accurate views of isoforms and alternative splicing, even without a reference genome. “Using full-length transcript data, we found alignment unnecessary to generate clear pictures of the gene isoforms,” they note. “The long reads negate the need for transcript assembly, a precarious analysis in the absence of a genome.” Nearly half of the reads in the final analysis covered full-length genes, including the 5’ and 3’ ends as well as the polyA tail.
The team used the COGENT pipeline to assign transcripts to gene families and focus on unique isoforms. “COGENT is specifically designed for transcriptome assembly in the absence of a reference genome, allowing for isoforms of the same gene to be distinctly identified from different gene families,” the scientists write. Their analysis generated a highly diverse set of isoforms, which the authors believe “represents a nearly complete transcriptome of the hummingbird liver.”
With that dataset, the scientists found genes unique to hummingbird. “These genes showed a specific enrichment for pathways involved in lipid metabolism — suggesting that the hummingbird has evolved variants of these genes to achieve its high levels of metabolic efficiency,” they report.
The scientists note that follow-up functional assays will be an important next step in understanding and verifying the function of many genes of interest.
We’re excited to be heading to Washington, DC, for the annual meeting of the American Association for Cancer Research. The PacBio team always enjoys hearing about the latest in cancer translational research at AACR, along with thousands of leading scientists in the field.
Many of those scientists have already learned that SMRT Sequencing provides a unique view into cancer, revealing structural variation, phasing distant variants, and delivering full-length isoform sequences. With uniform coverage, industry-leading consensus accuracy, and reads extending to tens of kilobases, PacBio long-read sequencing gives researchers the ability to monitor and make sense of even the most complex changes in tumor DNA.
If you’ll be attending AACR, stop by booth #1617 to get a first look at the new Integrative Genomics Viewer (IGV v3) featuring greatly improved support for SMRT Sequencing data. We’ll be demonstrating the new features in IGV v3 with a PacBio whole genome sequencing dataset (the SK-BR-3 Human Breast Cancer cell line). Visit us to see how PacBio data visualized in IGV v3 reveals the hidden landscape of somatic structural variants in a cancer genome including translocations, gene fusions, and novel mobile element insertion sites.
In addition, check out these posters from our scientists to see SMRT Sequencing data in action for cancer studies:
SMRT Sequencing of Full-length Androgen Receptor Isoforms in Prostate Cancer Reveals Previously Hidden Drug Resistant Variants
Tyson Clark, Ph.D., PacBio
Sunday, April 2, 1 p.m. – 5 p.m., Abstract #425/25, Section 17
Simplified Sequencing of Full-length Isoforms in Cancer on the PacBio Sequel System
Meredith Ashby, Ph.D., PacBio
Monday, April 3, 1 p.m. – 5 p.m., Abstract #2442/29, Section 17
Detection of Low-frequency Somatic Variants using Single-molecule, Real-time Sequencing
Primo Baybayan, PacBio
Wednesday, April 5, 8 a.m. – 12 p.m., Abstract #5366/22, Section 15
Finally, we’ll be launching a new SMRT Grant program at AACR. Just tell us how full-length isoform sequencing of your cancer samples will drive new discoveries in your research for a chance to win library construction, PacBio sequencing, and bioinformatics analysis for your project. Check out the rules and submit your 250-word proposal by May 15. Many thanks to our partner GENEWIZ for helping us make this grant program possible!
The recent beta release of version 3 of the popular genome browser IGV greatly improves support for PacBio data. The long reads (up to 50 kb) and random error profile of PacBio SMRT® sequencing facilitate new applications in genome assembly, structural variant discovery, and haplotype phasing. These unique properties and applications benefit from customized data visualization.
IGV 3 extends support for PacBio long reads with: performance improvements to enable viewing variants at multi-kilobase scales; a “quick consensus” mode that suppresses single read random errors; labels for large insertion and deletion structural variants; and “group by base” to explore haplotype phase. The new capabilities are featured in a 4-minute tutorial video. To try them yourself, install IGV 3, and then load this IGV session (File > Open Session) with a sample dataset of 70-fold sequencing of a human genome, HG002 from NIST Genome in a Bottle.
It is visually challenging to identify biological variation (single nucleotide variants occur about every 1,000 basepairs in humans) among the more frequent sequencing errors in PacBio reads. However, because PacBio errors are random, quality is extremely high in a consensus of independent reads, often surpassing the quality from next-generation sequencing. A mismatch that is consistent across reads indicates biological variation.
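To see why random errors wash out in a consensus, here is a rough back-of-the-envelope sketch. It assumes each read errs independently at a given position with the same probability, and it asks how likely it is that a strict majority of reads is wrong there. The 15% per-read error rate and the depths used below are illustrative assumptions, not PacBio specifications, and the function name is hypothetical.

```python
from math import comb

def majority_error_probability(depth, per_read_error):
    """Probability that more than half of `depth` independent reads are wrong
    at one position, assuming each read errs with probability `per_read_error`.

    This is a simplified binomial model: it ignores that wrong reads would also
    have to agree with each other, so it overstates the real consensus error.
    """
    first_majority = depth // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(first_majority, depth + 1)
    )

# Illustrative values only (assumed, not measured).
single_read = majority_error_probability(1, 0.15)
ten_fold = majority_error_probability(10, 0.15)
thirty_fold = majority_error_probability(30, 0.15)
```

Under these assumptions the chance of a wrong majority call drops steeply as depth grows, which is the intuition behind trusting mismatches that are consistent across reads.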
IGV has added two features, “quick consensus mode” and “hide indels”, to reveal biological variation in PacBio reads. The quick consensus mode shows mismatches only at positions where more than a specified fraction of reads disagrees with the reference (recommended setting: 25%). The logic is the same as used by the coverage track. The “hide indels” feature (recommended setting: <10 bp) suppresses the most common error in raw PacBio reads, random small indels. Both features are available in the “Alignment” tab of the IGV preferences (View menu > Preferences).
Figure 1. Quick consensus mode and indel hiding reveal biological variation in PacBio long reads. (a) Both quick consensus and indel hiding are available in the “Alignment” tab of the IGV preferences (View menu > Preferences). Recommended settings are to hide mismatches with an allele fraction below 25% of coverage and indels shorter than 10 bases. (b) Raw PacBio reads with no consensus error correction. (c) The same read data with consensus mode and indel hiding activated reveals a number of homozygous and heterozygous single nucleotide variants.
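The filtering logic described above can be sketched in a few lines of Python. This is a minimal illustration and not IGV's actual implementation: it assumes reads are already aligned without gaps, and the function names and the toy data are hypothetical.

```python
from collections import Counter

def quick_consensus_mismatches(reference, reads, min_fraction=0.25):
    """Return {position: Counter of non-reference bases} for positions where
    more than `min_fraction` of the covering reads disagree with the reference.

    `reads` is a list of (start, sequence) tuples, assumed to be gapless
    alignments to `reference` (a simplification for illustration).
    """
    depth = Counter()
    mismatches = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference):
                break
            depth[pos] += 1
            if base != reference[pos]:
                mismatches.setdefault(pos, Counter())[base] += 1
    # Keep only positions whose mismatch fraction exceeds the threshold.
    return {
        pos: counts
        for pos, counts in mismatches.items()
        if sum(counts.values()) / depth[pos] > min_fraction
    }

def hide_small_indels(indel_lengths, min_length=10):
    """Drop indels shorter than `min_length` bases, keeping the rest."""
    return [n for n in indel_lengths if n >= min_length]

reference = "ACGTACGTAC"
reads = [
    (0, "ACGTACGTAC"),
    (0, "ACGAACGTAC"),  # mismatch at position 3
    (0, "ACGAACGTAC"),  # mismatch at position 3
    (0, "ACGTACGTCC"),  # lone mismatch at position 8
    (0, "ACGTACGTAC"),
]
shown = quick_consensus_mismatches(reference, reads)
```

In this toy example, position 3 is kept because 2 of 5 reads (40%) disagree with the reference, while the single mismatch at position 8 (20%) is suppressed as likely random error.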
Each human genome has approximately 20,000 structural variants (differences ≥50 basepairs with the reference), most of which require PacBio long reads to detect. For variants contained in a single read alignment, IGV 3 adds an option to “label large indels”, which lists the basepair size of the variant on a colored block whose width is proportional to the size of the indel. The “label large indels” feature is available in the “Alignment” tab of the preferences (View menu > Preferences). The recommended setting is to label indels larger than 10 basepairs.
Figure 2. Label insertion and deletion structural variants. (a) The option to label large indels is available in the “Alignment” tab of the IGV preferences (View menu > Preferences). Recommended settings are to label indels larger than 10 bases. (b) An insertion larger than the defined threshold is indicated by a purple box. The width of the box is proportional to the size of the insertion, and the basepair size is written on the box if it fits. (c) A deletion is indicated by a black line. The basepair size of the deletion is written on a white box at the center of the line. Examples are from HG002 sequenced by Genome in a Bottle.
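For readers who want to pull the same information out of an alignment file, large indels contained in a single alignment can be read directly from the SAM CIGAR string. The sketch below is an illustration of that idea rather than how IGV does it internally; the function name, the example CIGAR, and the start coordinate are made up for demonstration.

```python
import re

# SAM CIGAR operations: a length followed by one operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def large_indels(cigar, ref_start, min_size=10):
    """Return (kind, reference_position, size) for each insertion or deletion
    in `cigar` that is larger than `min_size` bases."""
    pos = ref_start
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length > min_size:
            kind = "insertion" if op == "I" else "deletion"
            found.append((kind, pos, length))
        if op in REF_CONSUMING:
            pos += length
    return found

# Hypothetical alignment starting at reference position 1000.
events = large_indels("500M3I200M120D300M64I100M", 1000)
```

Here the 3-base insertion falls under the threshold and is skipped, while the 120-base deletion at position 1700 and the 64-base insertion at position 2120 are reported.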
For reads with very large structural variants or which contain inversions, mappers like BWA produce separate primary and supplementary alignments. IGV 3 adds an option to “link supplementary alignments” to visually connect separate alignments from the same read. Reads that have alignments to both strands, which can indicate an inversion, are colored turquoise. “Link supplementary alignments” is available in the right-click menu for each alignment track.
Figure 3. Link alignments from the same read. (a) The option to link primary and supplementary alignments from a read is available in the right click menu for the alignment track. (b) When “link alignments” is active, separate alignments from the same read are drawn on a single row and connected by a thin line. Reads with alignments to both strands, which can indicate an inversion, are colored turquoise. The example shows reads that support an inversion in HG002.
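The linking idea can also be sketched outside of IGV. In the SAM format, flag bit 0x800 marks a supplementary alignment and bit 0x10 marks a reverse-strand alignment, so grouping records by read name and checking strands is enough to spot candidates. The record layout and function name below are assumptions for illustration, and a read with alignments on both strands is only a hint of an inversion, not proof.

```python
from collections import defaultdict

REVERSE = 0x10         # SAM flag bit: read aligned to the reverse strand
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment

def link_alignments(records):
    """Group alignments by read name and note whether both strands are used.

    `records` is an iterable of (read_name, flag, chrom, start) tuples,
    a simplified stand-in for parsed SAM/BAM records.
    """
    by_read = defaultdict(list)
    for name, flag, chrom, start in records:
        strand = "-" if flag & REVERSE else "+"
        by_read[name].append((chrom, start, strand, bool(flag & SUPPLEMENTARY)))
    linked = {}
    for name, alignments in by_read.items():
        alignments.sort(key=lambda a: (a[0], a[1]))
        strands = {a[2] for a in alignments}
        linked[name] = {
            "alignments": alignments,
            "both_strands": len(strands) == 2,  # possible inversion signal
        }
    return linked

# Hypothetical records: read1 has a reverse-strand supplementary alignment.
records = [
    ("read1", 0, "chr1", 1000),
    ("read1", SUPPLEMENTARY | REVERSE, "chr1", 9000),
    ("read2", 0, "chr1", 1200),
]
linked = link_alignments(records)
```

In this example read1 would be drawn on one row with both pieces connected and flagged for a closer look, while read2 has a single forward-strand alignment.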
PacBio long reads can span multiple single nucleotide and structural variants, which directly phases the variants into haplotypes. To support visual exploration of haplotypes, IGV 3 adds an option to “group by base,” which categorizes reads by the basepair at a selected position. “Group by base” is available by right clicking on the basepair position by which to group. IGV 3 also includes performance improvements that enable variation to be shown at zoom levels of 10 kb and larger, which is critical to view haplotype structure.
Figure 4. Explore haplotype phase by grouping alignments by basepair. (a) The option to group alignments by the basepair at a selected position is available in the right click menu for the alignment track. (b) Ungrouped alignments from a locus in HG002 with a heterozygous deletion and several heterozygous single nucleotide variants. (c) Grouping the alignments by a heterozygous single nucleotide variant reveals two clear haplotypes.
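Conceptually, “group by base” is a simple partition of reads by the base they carry at one position. The following is a minimal sketch of that idea, assuming gapless alignments and using hypothetical read names and sequences; it is not taken from the IGV source.

```python
from collections import defaultdict

def group_by_base(reads, position):
    """Partition reads by the base they carry at `position`.

    `reads` is a list of (name, start, sequence) tuples, assumed to be
    gapless alignments. Reads that do not cover the position are grouped
    under None.
    """
    groups = defaultdict(list)
    for name, start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):
            groups[seq[offset]].append(name)
        else:
            groups[None].append(name)
    return dict(groups)

# Hypothetical reads around a heterozygous site at position 2 (G or C).
reads = [
    ("r1", 0, "ACGTA"),
    ("r2", 0, "ACCTA"),
    ("r3", 2, "GTAC"),
    ("r4", 2, "CTAC"),
    ("r5", 4, "AC"),  # starts after the site, so it cannot be assigned
]
groups = group_by_base(reads, 2)
```

With long reads, each group tends to carry its own set of neighboring variants, which is what makes the two haplotypes visible once the reads are sorted this way.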
To utilize the new capabilities, install IGV 3, and then load this IGV session (File > Open Session) with a sample dataset of 70-fold sequencing of a human genome, HG002. Congratulations to Jim Robinson, Helga Thorvaldsdóttir, and the rest of the IGV team and community for the release of IGV 3!
 Robinson JT, et al. (2011). Nat Biotechnol, 29(1):24-6.
 Zook JM, et al. (2016). Sci Data, 3:160025.
 Roberts RJ, et al. (2013). Genome Biol, 14(7):405.
 Chin CS, et al. (2013). Nat Methods, 10(6):563-9.
 Chaisson MJ, et al. (2015). Nature, 517(7536):608-11.
 Chin CS, et al. (2016). Nat Methods, 13(12):1050-4.
A new PLoS One publication cites the use of SMRT Sequencing to clarify the transmission path of infection in a transplant recipient. This work is an excellent example of the clinical utility offered by long-read PacBio sequencing.
The project was spurred by the frustrating inability to distinguish between hospital-acquired infections and donor-to-recipient infections through solid organ transplants. Scientists and clinicians from the Icahn School of Medicine at Mount Sinai and the University of Texas Medical School teamed up to apply advanced sequencing technologies in the case of a liver transplant recipient infected with vancomycin-resistant Enterococcus. In their report, lead author Ali Bashir, senior author Shirish Huprikar, and collaborators describe the use of whole genome sequencing to pinpoint the likely means of infection.
The scientists note that cultures taken during the donor’s hospitalization prior to death were negative for Enterococcus until days after the transplant occurred. They analyzed bacterial samples from the donor, the recipient, and hospital isolates collected at the same time using SMRT Sequencing technology and other methods. “Automated de novo construction of high-quality bacterial genomes using long-read whole genome sequencing (WGS) is a powerful tool that can aid in donor transmission epidemiology,” they write.
The resulting Enterococcus genome assemblies “were highly contiguous; in all cases, the assemblies contained fewer than 10 contigs with the largest contig representing more than 50% of the total genome length,” the scientists report. They produced a phylogenetic tree for the samples and found that the bacterial genomes collected from donor and recipient were most closely related. However, other types of analysis — such as multilocus sequence typing and pulse-field gel electrophoresis — generated more ambiguous results. “Only the full de novo assembly was able to clarify the unique structural differences between the donor and recipient isolate,” Bashir et al. report. “Our data suggest that WGS may be increasingly necessary to unambiguously confirm transmission for structurally mutable genomes.”
Because long-read sequencing is uniquely able to resolve large structural elements, the scientists suggest that it will become more commonly used for studies like this one. “We expect that WGS and assembly of pathogen genomes will be increasingly important not only for understanding pathogen biology and evolution, but also become a routine and essential tool for investigation of potential organ transplant transmissions in many settings,” they conclude.
Blog readers may recall that last year’s SMRT Grant winner was Renying Zhuo from the Chinese Academy of Forestry. We’re pleased to report that the project is now complete!
Zhuo proposed sequencing the genomes of two strains of the Sedum alfredii plant from the same ecosystem — one that accumulates cadmium ions from polluted soil and one that doesn’t. The goal was to use high-quality assemblies for comparative genomic analysis to determine the genetic mechanisms responsible for this remediation effect.
Plant DNA was sequenced on the Sequel System by RTL Genomics, and genome assembly was performed by Computomics. (We’re also grateful to Sage Science and Experiment, the other co-sponsors of the SMRT Grant program, for helping make this a worldwide, democratic event.) Both plant genomes made it into the “1 Mb contig N50 club” (#1MbCtgClub on Twitter), with contig N50s of 1.08 Mb for Sedum alfredii HE and 1.26 Mb for Sedum alfredii NHZ.
Zhuo and his team will now dive into a deep, detailed comparative analysis between the two genomes to identify genes associated with metal accumulation. Ultimately, they hope the results can be used to improve bioremediation efforts for soils contaminated with heavy metals.
Detailed stats for the two plant assemblies from Computomics:
| Metric | Sedum alfredii HE | Sedum alfredii NHZ |
|---|---|---|
| Total contig size [bp] | 235739357 | 397076979 |
| Longest contig [bp] | 3521758 | 5719050 |
| Contigs > 1 Mb [#] | 74 | 117 |
| N50 contig length [bp] | 1087129 | 1256010 |
| L50 contig count [#] | 65 | 91 |
| BUSCO complete [%] | 88.5 | 90.1 |
| BUSCO complete single copy [%] | 60.9 | 21.8 |
| BUSCO complete duplicated [%] | 27.6 | 68.4 |
| BUSCO fragmented [%] | 2.4 | 1.7 |
| BUSCO missing [%] | 9.1 | 8.1 |
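For anyone less familiar with the N50 and L50 figures in the table, here is a short sketch of how they are conventionally computed from a list of contig lengths. The function name and the toy lengths are hypothetical; the real per-contig lengths for these assemblies are not given here.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the total assembly size.
    L50 is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")

# Toy example: total is 300, so half is 150, reached after the second contig.
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
```

A higher N50 together with a lower L50 means more of the genome sits in fewer, longer contigs, which is why the two values are usually reported side by side.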
Voting is now open for this year’s Plant and Animal SMRT Grant program. Check out the five finalists and cast your vote by April 5!
Efforts to produce a reference-grade goat genome assembly for improved breeding programs have paid off. A new Nature Genetics publication reports a high-quality, highly contiguous assembly that can be used to develop genotyping tools for quick, reliable analysis of traits such as milk and meat quality or adaptation to harsh environments. The program also offers a look at how different scaffolding approaches perform with SMRT Sequencing data.
“Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome” comes from lead authors Derek Bickhart, Benjamin Rosen, and Sergey Koren; senior author Tim Smith; and collaborators. The large team of scientists is affiliated with the USDA Agricultural Research Service, National Human Genome Research Institute, the University of Washington, and many other organizations.
The project was motivated by a clear need to develop methods for high-quality livestock genome assemblies to benefit breeding communities. Goats offer a particular boost to developing countries, where these animals are a primary source of textile fiber, milk, and meat. “A finished, accurate reference genome is essential for advanced genomic selection of productive traits and gene editing in agriculturally relevant plant and animal species,” the scientists report. Previous efforts to sequence the goat genome with short reads resulted in a highly fragmented assembly that could not resolve repetitive and other challenging regions. For this work, the team analyzed the genome of a highly homozygous male San Clemente goat (Capra hircus) using a number of technologies.
They chose SMRT Sequencing because its long reads could characterize even the most difficult genomic regions. “Initial assembly of the PacBio data alone resulted in a contig NG50 … of 3.8 Mb,” the team reports. PacBio contigs were then connected with optical mapping and Hi-C data to create extremely long scaffolds in the final 2.92 Gb assembly. “These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps,” they write. The assembly is 400 times more continuous than the previous short-read assembly.
To learn more about how these technologies complement each other, the scientists analyzed results from optical mapping and Hi-C data separately. They found that Hi-C data yielded a tenth the number of scaffolds that optical mapping did, but it led to more misoriented contigs, which were correlated with restriction site density. “Ultimately, we found that sequential scaffolding with optical mapping data followed by Hi-C data yielded an assembly with the highest continuity and best agreement with the [radiation hybrid] map,” the team reports, noting that this approach is significantly less expensive than generating a short-read draft genome assembly and manually finishing it to high quality.
The final assembly includes notoriously difficult regions, such as centromeric DNA and the Y chromosome. Two chromosomes appear to be completely assembled, and two others seem to include “the elusive p arm,” Bickhart et al. write.
Of course, since the scientists were focused on building a resource that would help breeding programs, they also assessed its potential impact in that space. “Chromosome-scale continuity of the ARS1 assembly was found to have appreciable positive impact on genetic marker order for the existing C. hircus 52K SNP chip,” they report.
Going forward, the team hopes to generate a phased diploid assembly for C. hircus.
Our team of scientist reviewers has considered hundreds of submissions for the latest SMRT Grant award and narrowed the selection to five finalists. Now it’s your turn! We welcome the community to vote for their favorite project now through April 5th. The winner will receive SMRT Sequencing and genome assembly or Iso-Seq analysis sponsored by PacBio and our partners, the Arizona Genomics Institute and Computomics.
Here’s a look at the entries from our five finalists:
Project: Temple Pitviper
Principal investigators: Mrinalini Mrinalini, National University of Singapore; Ryan McCleary, Utah State University; Manjunatha Kini, National University of Singapore
The highly venomous snake Tropidolaemus wagleri, common to southeast Asia, has a number of unique features that merit further study. Its venom contains toxic proteins not found in other species of snake, including a group of novel toxins that have not been well characterized. This reptile also has sex-specific phenotypes, which is unusual for snakes; interestingly, these differences are not seen until the snake reaches sexual maturity, but the biological trigger for this is not understood.
Project: Solar-powered Slug
Principal investigators: Carola Greve, Zoological Research Museum A. Koenig; Alexander Donath, Zoological Research Museum A. Koenig
Scientists propose sequencing the genome of Elysia timida, a Mediterranean sea slug that has the rare ability to consume algae and keep the ingested chloroplasts functioning. Inside the slug, these chloroplasts continue photosynthesis, building up a starch reservoir that can feed the slug for three months. The project aims to scour the genome for genes associated with this unique ability, to understand the mollusk’s eco-friendly biology, as well as the process of incorporating organelles.
Project: Pink Pigeon
Principal investigators: Matthew Clark, Earlham Institute; Cock Van Oosterhout, University of East Anglia
This effort would use the Iso-Seq method to generate the transcriptome of the pink pigeon, an endangered species native to Mauritius. The species suffers high levels of infertility and pathogen susceptibility, possibly related to a population bottleneck. Scientists would use SMRT Sequencing data to study the bird’s loss of genetic variation and to find variants associated with fitness and pathogen resistance.
Project: Explosive Beetle
Principal investigators: Tanya Renner, San Diego State University; Aman Gill, University of California, Berkeley; Wendy Moore, University of Arizona; Kipling Will, University of California, Berkeley; Athula Attygalle, Stevens Institute of Technology
The bombardier beetle (Brachinus elongatulus) is known for its ability to “explosively discharge a toxic mix of quinones, oxygen, and water vapor at over 100°C,” this proposal says. Scientists would sequence the insect’s 500 Mb genome to understand insect chemical biosynthesis and biodiversity. This would represent the first genome sequence for the beetle suborder Adephaga.
Project: Dancing with Dingoes
Principal investigators: Bill Ballard, University of New South Wales; Claire Wade, University of Sydney
This team aims to sequence the 2.5 Gb genome of the Australian dingo and compare it to that of the wild wolf and domestic dog to understand the evolutionary process that led from wild animal to pet. According to the proposal, this project will also “inform aspects of indigenous Australian culture and advance our understanding of the Australian continent’s top-level predator.”
Congratulations to all five finalists for their excellent proposals – may the most interesting genome win! Help support your favorite project now until April 5th.
A new genome assembly has remarkable promise to boost the global food supply. Scientists from King Abdullah University of Science and Technology and other institutions sequenced quinoa, a nutritious grain that can grow in marginal lands and other suboptimal environments. Their assembly offers new clues that could help improve breeding efforts to make the plant more accessible worldwide.
“The genome of Chenopodium quinoa” was published recently in Nature by lead author David Jarvis, senior author Mark Tester, and a large group of collaborators. They focused on this plant, which is believed to have been domesticated more than 7,000 years ago in South America, because it is rapidly becoming accepted as a superfood with potential to address the growing food supply challenge. Quinoa is a relatively low-sugar, gluten-free grain with lots of nutrients. But expanding its use as a crop around the world requires new breeding efforts, the authors report. They used SMRT Sequencing to generate a high-quality, chromosome-scale genome assembly for the allotetraploid plant, a valuable resource that can now be used by breeding programs to produce shorter, higher-yielding plants with increased stress tolerance and other desirable traits.
The team sequenced a plant from coastal Chile, followed by scaffolding with Bionano Genomics and Dovetail Genomics tools. The assembly is 1.39 Gb, represented in fewer than 3,500 scaffolds. Ninety percent of the genome is covered in just 439 scaffolds. “This assembly represents a substantial improvement over the previously published quinoa draft genome sequence, which contained more than 24,000 scaffolds with 25% missing data,” the scientists report. Iso-Seq analysis and other annotation methods resulted in nearly 45,000 gene models, while a BUSCO analysis found that more than 97% of reported genes were included in the assembly. The group also sequenced two diploids from ancestral quinoa relatives.
One of the most exciting findings from the project was the discovery of a transcription factor that is believed to regulate production of saponins, bitter-tasting molecules in the quinoa shell. A premature stop codon found in sweet quinoa strains suggests that it may be possible to breed these saponins out to produce a plant more amenable for farming.
“These resources provide the foundation for accelerating the genetic improvement of the crop, with the objective of enhancing global food security for a growing world population,” Jarvis et al. write.
A recent effort to understand the genetic mechanisms behind swappable elements of drug-resistance among bacteria built on previous studies of Enterobacteriaceae isolates collected at the National Institutes of Health Clinical Center. The work was made possible by high-quality genome assemblies of these organisms generated earlier with SMRT Sequencing technology.
In this project, scientists from the U.S., France, and Brazil teamed up to learn precisely how drug-resistance plasmids are spread from one species to another. They report the results of that investigation in mBio with the publication “Mechanisms of Evolution in High-Consequence Drug Resistance Plasmids” from lead author Susu He, senior author Fred Dyda, and collaborators. The team focused on the full complement of mobile elements (or the “mobilome”) found in carbapenemase-producing Enterobacteriaceae. “The availability of highly accurate plasmid assemblies for these strains based on long-read PacBio SMRT sequencing allows for the unambiguous and precise annotation of mobile elements,” they report.
The scientists analyzed plasmid evolution from isolates collected during an outbreak of carbapenem-resistant Klebsiella pneumoniae at the NIH Clinical Center in 2011 and 2012 as well as from other samples collected at the center over several years. By tracking target site duplications in samples, the team could infer the evolution of drug resistance. “We are able to propose the exact historical molecular events underlying plasmid rearrangements which provide a basis for understanding how antibiotic-resistant strains change over time, with significant implications for combating plasmid-mediated antimicrobial resistance,” they write.
Of course, that raises the question of which evolutionary mechanisms are causing the changes they characterized. The scientists found two mobile element types — IS26 and Tn3 transposons — that appeared to be driving drug resistance evolution in the K. pneumoniae samples studied. However, they note, there was no clear explanation for that discovery. “This analysis revealed that plasmid reorganizations occurring in the natural context of colonization of human hosts were overwhelmingly driven by genetic rearrangements carried out by replicative transposons working in concert with the process of homologous recombination,” the authors report, adding that perhaps this kind of information will one day inform new approaches to combat antibiotic resistance.
“The rapidly decreasing cost of high-quality, long-read sequencing will enable the type of analysis described here to be applied more broadly to the problem of how resistance plasmids evolve in patients, hospitals, and the environment,” the scientists conclude.
Now, with the Sequel System and the recently released protocols for multiplexed microbial genome assembly (template preparation and data analysis), this application is even more accessible for the scientific community.
A recent Nature publication from a large team of scientists in Europe, Canada, and the US reports the use of SMRT Sequencing to elucidate the genome of Fragilariopsis cylindrus, a single-celled eukaryotic diatom adapted to living in polar waters of the Antarctic Ocean. The work has implications for the biotechnology industry, which looks to extremophiles as a potential source of important enzymes.
“Evolutionary genomics of the cold-adapted diatom Fragilariopsis cylindrus” comes from lead author Thomas Mock, senior author Igor Grigoriev, and many collaborators at the University of East Anglia, Earlham Institute, Joint Genome Institute, University of California, Berkeley, and several other organizations. The project investigated how this diatom evolved to thrive in its extreme environment, frequently living in high salinity directly under sea ice.
To achieve this, the team started by sequencing the F. cylindrus genome using both Sanger and PacBio systems. For SMRT Sequencing, the scientists produced two libraries with different insert sizes (4 kb and 20 kb) and ran seven SMRT Cells, which yielded 63-fold coverage of the genome. The team used the diploid-aware FALCON assembler, which generated a 59.7 Mb assembly with 745 primary contigs. In an analysis and comparison to the Sanger assembly, the scientists determined the PacBio assembly was highly accurate in sequence (ranging from 99.65% to 100%) and structure (validated through fosmid comparison).
F. cylindrus is characterized by highly divergent alleles, which represent nearly a quarter of its genome. An analysis of those genes determined that the “divergent alleles were differentially expressed across environmental conditions, including darkness, low iron, freezing, elevated temperature and increased CO2,” the scientists report. “Alleles with the largest ratio of non-synonymous to synonymous nucleotide substitutions also show the most pronounced condition-dependent expression, suggesting a correlation between diversifying selection and allelic differentiation.” The team hypothesized that allele diversification took place after the last glacial period and has been maintained because the variety of gene content allows for rapid adaptation to a changing environment.
The Earlham Institute issued a press release about the project, including this comment from scientist Pirita Paajanen: “This is the first time at EI that a genome of this type was assembled into chromosomes. It is only very recently that the technology has been developed to cope with such a highly heterozygous organism and the data show that this diatom does actually have a large amount of variation within their genes.”