Pacific Biosciences

PacBio blog

This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.

Friday, January 15, 2021

PacBio and Invitae Team Up to Develop Ultra-High-Throughput Clinical Whole Genome Sequencing Platform

The power of PacBio HiFi reads has enabled transformative research into human disease. A new collaboration with Invitae, a leader in medical genetics, is intended to help harness the technology for use in mainstream medicine.

The ability of HiFi reads to detect genetic variants, even in hard-to-sequence regions of the genome, has already shown clinical utility. In a recent research collaboration with Invitae, announced in October 2020, the comprehensive, highly accurate reads were used to explore clinically relevant molecular targets for use in the development of advanced diagnostic testing for epilepsy. 

We are thrilled to announce a new collaboration with Invitae to develop an ultra-high-throughput clinical whole genome sequencing platform. Read more about it here.

Learn more about the benefits and workflows of PacBio whole genome sequencing

Monday, January 11, 2021

Copy Cats: Improved Cat Genomes Highlight Remarkable Similarities Within the Feline Family

Felix, Garfield, Leo the Lion — despite their differences, the genomes of these frisky felines are highly conserved across the family, even among its most divergent members.

A new set of highly contiguous haploid Felis and Prionailurus assemblies provides further proof that although the genomes appear karyotypically distinct, they are grossly collinear, and any cytogenetic differences represent centromere repositioning rather than chromosomal rearrangements.

New cat assemblies, including the Asian leopard cat (Prionailurus bengalensis) reveal similarities between species

A team of researchers at Texas A&M University, led by Kevin R Bredemeyer, Andrew Harris and William J Murphy, worked with colleagues in the United States, China and Russia to create a de novo assembly of a Bengal hybrid cat, as well as phased haplotypes of its parents, a random-bred domestic cat (Felis catus) and an Asian leopard cat (Prionailurus bengalensis).

As reported in the Journal of Heredity, the assemblies offer significant improvements over the previous domestic cat reference genome, with a 100% increase in contiguity and the capture of the vast majority of chromosome arms in one or two large contigs.

Previous diploid-based genome assemblies for the domestic cat suffered from poor resolution of complex and highly repetitive regions, with substantial amounts of unplaced sequence that is polymorphic or copy number variable, the authors noted.

“These difficult to assemble regions are increasingly understood as playing important roles in disease biology, genome organization, gene regulation, and speciation,” they added.

By using highly contiguous PacBio long reads, the team was able to capture complex repetitive regions previously un-spanned due to insufficient read lengths and/or high haplotype divergence, as well as resolve multicopy gene families with high allelic diversity (such as the Major Histocompatibility Locus and olfactory receptors).

“Furthermore, we have provided a genome assembly from a random-bred domestic cat, which is more representative of the domestic cat pet population,” they wrote.

Adding to a growing collection of assembly methods, they also demonstrated that comparably accurate F1 haplotype phasing can be achieved with members of the same species when one or both parents of the trio are not available — an important ability, since F1 interspecies hybrids are rare biological resources, and in many cases it may be logistically impossible to obtain the actual parents of the cross.

As they noted in their paper, cats are not only some of the most popular companion animals; species from the cat family Felidae also serve as a powerful system for genetic analysis of inherited and infectious disease, and the study of domestic cats can help in the conservation of their wild cousins.

“These novel genome resources will empower studies of feline precision medicine, adaptation and speciation,” the authors wrote.

To hear from fellow scientists about their latest plant & animal discoveries, register to attend PAGBio Day, a virtual half-day event on January 19th. Explore how to use PacBio whole genome sequencing for your project.

Thursday, December 31, 2020

The Most Wonderful Webinars of the Year

The year the world went virtual is virtually over, so what better time to reflect on all the great online offerings featuring SMRT Sequencing this year? While we would rather have gathered in exotic locales to see you in person and share our science, 2020 did provide some amazing opportunities to go global and broadcast worldwide.

Here are some of the highlights:


Sequencing 101: How Long-Read Sequencing Improves Access to Genetic Information

What better place to start than with an introduction to our technology, followed by a panel of sequencing experts — Melissa Laird-Smith (@SmithLab_UofL), Michael Hartigan, and Olga Vinnere Pettersson (@OlgaVPettersson) — with some sequencing basics: explaining long reads and their utility, how PacBio long-read sequencing differs from other technologies, and the applications PacBio offers and how they can benefit scientific research.



PacBio Global Summit

Our popular regional user group gatherings were combined into one mega meeting this year, which meant nearly 30 hours of talks across time zones, over the course of two days. With so many sessions, it’s impossible to cover them all here, but suffice it to say, there were many magnificent presentations by our users, plus hands-on workshops, live Q&As, and a meet-and-greet with our new CEO, Christian Henry. Well worth spending time catching up!

Register here to watch these presentations on-demand


Plant and Animal Genomics (PAG) Conference

We rang in 2020 at the ever-popular PAG meeting. In addition to an overview of the year ahead by CSO Jonas Korlach, our workshop featured several great talks by our users, including an update on the Sanger Institute’s Darwin Tree of Life project by Mark Blaxter (@blaxterlab); a talk about the tetraploid rose assembly by Bart Nijland (@bart3601) of Genetwister Technologies; great apes work by Zev Kronenberg (@zevkronenberg); and a discussion of plant-living fungi by Jana U’Ren (@you_wren) of the University of Arizona.

Missing PAG 2021 in January? Join us for PAGBio Day, our online alternative, on Jan. 19. Save your seat!


SMRT Leiden

Although we sure missed spending a few glorious spring days in the stunning Dutch city of Leiden, we were delighted to make our annual European user gathering international. This fave meet-up included not only top talks by plant, animal, human and microbial scientists, but also a strong offering of bioinformatics sessions. Keynotes included Vertebrate Genomes Project leaders Erich Jarvis (@erichjarvis) and Sergey Koren (@sergekoren) from the National Institutes of Health.

Register here to watch these presentations on-demand.


Advances in Genome Biology and Technology (AGBT) Conference

One of the last conferences we were able to attend in person, AGBT featured several informative sessions. Tina Graves-Lindsay from the McDonnell Genome Institute (@GenomeInstitute) described how her team is using PacBio sequencing to produce reference-grade human genome assemblies. Adam Ameur (@_adameur) from Uppsala University discussed the use of long-read sequencing to detect off-target results from CRISPR-Cas9 gene editing studies. And Brenda Oppert from the USDA made a convincing argument for developing insect-based food sources for people.



PacBio Neuroscience Day

Our first all-day event dedicated to neuroscience, #PBNeuroDay showcased a lot of emerging rare disease research. From unravelling repeat expansions to creating new methods of carrier screening, the 25 sessions tackled a wide array of topics and diseases, from ALS, Alzheimer’s and Ataxia to Muscular Dystrophy, Parkinson’s, Progressive Supranuclear Palsy and Schizophrenia.

Register here to watch these presentations on-demand


American Society of Human Genetics (ASHG)

ASHG featured a wide variety of talks and video poster presentations covering a range of applications using PacBio long-read sequencing technology, from single-cell isoform analysis of the nervous system (by Hagen Tilgner @hagentilgner of Weill Cornell) to solving rare disease cases in children (by Emily Farrow of Children’s Mercy). After hearing from our users, be sure to check out the handy overviews by PacBio experts Aaron Wenger and Liz Tseng (@magdoll).



Introducción a la Secuenciación Larga y Precisa con PacBio

This Spanish language webinar was a huge hit. Carmen Guarco, Senior Field Application Scientist specializing in bioinformatics, was joined by Álvaro G. Hernandez (@UIUC_DNAseq), Director of DNA services at the Roy J Carver Biotechnology Center at the University of Illinois at Urbana-Champaign to offer an overview of PacBio technology and tips for getting the most out of HiFi reads.



Beyond a Single Reference Genome – The Advantages of Sequencing Multiple Individuals

As it becomes increasingly clear that single reference genomes for each species are not enough, many scientists are interested in creating pangenome collections. So we brought together two experts — Kevin Fengler of Corteva and Matthias H. Weissensteiner (@MWeissensteiner) of Penn State to discuss the advantages of sequencing multiple individuals to gain comprehensive views of genetic variation, and the speed, cost, and accuracy benefits of using HiFi reads to sequence species of interest.



A HiFi View: Sequencing the Gut Microbiome with Highly Accurate Long Reads

How can the Sequel II System help with complex metagenomics projects? Meredith Ashby (@AsbhyMere), Director of Microbial Genomics at PacBio, was joined by Bing Ma of the Institute for Genome Sciences at the University of Maryland, who discussed her work using long-read sequencing to identify high-resolution microbial biomarkers associated with leaky gut syndrome in premature infants. George Weinstock (@geowei) of The Jackson Laboratory talked about the potential of highly accurate long reads to enable strain-level resolution of the human gut microbiome by resolving intraspecies variation in multiple copies of the 16S gene.



Long-Read Sequencing in COVID-19 Research

We’d be remiss not to mention our talks covering COVID-19 itself. In the Labroots webinar Opportunities for using PacBio Long-read Sequencing for COVID-19 Research, Meredith Ashby, Director of Microbial Genomics at PacBio, described how HiFi sequencing could be used for mutation phasing and rare variant detection to understand viral stability and mutation rates, as well as providing insights into viral population structure for monitoring viral evolution. In Understanding SARS-CoV-2 and Host Immune Response to COVID-19 with PacBio Sequencing, Melissa Laird-Smith (@SmithLab_UofL) discussed her work evaluating the impact of host immune restriction in health and disease with high resolution HLA typing and Corey Watson (@ctwatson29) of the University of Louisville School of Medicine talked about overcoming complexity to elucidate the role of IGH haplotype diversity in antibody-mediated immunity.



We look forward to plenty more new discoveries in 2021 – Happy New Year!


Tuesday, December 29, 2020

A Glimpse into the Gut: HiFi Sequencing Enables Strain-Level Study of Intestinal and Breastmilk Microbiota in Celiac Disease

SMRT Grant winner Ali R. Zomorrodi of Harvard Medical School

Celiac disease happens in the gut, but scientists still don’t fully understand the complex interplay between host genetics and the environmental factors that lead to the development of the autoimmune digestive disease.

Researchers at the Mucosal Immunology and Biology Research Center of MassGeneral Hospital for Children and Harvard Medical School are hoping to shed light on the ‘microbial dark matter’ in the breastmilk of mothers with celiac disease and in the intestine of celiac children using full-length 16S rRNA and metagenome sequencing — they will be supported in their efforts by the 2020 Microbial Genomics SMRT Grant.

Ali R. Zomorrodi (@arzomorrodi), the research center’s computational and systems biology lead and an Instructor of Pediatrics at Harvard Medical School, has teamed up with Alessio Fasano, a renowned celiac disease specialist, chief of Pediatric Gastroenterology and Nutrition at Massachusetts General Hospital and a Professor of Pediatrics at Harvard Medical School, to tackle these questions.

By leveraging a large-scale prospective birth cohort study referred to as the Celiac Disease Genomic, Environmental, Microbiome, and Metabolomic (CDGEMM) study, led by Dr. Fasano, they have a unique opportunity to delve more deeply into the role of microbiota in the etiology of the disease. The CDGEMM study has been banking stool, breastmilk and other biospecimens, as well as clinical metadata from ~500 infants with high risk of celiac disease from birth through childhood.

HiFi Sequencing for Strain-Level Resolution

In celiac disease, one of the most common forms of food intolerance worldwide, the ingestion of gluten-containing grains triggers an immune response that attacks and progressively damages the small intestine. It is a unique model of autoimmune diseases since it is the only such disease for which the environmental trigger (exposure to gluten) and genetic risk factors are well-characterized.

Research shows that 30% of the population are genetically susceptible to celiac disease and are exposed to gluten, yet only 2-3% of them develop the disease. Recent studies suggest a critical role for the gut microbiota in celiac disease pathogenesis, but how exposure to environmental risk factors other than gluten early in life may alter the engraftment of the gut microbiota in infants at risk of the disease is still poorly understood. So, one aspect of Zomorrodi’s project involves investigating the role breast milk may play in the engraftment of the gut microbiota in CDGEMM infants.

Zomorrodi and colleagues will use full-length 16S rRNA HiFi sequencing to see whether the composition of breast milk microbiota in mothers of CDGEMM infants with a history of celiac disease differs from that of mothers without the disease. They will also use this technology to study fecal microbiota from infants of these mothers and fecal microbiota of at-risk infants who consume formula, to explore whether the breast milk microbiota has any effect on the composition of the babies’ intestinal microbiota. While conventional short-read 16S sequencing can identify microbes at genus or sometimes species level, 16S HiFi sequencing would allow the team to profile the microbiota at strain-level resolution. This will enable researchers to gain deeper insights into the role of breast milk microbiota in shaping the intestinal microbiota of infants at risk of celiac disease.

A Deeper Understanding of the Intestinal Microbiome

In another project, Zomorrodi and his colleagues will be applying HiFi metagenomic sequencing to fecal samples from CDGEMM children who developed celiac disease and matched controls who did not. They hope to comprehensively characterize the intestinal microbiome composition of these subjects at strain-level resolution and to identify celiac-specific biomarkers in the microbiome.

“This is important, because we know that many diseases are driven by specific strains within the same species. Healthy people may carry the same species, but do not become ill. We want to know why,” Zomorrodi said.

Zomorrodi also wants to capture as much of the microbiome as possible using this innovative technology.

“A good proportion (up to 50%) of the short-read metagenomic data we collect cannot be mapped to any database during the taxonomic or functional profiling processes. This means that we are losing a significant portion of the data that could contain a lot of valuable information about the microbiome,” he said.

In addition to increasing the chance of identifying low-abundance microbes that may not be otherwise identified using short-read methods, the HiFi reads will enable a finer-level functional characterization of the microbiota and the assembly of closed genomes for novel microbial strains that do not exist in databases. These closed genomes can serve as a basis for downstream computational investigation of the microbiota function, such as constructing computational genome-scale models of metabolism.

“This study could go a long way towards finding celiac-specific biomarkers and designing targeted microbiota intervention strategies to treat celiac disease and other autoimmune diseases.”

Zomorrodi said he is looking forward to his first experience using PacBio long-read sequencing. Once a skeptic, he was sold on the value of full-length 16S rRNA and metagenomic HiFi sequencing after seeing data presented at a Cold Spring Harbor microbiome conference.

“It is really amazing,” he said. “I believe the field of the microbiome will be moving forward into using long-read technology. There are really lots of exciting opportunities that didn’t exist before at this level of resolution.”


We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, Maryland Genomics, for supporting the 2020 Microbial Genomics SMRT Grant Program. Explore the 2020 HiFi For All – Collaborations SMRT Grant Program to apply to have your project funded.

Learn more about using HiFi reads to explore microbiology and infectious disease.

Wednesday, December 16, 2020

With Highly Accurate Variant Calling and Phasing, SMRT Sequencing Advances PGx Studies of SLC6A4

Scientists at Stanford University and the Icahn School of Medicine at Mount Sinai have made impressive strides in resolving variants in the SLC6A4 promoter associated with susceptibility to psychiatric disorders and response to antidepressants. This progress was made possible with highly accurate, long-read sequencing, known as HiFi sequencing.

Published in the journal Genes, the paper comes from lead author Mariana Botton, senior author Stuart Scott, and collaborators. It describes a SMRT Sequencing-based approach to analyzing amplicons of the SLC6A4 promoter region, which is noted for “a variable number of homologous 20–24 bp repeats,” the authors write, as well as long, extra-long, short, and extra-short alleles with differing expression. The gene itself is important for the pharmacodynamics of antidepressants, one of the most frequently prescribed classes of drugs.

As Botton et al. note, identifying key variants within the promoter region is most valuable in the context of haplotypes showing whether variants share an allele. Unfortunately, that information is not easy to access. “Short-read sequencing is not effective at accurately interrogating the SLC6A4 promoter, particularly across the VNTR that includes the 5-HTTLPR insertion/deletion (L>S) polymorphism,” the scientists report. “This overarching limitation of short-read sequencing has previously been acknowledged, as low complexity regions and tandem repeats in the human genome are notoriously challenging for short-read platforms.”

With that in mind, the team turned to long-read sequencing data from PacBio. They designed four overlapping primer sets to span the SLC6A4 promoter region and the necessary oligo tags for barcoding, and then sequenced the resulting amplicons for 120 samples with SMRT Sequencing. They also performed Sanger sequencing and gathered publicly available short-read data for many of the samples and compared genotype results across platforms.

The scientists found that three of the key variants “were either not detectable or incorrectly genotyped among the [various short-read data sets] in 32/32 (100%), 60/68 (88%) and 17/21 (81%) samples and 87/96 (91%), 85/204 (42%) and 34/63 (54%) variant sites, respectively.” PacBio sequencing, on the other hand, allowed for the detection of all variants, including rare extra-long alleles. “In addition to being more accurate at this locus than short-read sequencing, long-read SMRT sequencing also unambiguously phased the polymorphic SLC6A4 promoter in all samples, including complex compound heterozygous diplotypes,” the team adds. Sanger sequencing results for six samples confirmed the variants identified by SMRT Sequencing.
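As a quick arithmetic check, the percentages quoted above follow directly from the reported counts. The short Python sketch below recomputes them; the variable names and the grouping into two lists are illustrative choices, not anything taken from the paper.

```python
# Counts quoted from the paper, as (affected, total) for each of the three key variants.
samples = [(32, 32), (60, 68), (17, 21)]
variant_sites = [(87, 96), (85, 204), (34, 63)]


def percent(affected, total):
    """Return the share of affected items as a whole-number percentage."""
    return round(100 * affected / total)


sample_pct = [percent(a, t) for a, t in samples]
site_pct = [percent(a, t) for a, t in variant_sites]

print("samples:", sample_pct)
print("variant sites:", site_pct)
```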

PacBio long reads enable detection and phasing of an allele missed by short-read sequencing. Botton, et al. (2020) Genes.

To assess the reproducibility of the SMRT Sequencing workflow, the team evaluated reference samples with known SLC6A4 variants in triplicate. “The intra- and inter-run genotype and diplotype concordances for the 15 control samples were both 100%,” the researchers write.

“Our innovative method enabled the phased resolution of complex SLC6A4 promoter diplotypes, which was not possible using short-read WGS data (~5X and ~45X) or high-depth capture-based short-read sequencing data (~330X),” the team notes. “SLC6A4 long-read SMRT sequencing is a reliable and validated third-generation sequencing technique that can accurately interrogate the low-complexity homologous SLC6A4 promoter region.”

Learn more about best practices and workflows for targeted sequencing in human biomedical research.

Monday, December 14, 2020

SMRT Sequencing Offers a Universal Approach for Thalassemia Carrier Testing

Scientists in China have used SMRT Sequencing to demonstrate the value of highly accurate long reads for identifying, linking, and phasing variants associated with a group of blood disorders known collectively as thalassemia. Ultimately, they predict in the Journal of Molecular Diagnostics, long-read sequencing could support a new carrier screening approach for prospective parents interested in knowing their risk of passing these diseases on to their children.

“Long molecule sequencing: a new approach for identification of clinically significant DNA variants in alpha and beta thalassemia carriers” comes from lead authors Liangpu Xu, Aiping Mao, and Hui Liu and collaborators at Fujian Provincial Maternity and Children’s Hospital, Berry Genomics, and other institutions.

The scientists undertook this project because of the need for improved carrier screening tools for ethnicities where thalassemia is prevalent, such as in Southeast Asian, Southern Chinese, Middle Eastern, Mediterranean and North African populations. While current screening methods have been fairly effective, the authors write, they “can be laborious and difficult to perform well at the laboratory level.”

Current screening techniques are unable to detect the broad range of variants and variant types associated with thalassemia. The usually fatal disease type known as α-thalassemia has been linked to nearly 130 pathogenic variants across the HBA1 and HBA2 genes, while its milder β-thalassemia counterpart can be caused by more than 200 pathogenic variants in the HBB gene. Seeking a potential approach that would provide more information about the entire spectrum of variants associated with thalassemia, the scientists turned to long-read SMRT Sequencing.

David Cram of Berry Genomics presents examples of rare variants that were detected using SMRT Sequencing, but missed by PCR testing.

The team produced amplicons for all relevant genes. They then sequenced those regions on a Sequel System in a blinded study of 74 samples: 64 from known carriers, and 10 non-carrier controls. Results showed that “all HBA1/2 and HBB variants detected by [long-read sequencing] were concordant with those independently assigned by targeted PCR assays,” the authors report. SMRT Sequencing data correctly called the 20 known pathogenic variants in these samples. Importantly, the long-read sequencing technology “was able to discriminate compound heterozygous SNVs (trans configuration) and identify variants linked to benign SNPs (cis configuration),” the scientists add, noting that it also pinpointed linked variants which may increase accuracy of interpretation.

Based on these results, the team highlights some advantages offered by SMRT Sequencing as a possible carrier screening tool. “Since the entire gene regions are analyzed, the test has the potential to detect other HBA1/2 and HBB variants that may be outside the scope or difficult to accurately detect by traditional tests,” Xu et al. write. Also, the technology “should be capable of detecting other mild and silent HBB variants located in regulatory regions as well as HBB gene deletions that occur in approximately 1% of carriers.”

Finally, they note, being able to detect all of these variants in a single workflow with barcoded libraries makes the approach more scalable. SMRT Sequencing “displayed the hallmarks of a scalable, accurate and cost-effective genotyping methodology” that could “eventually serve as a comprehensive method for large-scale thalassemia carrier screening,” the team concludes. They also report that such an approach will become even more cost-effective when performed on the Sequel II System.

In a presentation at the PacBio Global Summit, paper co-author David Cram of Berry Genomics stated: “This type of technology would be very useful for carrier testing, not only thalassemia, but of other genetic conditions that involve complex genomic regions or copy number variants.”

Watch Cram’s full presentation: Comprehensive Analysis of Thalassemia Alleles (CATSA): A Universal Approach to Thalassemia Carrier Testing


Thursday, December 10, 2020

Sequencing 101: Ploidy, Haplotypes, and Phasing – How to Get More from Your Sequencing Data

The ploidy, or number of copies of each chromosome in a genome, affects not only the size but also the complexity of the genome.

Geneticists often point out that a human does not have “a” genome but rather two genomes, one inherited from the mother and another from the father. The number of complete sets of chromosomes in each cell, or haplotypes, is referred to as ploidy. Humans and most other animals are diploid (2N), having two sets. Many plants have higher ploidy, for example, the hexaploid (6N) California Redwood has 6 copies of each chromosome.

The number of chromosome pairs not only increases the total amount of DNA in a genome, but it also increases the complexity of the genome – by increasing the number of alleles, or alternate forms of genes. Although the majority of the sequence between paired chromosomes is identical, it’s the differences that provide the breadth of biological variation within a species.


Phasing Haplotypes to Get a Complete Picture of Genetic Variation

Whether sequencing a giant polyploid or diploid, the goal remains the same – to get a complete and accurate representation of each copy of the genome or region of interest. This is often achieved by assembling a haploid (single copy) genome and then identifying variants, locations where the alleles differ. Many well-studied organisms, like humans, have standard haploid references against which other individuals are compared.

But identifying variants does not provide the complete sequence of the genome. That requires phasing, or determining which variants are from the same copy of a chromosome (in “cis”) and which are from different copies (in “trans”). One approach to phasing is to use mother-father-child trios: variants in the child’s genome that are only present in one parent must be on the same chromosome. A second approach is population inference, which deduces that variants often seen in the same people are likely in phase. Both trio and population phasing are imperfect, as they require additional information and are only able to phase some variants.
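To make the trio idea concrete, here is a minimal Python sketch. It assumes simple biallelic sites with genotypes written as unordered allele pairs (0 for reference, 1 for alternate); the function name `phase_site` and the example sites are hypothetical and chosen only for illustration. It also shows why trio phasing resolves only some variants: when all three family members are heterozygous, the parental origin cannot be determined.

```python
def phase_site(child, mother, father):
    """Phase one child genotype using the parents' genotypes.

    Each genotype is an unordered pair of alleles, e.g. (0, 1).
    Returns (maternal_allele, paternal_allele), or None when the trio
    cannot resolve the phase (or the genotypes are inconsistent).
    """
    a, b = child
    if a == b:
        return (a, b)  # homozygous: nothing to phase
    options = []
    # Try both assignments of the child's two alleles to the two parents.
    for mat, pat in ((a, b), (b, a)):
        if mat in mother and pat in father:
            options.append((mat, pat))
    # Exactly one consistent assignment means the site is phased.
    return options[0] if len(options) == 1 else None


# Hypothetical trio: site name -> (child, mother, father) genotypes.
trio = {
    "site_A": ((0, 1), (0, 0), (0, 1)),  # alt allele can only come from the father
    "site_B": ((0, 1), (1, 1), (0, 1)),  # alt allele must come from the mother
    "site_C": ((0, 1), (0, 1), (0, 1)),  # everyone heterozygous: ambiguous
}
phased = {name: phase_site(*gts) for name, gts in trio.items()}
print(phased)
```

Real trio phasing tools also have to handle genotyping errors, missing data, and multi-allelic sites, none of which this sketch attempts.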

Phasing involves separating maternally and paternally inherited copies of each chromosome into haplotypes to get a complete picture of genetic variation.


Recent advances in DNA sequencing technology and the tools used to assemble and phase genomes allow large blocks of the sequence to be phased directly from DNA sequencing reads of one individual. Highly accurate long reads, known as HiFi reads, are uniquely suited to phasing haplotypes as they provide the high accuracy needed to detect single nucleotide variants (SNVs) and the read length to connect these variants over a long range.
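The intuition behind read-based phasing can be sketched in a few lines of Python. The example assumes that each read has already been reduced to the alleles it shows at heterozygous SNV sites (0 for reference, 1 for alternate); the function `phase_from_reads` and the toy reads are hypothetical, and production phasing tools use far more robust statistical models than this majority vote between neighbouring sites.

```python
from collections import Counter


def phase_from_reads(sites, reads):
    """Chain adjacent heterozygous sites into one haplotype using reads.

    sites: ordered list of site names.
    reads: list of dicts mapping site name -> observed allele (0 or 1).
    Returns haplotype 1 as a dict of site -> allele for the sites that
    could be linked; haplotype 2 is its complement. Linking stops at the
    first pair of neighbouring sites that no read spans (or on a tie).
    """
    hap = {sites[0]: 0}  # arbitrary orientation: haplotype 1 carries ref at the first site
    for left, right in zip(sites, sites[1:]):
        votes = Counter()
        for read in reads:
            if left in read and right in read:
                votes["same" if read[left] == read[right] else "diff"] += 1
        if votes["same"] == votes["diff"]:
            break  # no spanning reads, or a tie: the phase block ends here
        same = votes["same"] > votes["diff"]
        hap[right] = hap[left] if same else 1 - hap[left]
    return hap


sites = ["snv1", "snv2", "snv3", "snv4"]
reads = [
    {"snv1": 0, "snv2": 1},
    {"snv1": 1, "snv2": 0},
    {"snv2": 1, "snv3": 1},
    {"snv2": 0, "snv3": 0},
    {"snv1": 0, "snv2": 1, "snv3": 1},  # a longer read spanning three sites
]
block = phase_from_reads(sites, reads)
print(block)
```

In this toy data no read connects snv3 to snv4, so the phase block ends at snv3. Longer reads span more neighbouring variants and therefore extend the block, while accurate reads keep the votes from being corrupted by sequencing errors.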

Using HiFi reads, either alone or in combination with other technologies like Hi-C and Strand-seq, scientists have been able to produce phased genome assemblies of the rose – a complex tetraploid; the California redwood; and humans, including one of Puerto Rican descent, one of Korean descent, and a cognitively healthy supercentenarian. The phased genomes have each provided novel insights into functionally important variants.

Phasing Genes to Identify Allelic Configuration of Variants

Phasing of breast cancer tumors revealed allelic configuration that impacts treatment response. Vasan et al. (2019) Science.

Scientists analyzing variants in the PIK3CA oncogene found a compound mutation — a double mutation that appears to give breast cancer patients an overwhelmingly positive response to the targeted PI3Kα inhibitor alpelisib. By sequencing and phasing the entire gene, the researchers were able to show that having both variants on the same allele (cis) led to a super-responder phenotype; when those variants were on separate alleles (trans), that was not the case. This information will have clinical relevance for many cancer patients and would never have been known without the ability to phase sequence data.

For recessive disease genes, it also is critical to know whether two variants seen in a gene are in trans (thus breaking both copies of a gene) or cis (thus leaving one copy intact). For example, in the case of a 9-year-old boy with multiple types of cancer, phasing of the MSH6 gene revealed that both maternal and paternal alleles carried mutations resulting in constitutive mismatch repair deficiency syndrome.
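To make the cis/trans distinction concrete, here is a small sketch that classifies two heterozygous variants from phased genotype strings. It assumes the VCF convention in which `0|1` and `1|0` name the two haplotypes, and it assumes both variants belong to the same phase block; the function names are hypothetical.

```python
from typing import Optional


def _variant_haplotype(gt: str) -> Optional[int]:
    """Return 0 or 1 for the haplotype carrying the variant, else None."""
    if "|" not in gt:            # "0/1" means unphased
        return None
    alleles = gt.split("|")
    if sorted(alleles) != ["0", "1"]:   # only heterozygous sites are handled
        return None
    return alleles.index("1")


def configuration(gt_a: str, gt_b: str) -> str:
    """Classify two phased heterozygous genotypes as 'cis' or 'trans'."""
    hap_a = _variant_haplotype(gt_a)
    hap_b = _variant_haplotype(gt_b)
    if hap_a is None or hap_b is None:
        return "undetermined"
    return "cis" if hap_a == hap_b else "trans"
```

For a recessive disease gene, a result of `trans` for two damaging variants would mean both copies are affected, while `cis` would leave one copy intact.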

Haplotype Phasing to Explore the Genetic Origins of Species

The genetic distance between cultivated apple and wild progenitors identified a large portion of the Gala genome as hybrid in origin. Sun et al. (2020) Nature Genetics.

Researchers exploring apple domestication used haplotype-resolved assemblies of cultivated and wild species to better understand the genetic history of the crop. They were able to sequence and assemble full “haplomes” (haploid genomes) and showed high levels of heterozygosity with more than 20% of the Gala apple genome containing alleles derived from different wild progenitors, showing the Gala was hybrid in origin. Further, they found that introgression of new genes and alleles was a critical component to the domestication of the cultivar. This information provides better understanding of trait variability and will assist in efforts to breed for desirable traits like fruit weight and sweetness.

Allele Phasing to Resolve Variants Missed by Short Reads

Long reads enable detection and phasing of an allele missed by short-read sequencing. Botton, et al. (2020) Genes.

Scientists assessing the role of the promoter of the SLC6A4 gene, which is thought to play a role in psychiatric disorder susceptibility, found long-read sequencing critical for interrogating a low-complexity repeat region. The length of a repeat in the gene’s promoter affects gene expression levels. Phasing the repeat length with variants in the coding region of the gene indicates whether a coding variant will have high or low expression. The authors found the repeat region was missed by short-read approaches; long-read sequencing both characterized the repeat and unambiguously phased clinically significant variants that may improve pharmacogenetic testing.


How to Obtain Phasing Information with HiFi Reads?

Now that you’ve seen how phasing can provide valuable insights, here is how to obtain phasing information:

  1. Sequence an individual with HiFi reads, which have the accuracy needed to resolve differences and the long read length to phase large haplotype blocks.
  2. Use a diploid-aware assembler like IPA, hifiasm, or HiCanu for genome assembly.
  3. Detect variants with an accurate variant caller like Google DeepVariant and then phase haplotypes with WhatsHap.
  4. Combine HiFi data with additional technologies to extend haplotype phasing to the chromosome scale. HiFi data in combination with Hi-C or Strand-seq can phase entire genomes. If a family trio sample is available, short read data from the parents can be used to separate HiFi reads into parental bins before genome assembly (HiCanu) or during genome assembly.
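Once variants have been called and phased, a natural next question is how large the phased haplotype blocks are. The sketch below summarizes blocks from a phased VCF. It assumes a single-sample VCF in which phased genotypes use `|` and each phase block is labeled with a `PS` (phase set) value in the FORMAT column, as the VCF specification allows; the function name and the example records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def phase_blocks(vcf_lines: Iterable[str]) -> Dict[Tuple[str, str], Tuple[int, int, int]]:
    """Map (chromosome, phase set) to (first_pos, last_pos, n_variants)."""
    positions = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "|" in sample.get("GT", "") and sample.get("PS", ".") != ".":
            positions[(chrom, sample["PS"])].append(pos)
    return {key: (min(p), max(p), len(p)) for key, p in positions.items()}


# Hypothetical records, not real data
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:1000",
    "chr1\t5000\t.\tC\tT\t50\tPASS\t.\tGT:PS\t1|0:1000",
    "chr1\t9000\t.\tG\tA\t50\tPASS\t.\tGT\t0/1",
    "chr1\t20000\t.\tT\tC\t50\tPASS\t.\tGT:PS\t0|1:20000",
]
blocks = phase_blocks(example)
```

In this toy input, the first two variants share a phase set and form a block spanning positions 1000 to 5000, the unphased variant is ignored, and the last variant sits in a block of its own.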



To learn more about how phasing could make a difference for your research, contact a PacBio scientist to get started with your next sequencing project.


Explore other posts in the Sequencing 101 series:

The Evolution of DNA Sequencing Tools

Introduction to PacBio Sequencing and the Sequel II System

Why Are Long Reads Important for Studying Viral Genomes?

What’s the Value of Sequencing Full-length RNA Transcripts?

Looking Beyond the Single Reference Genome to a Pangenome for Every Species

Understanding Accuracy in DNA Sequencing

From DNA to Discovery – The Steps of SMRT Sequencing


Wednesday, November 25, 2020

Better Apple Pies: HiFi Reads are a Perfect Recipe for High-Quality Apple Genome and Pangenome Assemblies

Zhangjun Fei inspects a Mutsu apple at Indian Creek Farm in Ithaca, NY. Image credit: Boyce Thompson Institute

Scientists at the Boyce Thompson Institute, Cornell University and the USDA Agricultural Research Service have reported significant progress in understanding the genomic features of domestic and wild apples. They used HiFi reads, highly accurate long reads, generated by the Sequel II System to build phased, diploid genome assemblies, as well as apple pangenomes to represent more of the remarkable genetic diversity in this lineage and better characterize its historic domestication.

The paper, published in Nature Genetics, comes from lead authors Xuepeng Sun (@XuepengBio), Chen Jiao, and Heidi Schwaninger; senior author Zhangjun Fei (@fei_lab), and collaborators.

We asked Fei about the highlights of the team’s efforts, and here’s how he summed it up: “We assembled phased diploid genomes of modern apple (Malus domestica) and the two major wild progenitors, M. sieversii and M. sylvestris using PacBio HiFi reads and Illumina short reads, and constructed pan-genomes of the three species. We inferred the genetic contributions of the two wild progenitors to the cultivated apple, and identified genome regions under selection during apple domestication and associated with important traits such as fruit size, texture and taste.”

The team focused on the tasty Gala apple, knowing that producing an accurate genome assembly would require more than short-read sequencing data.

“Most crop plants have complex genomes characterized by large size, high heterozygosity level and polyploidy,” they write in the paper. “The apple genome is highly heterozygous, posing a major challenge for earlier genome assemblies.”

To address those challenges, the scientists incorporated HiFi reads into their strategy, sequencing the Gala apple and its wild progenitors at coverages ranging from 37-fold to 81-fold. These HiFi reads were then assembled using hifiasm and HiCanu, respectively (read more about these and other options for HiFi assemblers in this blog post). Those results were merged with orthogonal data sets to create diploid genomes for each of the three apples, with final assemblies reaching about 1.3 Gb.

“Despite high heterozygosity rates (0.85–1.28%), all assemblies showed high contiguity, with the scaffold N50 of 3.3–4.3 Mb in diploid assemblies and 16.8–35.7 Mb in haploid consensuses,” they add.
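For readers unfamiliar with the N50 statistic quoted here: it is the length of the sequence at which the cumulative total, counted from the longest sequence down, first reaches half of the assembly size. A minimal sketch, with a hypothetical function name and toy lengths:

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

With toy scaffold lengths of 10, 8, 5, 4 and 3 (total 30), the two longest scaffolds already cover 18, which is more than half, so the N50 is 8.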

The extremely high quality of the final assemblies allowed the scientists to identify an error in previously published apple genomes associated with a 5 Mb inversion on Chromosome 1.

But the team also wanted to go beyond just one high-quality assembly for the Gala apple, pointing out that “a single reference genome can by no means represent a whole population.” To that end, they constructed a pangenome for each of the three apple types, using 91 accessions to capture natural genetic diversity. Through this work, they added between 89 Mb and 212 Mb of novel sequence data to each genome, covering thousands of new genes.

“Unlike annual crops such as the tomato, the pan-genome size of the cultivated apple is larger than that of wild progenitors, possibly due to the outcrossing nature and extensive introgression from wild species,” Sun et al. write. This distinctive feature suggests that introgression of new genes/alleles is possibly a hallmark of crops domesticated through hybridization.

One of the most important motivations for this study was to support apple breeding programs through a deeper understanding of trait variability.

“Traits introgressed in the hybrid are often not fixed and could be lost when propagated by seeds,” the authors note. “Understanding of the molecular basis of trait variability, which requires the knowledge of the diploid alleles, is critical for fixation of desirable traits in apple breeding.”


Tuesday, November 17, 2020

Scientists Pinpoint Pathogenic Inversion in Intellectual Disability Case Using HiFi Sequencing

Scientists at Yokohama City University Graduate School of Medicine and Osaka Women’s and Children’s Hospital have discovered a novel pathogenic variant associated with intellectual disability. They made the discovery using HiFi sequencing after previous short-read investigations failed to produce an answer.

In the journal Genomics, the team reports the case of 12-year-old monozygotic twin girls who exhibited developmental delays, severely drooping eyelids, and seizures since the age of 5 months. Clinical symptoms matched Dravet syndrome, but no molecular evidence was available to confirm that diagnosis. Their case had previously been analyzed with short-read exome sequencing, but no pathogenic variants were uncovered. Lead author Takeshi Mizuguchi, senior author Naomichi Matsumoto, and collaborators turned to HiFi sequencing and the Sequel II System “to search for variants that are unrecognized by exome sequencing,” they write.

While intellectual disability (ID) has been linked to variants in more than 500 genes, even the best analytical methods have a diagnostic success rate of less than 30%. “There are still many cases for which no molecular diagnosis has been possible,” the authors note. “Therefore, it is important to determine the molecular genetics of unsolved ID cases using new technologies.”

The scientists sequenced 15 kb size-selected libraries for one of the twins and both parents to generate highly accurate (>99% or Q20) long reads, known as HiFi reads. Next, the team used pbsv to call structural variants, and Google’s DeepVariant to call small variants and indels. This process highlighted hundreds of deletions, insertions, and duplications, plus seven inversions, in the twin’s genome that were potential de novo structural variants. “A 12-kb inversion disrupting the coding sequence of Bromodomain and PHD Finger containing 1 … immediately drew our attention,” the authors report, because variants in this region had been linked to an intellectual disability syndrome consistent with the twins’ symptoms. “Among the 16 possible de novo [structural variant] calls affecting RefSeq gene exons, no other genes were linked to an OMIM autosomal dominant disease entry,” they add.
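The pairing of ">99%" with "Q20" follows from the Phred scale, where the quality value is Q = -10 · log10(error rate). A short sketch of the conversion in both directions (function names are chosen for this example):

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Convert read accuracy (e.g. 0.99) to a Phred quality value."""
    error = 1.0 - accuracy
    if not 0.0 < error < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(error)


def accuracy_from_phred(q: float) -> float:
    """Convert a Phred quality value (e.g. 20) to read accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)
```

An accuracy of 99% corresponds to one error in 100 bases, or Q20; each additional 10 points on the scale means ten times fewer errors, so Q30 is 99.9%.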


HiFi sequencing of a trio identifies a pathogenic heterozygous 12 kb de novo inversion that disrupts the gene BRPF1. Single-nucleotide variants (marked with “*”) show that the inversion occurred on the maternal allele #3.

The 12 kb copy-neutral inversion was confirmed with Southern blot, which also showed that both parents and an unaffected older brother lacked the inversion. A breakpoint analysis found that “the two breakpoint junctions identified by Sanger sequencing and the pbsv inversion call were identical,” the team notes, “demonstrating the accuracy of HiFi long-read analysis.” The scientists also point out that not only was the inversion missed by exome sequencing due to its copy-neutral status and repetitive flanking sequence, but it also would have been missed by traditional chromosomal analysis, which has a lower limit of detection of 10 Mb. Finally, using the trio data with haplotype phasing, the team discovered that the inversion was a de novo variant on the maternally transmitted chromosome.

“Importantly, the current study demonstrates that inversions can now be accessed using an ‘unbiased-genomic’ strategy with no prior knowledge,” the authors write. “This state-of-the-art technology is advantageous for elucidating hitherto inaccessible genomic changes.”


To learn more, explore SMRT Sequencing workflows and additional resources on comprehensive variant detection and structural variant detection.


Tuesday, November 10, 2020

Secrets to Longevity Explored in de novo Genome of 115-Year-Old Woman

Hendrikje van Andel-Schipper, at age 108

A new publication from scientists in The Netherlands and Belgium offers tantalizing insights that may shed light on age-related neurodegenerative disorders. The team used SMRT Sequencing to produce a de novo diploid assembly of the genome of a Dutch woman named Hendrikje van Andel-Schipper, who died at the age of 115 with no signs of cognitive decline, and then performed a detailed analysis of variants detected. The data are publicly available to the scientific community.

The paper, released in Translational Psychiatry, comes from lead author Jasper Linthorst and senior author Henne Holstege (@HolstegeHenne) at Amsterdam Neuroscience and their collaborators. They aimed to identify structural variants (SVs) that could be associated with the onset of neurological disorders; for this, they compared the centenarian genome assembly with several previously available human genome assemblies.

The team chose long-read PacBio sequencing technology because they determined that “due to their repetitive nature, [SVs] are currently underexplored in short-read whole genome sequencing approaches,” they write. Repetitive regions, particularly repeat expansions that tend to grow larger over generations, have been shown to be pathogenic for a number of neurological diseases. “Using common sequencing approaches, the assessment of large repetitive regions is difficult because short 100-150 bp sequence-reads do not span the entire structural variant,” the authors report. “The solution to this problem is to generate longer sequencing reads.”

For this project, the scientists generated a de novo, phased genome assembly for the 115-year-old woman, which they refer to as W115. This was based on sequencing genomic DNA from three tissues and relied on FALCON-Unzip to create the diploid assembly of about 2.82 Gb. This information was compared to two haploid assemblies and the latest human reference genome to search for SVs of 50 bp or longer.

The scientists used a graph-based multi-genome aligner called REVEAL and found a total of 31,680 SVs. Nearly 70% were classified as variable number tandem repeats (VNTRs). “Interestingly, we observed that VNTRs in the subtelomeric regions were composed of longer repeat subunits than VNTRs outside the subtelomeric regions, and that they had a higher GC-content,” they report. Expanded VNTRs have been linked to faulty gene transcription. “The genes that contained most VNTRs, of which PTPRN2 and DLGAP2 are the most prominent examples, were found to be predominantly expressed in the brain and associated with a wide variety of neurological disorders,” the scientists add.

Henne Holstege, Amsterdam Neuroscience

In addition, the team analyzed the list of structural variants to see how SMRT Sequencing had made a difference in detection. Using short-read data for the W115 genome only, they found just 5,826 SVs. About 83% of the SVs — that’s more than 18,000 variants — found in the PacBio assembly “were uniquely identified through long-read sequencing,” the scientists note.

The sequence data for this project was produced on a PacBio RS II system, but Holstege and her team have already acquired a Sequel II System for the next phase of this effort. That will involve a large study encompassing at least 150 cognitively healthy centenarians and 150 individuals with Alzheimer’s disease, with the goal of identifying VNTRs that have significantly different lengths between the two groups. Holstege and her team will be generating HiFi reads and they expect to cover each genome in the study with a single SMRT Cell. “We want to know what about these individuals makes them so special,” she told us.


Interested in learning more about this research? Hear Holstege present “Uncovering Neurological Disorders Through an Examination of VNTRs” at our upcoming Neuroscience Day on December 9th. Explore workflows and additional resources on comprehensive variant detection or structural variant detection.


Thursday, October 29, 2020

Breast Cancer Research Legend Mary-Claire King Identifies New Pathogenic Mutation with HiFi Sequencing

Mary-Claire King

It’s Breast Cancer Awareness Month, and we can’t think of a better way to celebrate than to honor the passionate scientist who has perhaps single-handedly done more to advance breast cancer research than anyone else alive: Mary-Claire King, discoverer of the BRCA1 and BRCA2 genes. In recognition of her lifelong contributions, King was just awarded the prestigious William Allan Award, the top prize presented annually by the American Society of Human Genetics to recognize substantial and far-reaching scientific contributions to human genetics, carried out over a sustained period of scientific inquiry and productivity.

In a recent publication in the Journal of Medical Genetics, King and her collaborators at the University of Washington combined CRISPR-Cas9 targeting with HiFi sequencing to reveal novel and biologically relevant mutations in the BRCA1 gene.

The effort was driven by a need to better characterize the well-known BRCA1 and BRCA2 genes in families with hereditary breast cancer. Short-read sequencing “is of limited use for identifying complex insertions and deletions and other structural rearrangements,” the scientists note. “The BRCA1 genomic region is particularly challenging for short-read sequencing. It is composed of 42% Alu repeats, the second highest proportion in the genome, and a 30 kb tandem segmental duplication spanning its promoter and first two exons.” To expand the clinical utility of information about these genes in the future, much research remains to be done to characterize the many variants missed by short reads.

For this study, scientists aimed to sequence the BRCA1 and BRCA2 genes from individuals representing 19 families with a history of early-onset breast cancer. All of these individuals had previously had these genes analyzed with gene panels and whole exome sequencing, but no pathogenic mutations were found that explained the early onset breast cancer susceptibility.

To target the two genes of interest, the team used the HLS-CATCH CRISPR-based targeting method from Sage Science, extracting 200 kb of high molecular weight libraries ideal for use with PacBio sequencing. HiFi sequencing was performed on the Sequel System, with average genomic fragment length of about 10,000 bases to fully cover the two BRCA loci, including non-coding elements.

In one case, this approach unlocked a novel variant to explain the family’s history of cancer. “We identified an intronic SINE-VNTR-Alu retrotransposon insertion that led to the creation of a pseudoexon in the BRCA1 message and introduced a premature truncation,” the scientists report. The retrotransposon was nearly 3 kb long. “Multiple long reads included all elements of the mutation and of wild-type flanking BRCA1 intronic sequence, so that the mutation’s position and the sequence were clear,” the authors note, adding that the variant segregated with breast cancer throughout the family. After identifying this tough-to-find type of variant, the authors confirmed that the intronic repeat element can affect the final BRCA1 message by sequencing cDNAs from matching patient cells.

Based on these findings, the team suggests that there may be many other pathogenic complex structural variants. “It is possible, even likely, that complex mutations are common at tumour suppressor genes,” they write. “We suggest that complex mutations have thus far been rarely encountered, because they are difficult to detect with existing approaches.”

King and her collaborators believe the approach they used will be important for continuing to uncover these variants. “The genomic approach described here, integrating CRISPR–Cas9 excision of critical loci with long-read sequencing, yields complete sequence of targeted loci and thus can detect all classes of complex non-coding structural variants,” they report. “This combination of CRISPR–Cas9 excision and long-read sequencing reveals a class of complex, damaging and otherwise cryptic mutations that may be particularly frequent in tumour suppressor genes replete with intronic repeats.”


Listen to King share the emotional and humorous story of the events leading to the funding of the project that resulted in the discovery of the BRCA1 gene – a true testament to her persistence and the constant challenge of balancing career and family, with a cameo from Joe DiMaggio!


Monday, October 26, 2020

Egyptian Genome Added to Growing List of Population-Specific Reference Genomes

We’re excited to report that another team has used PacBio long-read sequencing to produce a population-specific reference genome — this time an Egyptian genome that should prove valuable for boosting precision medicine for people of North African ancestry.

Lead author Inken Wohlers, senior authors Hauke Busch (@BuschLab) and Saleh Ibrahim, and their collaborators at the University of Lübeck, Mansoura University, and other institutions report their results in Nature Communications. The need for a population-specific reference was clear: the authors note that “only 2% of individuals included in [genome-wide association studies] are of African ancestry” but “genetic disease risk may differ [between populations], especially for individuals of African ancestry.” The new assembly will serve as a foundation for more accurate interpretation of genetic risk in this population in future studies.

“With the advent of personal genomics, population-based genetics as part of an individual’s genome is indispensable for precision medicine,” the scientists note. Without a reference-grade assembly to use for comparison, people of African descent — particularly those with North African ancestry — are at risk of continued health disparities. To address this issue, the team generated a complete genome assembly of an Egyptian male based on SMRT Sequencing.

A high-quality de novo assembly of one individual was combined with population-level sequencing of 110 individuals to characterize variation in the Egyptian population. (Wohlers, I et al. 2020)

The “EGYPT” assembly spans 2.8 Gb and includes 20 Mb of novel sequence not seen in GRCh38. The authors report that the assembly is “high quality… comparable to the publicly available assembly AK1 of a Korean individual and the assembly of a Yoruba individual.”

Wohlers et al. also sequenced 110 Egyptian individuals with short reads, which were compared to the reference genome to identify population-specific variation. They called nearly 20 million single nucleotide variants and small indels, and more than 120,000 structural variants. The authors explain that it is key to understand variation as “differences in allele frequencies and linkage disequilibrium between Egyptians and Europeans may compromise the transferability of European ancestry-based genetic disease risk and polygenic scores, substantiating the need for multi-ethnic genome references.”

“The Egyptian genome reference will be a valuable resource for precision medicine,” the team adds. “The wealth of information it provides can be immediately utilized to study in-depth personal genomics and common Egyptian genetics and its impact on molecular phenotypes and disease.”

For more information about this project and its implications for improving genomic analysis, don’t miss this webinar and this poster from the authors.


SMRT Sequencing is being used to develop population-specific reference genomes as part of several international research efforts. Learn more about these projects and explore detailed assembly information in our interactive map.



Thursday, October 22, 2020

PacBio and Invitae Team Up to Develop Whole Genome Sequencing-Based Assays for Pediatric Epilepsy Diagnostics

We’re excited to announce a research collaboration with Invitae focused on the investigation of clinically relevant molecular targets for use in the development of advanced diagnostic testing for epilepsy. To support this collaboration, Invitae is expanding its PacBio sequencing capacity to meet the growing demand for clinical applications dependent on highly accurate genomic information.

More than half of epilepsies can be traced to a genetic cause. When a child presents with seizures, genetic testing can help identify more than 100 underlying, often rare conditions. Early genetic testing may be the most cost-effective, direct, and accurate diagnostic tool for children, shortening lengthy diagnostic odysseys. Delays in diagnosis can be devastating for children, as some genetic epilepsies are neurodegenerative and early symptoms may be subtle and easy to misdiagnose.

The Behind the Seizure program is a prominent collaborative program established by BioMarin and Invitae that was developed to provide faster diagnosis for young children with epilepsy in many regions around the world. Participants in the Behind the Seizure program are diagnosed one to two years sooner than reported averages.

The first phase of our research collaboration is focused on a whole genome sequencing study of a large pediatric epilepsy patient cohort derived from the Behind the Seizure program. HiFi sequencing will be performed to generate comprehensive variant profiles used to investigate the genetic etiology of epilepsy. The research is intended to accelerate Invitae’s development of assays to help patients who have been unable to get a diagnosis with conventional short-read sequencing technologies and facilitate improved treatment options based on specific genetic targets.

In a statement announcing this news, Invitae Chief Medical Officer Robert Nussbaum said: “Through this research collaboration with PacBio, Invitae aims to develop innovative methods that will provide more accurate answers to individuals living with epilepsy and their healthcare providers.”

Our CEO Christian Henry added: “We are honored to partner with Invitae, a recognized leader in genetics, to co-develop methods that have the potential to support earlier genetic testing and intervention to aid treatment selection for millions of people living with epilepsy worldwide. Working with leading organizations such as Invitae is an important part of our strategy to accelerate the use of our highly accurate long-read sequencing platform in large-scale whole genome sequencing initiatives.”


Learn more about the benefits of PacBio whole genome sequencing and discover how highly accurate long reads can advance your neuroscience research.



Wednesday, October 21, 2020

The HiFi Sequencing Advantage for Metagenome Assembly

Assembly and binning of metagenome data are the first steps in many metagenomics analysis pipelines, and with good reason. Metagenome assembled genomes (MAGs) and circularized MAGs (CMAGs) allow recovery of complete genes and operons, thereby improving predictions of metabolic capacities. MAGs also provide information about gene synteny and enable better taxonomic profiling. However, as discussed in a recent review by Chen et al., draft MAGs with poor completeness or high contamination can lead to incorrect conclusions.

One way to improve assembly completeness and contiguity is to use long-read sequencing. However, not all long reads are the same. Did you know that once read lengths are longer than most of the repeats in a genome or metagenome, incremental gains in raw read accuracy improve assemblies faster than higher coverage or even large gains in read length?

Figure 1: The need for error correction presents unique challenges for metagenome assembly, where the error rate of noisy long reads can exceed the true differences between closely related community members.

One of the main hurdles in metagenome assembly is the presence of multiple closely related strains and species in the same sample, which leads to tangled assembly graphs. While long reads are helpful in resolving these, if the difference between two bacterial species (often defined as 3%) is less than the raw error rate of your sequencing data, overlap assembly remains problematic. This is because with noisy long reads, assembly is typically preceded by an error-correction step where the raw reads are mapped against each other to produce high accuracy consensus reads.
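A back-of-the-envelope model helps show why read accuracy matters here. The sketch below assumes that errors in two overlapping reads are independent and never coincide, so the expected identity of an overlap is roughly the product of the two read accuracies, reduced further by any true divergence between the source genomes. This is a simplification made for illustration, not how any particular assembler scores overlaps, and the function name is hypothetical.

```python
def expected_overlap_identity(read_accuracy: float, divergence: float = 0.0) -> float:
    """Approximate identity between two overlapping reads.

    read_accuracy: per-read accuracy (both reads assumed equal).
    divergence: true sequence difference between the source genomes.
    """
    return read_accuracy ** 2 * (1.0 - divergence)


for accuracy in (0.875, 0.95, 0.99, 0.999):
    same = expected_overlap_identity(accuracy)           # same species
    cross = expected_overlap_identity(accuracy, 0.03)    # 3% diverged species
    noise = 1.0 - same                                   # mismatches from errors alone
    print(f"accuracy={accuracy:.3f} same={same:.4f} cross={cross:.4f} noise={noise:.4f}")
```

Under these assumptions, at 87.5% accuracy the mismatches caused by sequencing errors alone (about 23%) dwarf the 3% species signal, while at 99.9% accuracy the errors contribute about 0.2% and the 3% difference stands out clearly.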

However, with metagenome data, this has the side-effect of collapsing and averaging reads that may actually be derived from different species. The ability to distinguish reads from closely related species or strains can be effectively erased during this first step, and the purity of the resulting contigs, the completeness of the MAGs, and the total size of the metagenome assembly can all be compromised. Read on for a detailed discussion and examples of how differences in read quality impact MAG assembly.

Higher read accuracy drives assembly quality

To understand how incremental changes in accuracy and differences in coverage affect metagenome assembly quality, we generated model metagenomics datasets with community member abundances that reflect a real fecal microbiome, drawing on references from Zou, et al. and the ‘Badread’ long read simulator (Wick, 2019). Noisy long reads were simulated from 160 microbial reference genomes with accuracy modes between 87.5% and 97.5%, and HiFi reads were modeled using a typical accuracy distribution (>99%) for 8 kb to 10 kb reads, an insert size commonly achievable for long read metagenome sequencing. The number of bases in each dataset was modeled after conservative Sequel II System yield of HiFi data from a metagenomics run (~20 Gb) and ONT PromethION (60 Gb) reported outputs (Shafin, 2020). The resulting model datasets were assembled with Canu 2.0, using the recommended parameters for ONT and HiFi datatypes.

Figure 2: In modelled metagenome data, raw reads with higher read accuracy generate contigs with higher purity.

With Canu, it is possible to trace which reads were used to generate each contig in the assembly, and we used this capability to calculate the purity of each contig. Specifically, we determined what fraction of reads did not originate from the reference genome that contributed the majority of reads used to assemble that contig.
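The purity calculation can be sketched as follows. The input format here, a list naming the source genome of each read in a contig, is an assumption for illustration; it is not Canu's own output format, and extracting that list from an assembly is outside the scope of this sketch.

```python
from collections import Counter
from typing import List, Tuple


def contig_purity(read_sources: List[str]) -> Tuple[str, float]:
    """Return (majority_genome, purity) for one contig.

    read_sources holds the reference genome each read was simulated from.
    Purity is the fraction of reads from the majority genome; the fraction
    of reads from any other genome is 1 - purity.
    """
    if not read_sources:
        raise ValueError("contig has no reads")
    counts = Counter(read_sources)
    majority_genome, majority_count = counts.most_common(1)[0]
    return majority_genome, majority_count / len(read_sources)


# Hypothetical contig built from four reads, one of them from another genome
genome, purity = contig_purity(["genome_A", "genome_A", "genome_A", "genome_B"])
```

In this toy case the contig is assigned to `genome_A` with a purity of 0.75, meaning a quarter of its reads came from a different reference.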

As shown in Figure 2, there are limited gains in contig purity even as accuracy changes from 85% to 97.5%. However, there is a sharp transition in contig purity when read accuracy surpasses 99%, exceeding the inter-species similarity commonly seen in a complex fecal community.




High-error reads compromise the assembly of low abundance species

Another challenge with using self-error correction ahead of metagenome assembly relates to the uneven proportion of different species in the data. Error correction typically requires ~30-fold coverage to be effective. However, in metagenomes, it is common for species to be present at a wide range of relative abundances. This means that even when there is enough coverage of highly abundant species for error correction, reads from lower abundance species may fail the initial error correction step and be omitted from the assembly. In the example of our model data set, even with three times more raw data, the 87.5% accuracy mode dataset assembles to less than half of the expected assembly size, with contigs that are significantly shorter than with more accurate reads. When the data accuracy surpasses the threshold of microbial interspecies differences, contiguity and assembly size leap dramatically despite lower sample coverage.
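The coverage arithmetic behind this point is simple enough to sketch. The example assumes that a species' relative abundance is its share of sequenced bases, and the community below is entirely hypothetical; only the ~30-fold threshold and the 20 Gb and 60 Gb dataset sizes come from the text.

```python
from typing import Dict, List, Tuple

ERROR_CORRECTION_MIN_COVERAGE = 30  # approximate threshold quoted in the text


def species_coverage(total_bases: float, abundance_fraction: float, genome_size: float) -> float:
    """Expected fold coverage of one species in a metagenome dataset."""
    return total_bases * abundance_fraction / genome_size


def species_below_threshold(
    total_bases: float,
    community: Dict[str, Tuple[float, float]],
    min_coverage: float = ERROR_CORRECTION_MIN_COVERAGE,
) -> List[str]:
    """Names of species whose expected coverage falls below min_coverage."""
    return [
        name
        for name, (abundance, genome_size) in community.items()
        if species_coverage(total_bases, abundance, genome_size) < min_coverage
    ]


# Hypothetical community: name -> (fraction of sequenced bases, genome size in bp)
community = {
    "abundant_species": (0.10, 4_000_000),
    "mid_species": (0.01, 5_000_000),
    "rare_species": (0.001, 5_000_000),
}
at_risk_20gb = species_below_threshold(20e9, community)
at_risk_60gb = species_below_threshold(60e9, community)
```

In this made-up community, the rare species reaches only about 4-fold coverage at 20 Gb and about 12-fold at 60 Gb, so tripling the data still leaves it under the error-correction threshold.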

An example of how this limitation plays out in a real-world sample can be seen in a cow rumen assembly that used self-corrected PacBio CLR reads with ~89% median accuracy (Bickhart, 2019). While the PacBio CLR assembly had higher contiguity than the Illumina assembly despite a 3-fold lower depth of sequencing, the Illumina assembly had superior completeness.

Closer inspection of the PacBio CLR data revealed that “the correction step removed 10% of the total reads for being singleton observations (zero overlaps with any other read) and trimmed the ends of 26% of the reads for having fewer than 2 overlaps.” The authors further noted that “this may have also impacted the assembly of low abundance or highly complex genomes in the sample by removing rare observations of DNA sequence”.

Figure 5. The proportion of same-sample Illumina reads that map to the cow rumen CLR assembly versus a sheep fecal HiFi assembly. Since HiFi reads do not require an error correction step, more data from low abundance species is available for the assembly step. (Bickhart, D., SMRT Leiden 2020 presentation)

In contrast, since HiFi reads do not need error correction, all the data, including observations from low abundance species, can be used in the assembly step. Accordingly, a more recent assembly of a sheep fecal sample that used HiFi data had significantly improved performance. In his SMRT Leiden talk, Derek Bickhart noted that while cow rumen and sheep fecal samples are different communities and therefore their assemblies are not an “apples to apples” comparison, the sheep fecal assembly, done with HiFi data, appears to have a significantly improved representation of low abundance species as gauged by the proportion of same-sample short read data that maps to the long read assembly.
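The presentation does not describe how the mapped proportion was computed, so the following Python sketch shows just one common way to derive such a figure, assuming the short reads have already been aligned to the assembly and written out as SAM text. The function name `mapped_fraction` and the example records are hypothetical; the flag bits used (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are those defined by the SAM format.

```python
def mapped_fraction(sam_lines):
    """Fraction of primary SAM records that are mapped.

    Header lines (starting with '@') are skipped. Secondary (0x100) and
    supplementary (0x800) records are ignored so each read end counts once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    if total == 0:
        raise ValueError("no primary alignment records found")
    return mapped / total


# Hypothetical, truncated SAM records (only the first fields matter here).
example = [
    "@SQ\tSN:contig_1\tLN:1000",
    "read1\t0\tcontig_1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "read3\t16\tcontig_1\t50\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read3\t256\tcontig_1\t90\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
]
print(f"mapped fraction: {mapped_fraction(example):.2f}")
```

A higher mapped fraction suggests that more of the community present in the short read data is represented in the long read assembly.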

One possible method for overcoming the long-read coverage bottleneck is to use short read data for error correction. However, this approach suffers from the same factors that limit short read metagenome assembly. Namely, short read data has GC bias and cannot be mapped uniquely to repetitive regions. Given that bacterial genomes can range from 13% to 75% GC, error correcting low accuracy long reads from all the species in a metagenome sample with short read data can be problematic.


The power of HiFi reads

With the unique combination of high accuracy and long read length, HiFi data shows promise for overcoming some of the longstanding challenges in metagenome assembly. Unlike noisy long reads, assembly of HiFi reads is unencumbered by an error correction step that can erase the variation needed to correctly assemble closely related species in complex communities and generate high quality MAGs and CMAGs. Furthermore, HiFi reads show potential for improving the representation and contiguity of low abundance species in metagenome assemblies.

HiFi data has already been making waves in the world of large genome assembly, first at PAG XXVIII in January 2020 and more recently at the precisionFDA Truth Challenge V2, which evaluated methods for variant calling in human genomes. We are excited to see what HiFi data will do for metagenome assembly as more researchers become aware of its potential.


Learn more about HiFi sequencing for metagenomics. To start planning your metagenome assembly experiment, connect with a PacBio scientist.


Chen L-X, et al. (2020) Accurate and complete genomes from metagenomes. Genome Research 30:1-19.

Bickhart D, et al. (2019) Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biology 20:153.

Wick RR. (2019) Badread: simulation of error-prone long reads. Journal of Open Source Software 4(36):1316.

Shafin K, Pesout T, Lorig-Roach R, et al. (2020) Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol.

Zou Y, Xue W, Luo G, et al. (2019) 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 37:179–185.



Thursday, October 15, 2020

Pediatric Partnership Powered by PacBio Aims to Solve Difficult Rare Disease Cases

Kids have lots of questions. But even the world’s top scientists don’t have all the answers — especially when it comes to rare genetic disorders afflicting children.

Our highly accurate long reads, known as HiFi reads and generated by our Sequel II and new Sequel IIe Systems, are helping researchers uncover disease-causing genetic variants that had previously gone undetected by other technologies, contributing to increased solve rates for rare diseases.

We’re particularly excited to see this technology applied to translational research in children. We will be collaborating with Children’s Mercy Kansas City as part of its Genomic Answers for Kids (GA4K) program, which aims to collect genomic data and health information for 30,000 children and their families over the next seven years, ultimately creating a database of nearly 100,000 genomes.

“We are delighted to be collaborating with the innovative scientists at PacBio as we bring their long-read sequencing data to bear on some of our most difficult cases of rare pediatric disease to give patients and families the answers they deserve,” said Tomi Pastinen, director of the Center for Pediatric Genomic Medicine at Children’s Mercy.

It is estimated that as many as 25 million Americans (approximately 1 in 13 people) are affected by a rare condition. Whole-genome and whole-exome sequencing are often used to try to diagnose these conditions, but this typically involves short-read sequencing, and causes are found in only ~25% to 50% of cases, leaving the majority of cases unsolved.

Hoping to overcome these odds, Children’s Mercy has recently invested in Sequel II Systems, with plans to use our Single Molecule, Real-Time (SMRT) Sequencing technology to generate HiFi reads to detect what short-read methods might have missed. Early results are encouraging and have already demonstrated increases in pathogenic variant and disease-gene discovery beyond what was possible with short-read methods.

The researchers will also be working with the Microsoft Genomics team to build Microsoft Azure cloud-based analysis solutions and a data repository for this unique dataset.

“The diagnosis journey for a child with a rare disease and their families can be long and often inconclusive. We believe the advancement of precision medicine with specialized technologies will be key to gaining a better understanding and early diagnosis of these debilitating and deadly diseases,” said Gregory Moore, corporate vice president, Microsoft Health.

We look forward to making a meaningful impact by increasing solve rates through this important partnership.


More information about how Children’s Mercy scientists are using HiFi sequencing will be presented in PacBio’s ancillary workshop Monday, October 26 from 1:00-2:00 pm ET during the American Society of Human Genetics (ASHG) Annual Meeting. Emily Farrow, Director of Laboratory Operations at the Genomic Medicine Center at Children’s Mercy, will give a talk entitled “Applications of Third Generation Sequencing in Unsolved Disease.” Free virtual event registration is available here.


See additional examples of the use of SMRT Sequencing in rare disease research and learn more about structural variant detection.
