This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Sunday is Rare Disease Day – a time to honor the patients, families, caregivers, and healthcare professionals who are part of the rare disease community.
At PacBio, we are passionate about supporting this community and providing tools that help improve the ability of scientists and clinicians to deliver valuable answers to families and reduce what can be a years-long diagnostic odyssey. And while each ‘rare’ disease may affect a limited number of people, collectively these diseases affect hundreds of millions of people around the world.
Since we last celebrated this special day, we’ve been particularly excited by the progress made by cutting-edge scientists and clinicians who are applying new technologies to find the genetic root causes of these diseases. Leading into Rare Disease Day, we’d like to highlight and acknowledge the work of these scientists who are striving to improve the lives of those affected by rare diseases.
In Missouri, the team at Children’s Mercy Kansas City recently announced the opening of a massive new pediatric research facility housing the Children’s Mercy Research Institute (CMRI). The institute, established in 2015 to accelerate precise diagnoses and treatments for complex childhood diseases, is built on a translational approach that brings science and medicine together seamlessly.
One of the institute’s most important research projects is Genomic Answers for Kids (GA4K), a first-of-its-kind pediatric data repository that is collecting genomic data and health information from 30,000 children and their families during the next seven years to create a database of 100,000 genomes. More than 2,230 families with rare disease have enrolled in the program to date, resulting in more than 10,200 new genomic analyses and more than 250 genetic diagnoses, and the program has already contributed to the reporting of 10 new disease genes.
GA4K focuses on rare diseases and has been solving previously unsolvable cases by implementing highly accurate long-read sequencing, known as HiFi sequencing. Based on early successes, the team has scaled up its capacity with additional Sequel IIe Systems and aims to use HiFi whole genome sequencing for approximately 1,000 cases that went unsolved after the preliminary short-read exome analysis.
Meanwhile, in Alabama, scientists at the HudsonAlpha Institute for Biotechnology recently announced that they found likely pathogenic variants in two pediatric rare disease cases that had remained unsolved using short-read sequencing. In both cases, the patients suffered from neurodevelopmental disorders. The scientists were able to pinpoint the disease-causing genetic variants through whole genome sequencing of parent-proband trios. One of the pathogenic variants was a 7 kb insertion in the CDKL5 gene, while in the other instance an extensive structural variation was highlighted. Both variant types are known to be challenging for short-read sequencing technologies and were therefore not discovered in the preliminary analysis.
“The ability to find so many variants that were previously missed is exciting, and holds great promise for diagnostic testing in the future,” says HudsonAlpha Faculty Investigator Greg Cooper, PhD. “Long-read genome sequencing will become a powerful tool for research and clinical testing over the next few years.”
One of the earliest examples of how PacBio sequencing technology could make a difference for rare disease cases came from the Stanford lab of Euan Ashley, a noted cardiologist who just released a new book, The Genome Odyssey: Medical Mysteries and the Incredible Quest to Solve Them. The book includes, among many others, a fascinating case of Carney complex in an individual who had suffered a series of tumors in his heart and glands, for whom eight years of genetic analyses had produced no firm answers.
These are just a few of the many great advancements among rare disease experts who are making new inroads into tough cases with HiFi sequencing. It is critical to remember that each of these explained cases represents a family that is now closer to the end of their diagnostic odyssey, to potential treatment options, and to renewed hope for a healthier future. We send our sincere gratitude to them and everyone working hard to accelerate the development of medical advancements in rare disease research.
If you’d like to participate in this wonderful community, take a look at these upcoming events in support of rare disease research awareness, funding and education:
- Rare Disease Day strives to raise awareness amongst the public and decision-makers about rare diseases and their impact on patients’ lives – to show your support, you can use your social channels to amplify and tag #RareDiseaseDay
- Collaborate with and support Children’s Mercy’s Genomic Answers for Kids program by nominating a patient for participation and/or donating to support their vision
- The HudsonAlpha team is hosting the Double Helix Dash in April, a virtual 5K to support childhood genetic disorders research – anyone can participate!
- PacBio is hosting a 3-day virtual event in April focused on the genetics of rare disease – register to attend and hear firsthand from scientists and clinicians on their recent discoveries
To learn more about how PacBio HiFi sequencing is helping advance our understanding of rare disease, visit our rare disease resource page.
What does the ideal genome assembly look like? High-quality, free of errors, with no gaps, and all haplotypes resolved.
It’s a big ask, especially with challenging genomes like those of plants, which are rich in repetitive content and often show high levels of heterozygosity and complex polyploidy. Moreover, such assemblies often require a combination of technologies, such as sequencing plus optical mapping.
But a team of scientists at the King Abdullah University of Science and Technology (KAUST) Core Labs (@kaust_corelabs) proved it is possible by using one technology, PacBio HiFi Sequencing, in just seven days.
Their recent preprint introduced LeafGo, a streamlined workflow able to produce a high-quality draft plant genome from plant tissue without using additional scaffolding technologies.
The rapid, one-pass approach was tested on two different Eucalyptus species, E. rudis and E. camaldulensis.
There are more than 800 eucalypt species, but only three genomes have been published: E. grandis, E. pauciflora and E. camaldulensis. The high-quality draft E. camaldulensis genome produced with LeafGo is an improvement upon those highly fragmented genomes, the KAUST team wrote.
Their assembly of E. rudis, a close relative of E. camaldulensis that inhabits a different ecological niche, is the first for that species.
“The two genomes sequenced here will improve our genomic knowledge of eucalypts, which at the moment is relatively sparse, and will assist with conservation issues and commercial uses,” they wrote.
The team tested both continuous long read (CLR) and HiFi circular consensus sequencing (CCS) data, and were especially impressed with the results from HiFi reads — “the higher base-level accuracy given by HiFi improves the assembly considerably, thus removing the need for polishing with short-read sequencing.”
“HiFi assemblies demanded less computational requirements, had higher BUSCO scores, showed several fold improvement of contig N50/N90 and L50/L90, and generated more complete genome assemblies,” the authors wrote.
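The contiguity metrics the authors cite can be computed directly from a list of contig lengths. The sketch below uses the standard definitions: N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. The function name `contig_stats` and the toy lengths are illustrative assumptions, not values from the preprint.

```python
def contig_stats(lengths, percent=50):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the cumulative length,
    taken from the longest contig downward, first reaches `percent`
    of the total assembly length. Lx is how many contigs that took.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = contig_stats(toy_contigs)
n90, l90 = contig_stats(toy_contigs, percent=90)
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")
```

A more contiguous assembly shows a higher N50/N90 and a lower L50/L90, which is the direction of improvement the authors describe for the HiFi assemblies.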
“In fact, our HiFi sequencing data, assembled with hifiasm, produced near-chromosome level haploid draft genomes,” they added.
“One of the main advantages for our chosen genome assembly workflow, using hifiasm with HiFi reads, are the savings in time and compute requirements, all with minimal manual intervention.”
The estimated total time from raw reads to HiFi data to the assembly of a high-quality contiguous draft for a haploid genome of 0.6 to 1.0 Gb is approximately one day, they wrote. Assembling the HiFi data using hifiasm took 80 minutes for E. rudis (23x coverage) and 120 minutes for E. camaldulensis (27x coverage).
“When combined with time estimates of HMW DNA extraction (one day), HiFi library preparation and sequencing (five days) and assembly; a high-quality draft genome can be prepared from plant samples in seven days, depending on available compute resources,” the authors stated.
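The seven-day figure follows from adding the stage estimates quoted above. A minimal sketch of that arithmetic, where the stage labels are paraphrased and the one-day assembly figure is the authors' approximate estimate for a 0.6 to 1.0 Gb haploid genome:

```python
# Stage durations in days, taken from the estimates quoted in the text.
stages = {
    "HMW DNA extraction": 1,
    "HiFi library preparation and sequencing": 5,
    "raw reads to assembled draft (approximate)": 1,
}

total_days = sum(stages.values())
for name, days in stages.items():
    print(f"{name}: {days} day(s)")
print(f"Total: {total_days} days")
```

As the authors note, the assembly stage depends on available compute resources, so the total is an estimate rather than a fixed turnaround.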
The team also created a modified Qiagen Genomic protocol in order to tackle the challenge of extracting high molecular weight DNA from the Eucalyptus species, which is difficult due to their high phenolic and polysaccharide content.
“Our extraction protocol generated high pure and copious amounts of HMW DNA within a day and using minimal resources and effort,” they wrote.
The authors say they hope LeafGo will be a valuable tool for global initiatives to sequence and assemble genomes for many thousands of eukaryotic life forms that do not yet have published standardized workflows.
They may not be as well known as our chimpanzee or gorilla cousins, but macaques have played many key roles in scientific progress over the last half century. From launching into orbit during the early days of space travel to revealing the genetics of neurodevelopmental disorders and infectious diseases today, the rhesus macaque remains a key research primate around the world.
A new, comprehensively annotated reference genome unveiled last month boosts the potential of the most widely used non-human primate in biomedical research even further, with new insight into gene functions and disease susceptibility.
A large team of researchers — led by Wesley C. Warren of the University of Missouri; Evan E. Eichler of the University of Washington; and Jeffrey Rogers of Baylor College of Medicine — has released an updated rhesus macaque (Macaca mulatta) reference genome that increases the sequence contiguity 120-fold over previous assemblies.
They also used PacBio full-length isoform sequencing, the Iso-Seq method, to analyze 6.5 million full-length transcripts and create a comprehensive set of protein-coding and non-coding gene models. This provided vital information about gene content, organization and isoform diversity, and led to the identification of new macaque isoforms and gene candidates.
“With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility,” the authors wrote.
In addition to the new reference, the team gathered whole genome sequencing data from 850 rhesus macaques from U.S. research colonies and three wild-caught Chinese samples. Similar to the human 1000 Genomes project, they wanted to catalog genetic variation within the species, and ended up identifying 85.7 million single-nucleotide variants (SNVs) and 10.5 million indels — the most extensive collection of segregating genetic variants for any non-human primate species.
The researchers found that an average research macaque carried 9.7 million SNVs — more than twice as diverse per individual as humans — and they now plan on using the information to better understand aspects of genome function, such as gene regulation, and to generate new genetic models of disease.
By studying what these nucleotide changes might do to genes implicated in autism spectrum disorders and predicting likely gene-disrupting (LGD) variants, for example, researchers may find that the macaques offer new clues to the often heterogeneous genetic condition.
These naturally occurring mutations could provide an opportunity to develop noninvasive models of human disease without the expense of CRISPR engineering of embryos, they added, which would be particularly useful in relation to phenotypes that are not readily reproduced in non-primate knockout models and for evaluating the effect of genetic variation on the efficacy of treatments before human trials.
“This new macaque reference genome and the genetic characterization of research populations will substantially advance biomedical research and studies of primate genome evolution by providing an improved framework for more complete studies of genetic variation and its phenotypic consequence,” the authors concluded.
Scientists have long struggled to explain the success of Mycobacterium tuberculosis in the face of effective therapeutics. Tuberculosis (TB) kills more than 1 million people annually, and has a remarkable ability to develop resistance to drugs despite its stable genome. But now, a new study from researchers at San Diego State University and other institutions strongly suggests that methylation rather than genome sequence gives M. tuberculosis its broad phenotypic range.
Lead author Samuel Modlin (@sam_modlin), senior author Faramarz Valafar (@FaramarzValafar), and collaborators report using SMRT Sequencing technology to characterize the DNA adenine methylomes of 93 clinical TB isolates. They chose samples representing diverse phylogenetic and geographic sources, and focused on methylation because of previous, smaller studies suggesting the importance of methyltransferases in M. tuberculosis. They aimed to delve deeper than those studies to see if whole methylome data could answer lingering questions about the pathogen.
“It is unclear how such a genetically static organism adapts so rapidly to drug treatment and varied immune pressures,” the scientists note. “DNA methylation is a plausible yet scarcely explored alternative mechanism for phenotypic variation in M. tuberculosis.”
In addition to producing highly accurate genetic data, SMRT Sequencing also measures epigenetic activity through kinetic changes as the DNA molecule is sequenced. The use of PacBio’s long-read technology proved critical: the long-read data enabled de novo assembly, unlike short reads that must be used with a reference-based variant calling approach. The new assemblies of the 93 isolates revealed an insertion, missed by previous studies, that is associated with an inactive methyltransferase.
The team deployed several techniques to analyze the samples comprehensively. They produced complete, de novo genome assemblies for all isolates with SMRT Sequencing — identifying all mutations in the three methyltransferases present — and also used the kinetic data to assess methyltransferase motif sites. Phylogenetic analysis allowed them to identify epigenomic diversity across seven lineages. Finally, the team used existing transcriptomic data sets to layer onto the methylome information for a deeper analysis.
Perhaps most interestingly, the scientists used an analysis pipeline to analyze SMRT Sequencing kinetic data from each individual read, rather than in bulk. The results indicate a phenomenon the team refers to as intercellular mosaic methylation, or IMM, in which methylation is not strictly turned on or off but rather affects a subset of motif sites that vary from one cell to another.
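To make the idea of read-level analysis concrete, the sketch below summarizes hypothetical per-read methylation calls at each motif site and flags sites where only some reads are methylated. This is not the authors' pipeline: the input format, the function name `classify_sites`, and the 10% and 90% cutoffs are all assumptions chosen for illustration.

```python
def classify_sites(read_calls, low=0.1, high=0.9):
    """Label each motif site from per-read methylation calls.

    `read_calls` maps a site identifier to a list of booleans, one per
    read covering that site (True = read called methylated). The `low`
    and `high` cutoffs are illustrative assumptions.
    """
    summary = {}
    for site, calls in read_calls.items():
        if not calls:
            continue  # no coverage at this site, nothing to report
        fraction = sum(calls) / len(calls)
        if fraction >= high:
            label = "methylated"
        elif fraction <= low:
            label = "unmethylated"
        else:
            label = "mosaic"
        summary[site] = (fraction, label)
    return summary


# Hypothetical calls for three sites, ten reads each.
toy_calls = {
    "site_A": [True] * 10,
    "site_B": [False] * 10,
    "site_C": [True] * 6 + [False] * 4,
}
result = classify_sites(toy_calls)
for site, (fraction, label) in result.items():
    print(f"{site}: {fraction:.0%} of reads methylated -> {label}")
```

A bulk analysis would report only one aggregate value per site, whereas keeping the calls per read is what lets a partially methylated site such as `site_C` stand out.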
“Mutation-driven IMM was nearly ubiquitous in the globally prominent Beijing sublineage,” they report. They also identified more than 350 hypervariable sites across the isolates where there appeared to be no consistency in methylation patterns. All told, they add, the results represent “the largest survey of methylomic diversity in [TB pathogens] to date.”
“This multi-omic integration revealed features of methylomic variability in clinical isolates and provides a rational basis for hypothesizing the functions of DNA adenine methylation in [M. tuberculosis] physiology and adaptive evolution,” the authors conclude.
“These findings add to the growing body of literature demonstrating bacterial epigenomics is an important complementary focus to genetic and phenotypic analysis in studying microbial diversity, gene regulation, and evolution.”
Learn more about methylation detection using PacBio sequencing for your research.
The power of PacBio HiFi reads has enabled transformative research into human disease. A new collaboration with Invitae, a leader in medical genetics, is intended to help harness the technology for use in mainstream medicine.
The ability of HiFi reads to detect genetic variants, even in hard-to-sequence regions of the genome, has already shown clinical utility. In a recent research collaboration with Invitae, announced in October 2020, the comprehensive, highly accurate reads were used to explore clinically relevant molecular targets for use in the development of advanced diagnostic testing for epilepsy.
We are thrilled to announce a new collaboration with Invitae to develop an ultra-high-throughput clinical whole genome sequencing platform. Read more about it here.
Learn more about the benefits and workflows of PacBio whole genome sequencing
Felix, Garfield, Leo the Lion — despite their differences, the genomes of these frisky felines are highly conserved across the family, even among its most divergent members.
A new set of highly contiguous haploid Felis and Prionailurus assemblies provides further proof that although the genomes appear karyotypically distinct, they are grossly collinear, and any cytogenetic differences represent centromere repositioning rather than chromosomal rearrangements.
A team of researchers at Texas A&M University, led by Kevin R Bredemeyer, Andrew Harris and William J Murphy, worked with colleagues in the United States, China and Russia to create a de novo assembly of a Bengal hybrid cat, as well as phased haplotypes of its parents, a random-bred domestic cat (Felis catus) and an Asian leopard cat (Prionailurus bengalensis).
As reported in the Journal of Heredity, the assemblies offer significant improvements over the previous domestic cat reference genome, with a 100% increase in contiguity and the capture of the vast majority of chromosome arms in one or two large contigs.
Previous diploid-based genome assemblies for the domestic cat suffered from poor resolution of complex and highly repetitive regions, with substantial amounts of unplaced sequence that is polymorphic or copy number variable, the authors noted.
“These difficult to assemble regions are increasingly understood as playing important roles in disease biology, genome organization, gene regulation, and speciation,” they added.
By using highly contiguous PacBio long reads, the team was able to capture complex repetitive regions previously un-spanned due to insufficient read lengths and/or high haplotype divergence, as well as resolve multicopy gene families with high allelic diversity (such as the Major Histocompatibility Locus and olfactory receptors).
“Furthermore, we have provided a genome assembly from a random-bred domestic cat, which is more representative of the domestic cat pet population,” they wrote.
Adding to a growing collection of assembly methods, they also demonstrated that comparably accurate F1 haplotype phasing can be achieved with members of the same species when one or both parents of the trio are not available — an important ability, since F1 interspecies hybrids are rare biological resources, and in many cases it may be logistically impossible to obtain the actual parents of the cross.
As they noted in their paper, cats are not only some of the most popular companion animals; species from the cat family Felidae also serve as a powerful system for genetic analysis of inherited and infectious disease, and the study of domestic cats can help in the conservation of their wild cousins.
“These novel genome resources will empower studies of feline precision medicine, adaptation and speciation,” the authors wrote.
To hear from fellow scientists about their latest plant & animal discoveries, register to attend PAGBio Day, a virtual half-day event on January 19th. Explore how to use PacBio whole genome sequencing for your project.
The year the world went virtual is virtually over, so what better time to reflect on all the great online offerings featuring SMRT Sequencing this year? While we would rather have gathered in exotic locales to see you in person and share our science, 2020 did provide some amazing opportunities to go global and broadcast worldwide.
Here are some of the highlights:
What better place to start than with an introduction to our technology, followed by a panel of sequencing experts — Melissa Laird-Smith (@SmithLab_UofL), Michael Hartigan, and Olga Vinnere Pettersson (@OlgaVPettersson) — with some sequencing basics: explaining long reads and their utility, how PacBio long-read sequencing differs from other technologies, and the applications PacBio offers and how they can benefit scientific research.
Our popular regional user group gatherings were combined into one mega meeting this year, which meant nearly 30 hours of talks across time zones, over the course of two days. With so many sessions, it’s impossible to cover them all here, but suffice it to say, there were many magnificent presentations by our users, plus hands-on workshops, live Q&As, and a meet-and-greet with our new CEO, Christian Henry. Well worth spending time catching up!
Register here to watch these presentations on-demand
We rang in 2020 at the ever-popular PAG meeting. In addition to an overview of the year ahead by CSO Jonas Korlach, our workshop featured several great talks by our users, including an update on the Sanger Institute’s Darwin Tree of Life project by Mark Blaxter (@blaxterlab); a talk about the tetraploid rose assembly by Bart Nijland (@bart3601) of Genetwister Technologies; great apes work by Zev Kronenberg (@zevkronenberg); and a discussion of plant-living fungi by Jana U’Ren (@you_wren) of the University of Arizona.
Missing PAG 2021 in January? Join us for PAGBio Day, our online alternative, on Jan. 19. Save your seat!
Although we sure missed spending a few glorious spring days in the stunning Dutch city of Leiden, we were delighted to make our annual European user gathering international. This fave meet-up included not only top talks by plant, animal, human and microbial scientists, but also a strong offering of bioinformatics sessions. Keynotes included Vertebrate Genomes Project leaders Erich Jarvis (@erichjarvis) and Sergey Koren (@sergekoren) from the National Institutes of Health.
Register here to watch these presentations on-demand.
One of the last conferences we were able to attend in person, AGBT featured several informative sessions. Tina Graves-Lindsay from the McDonnell Genome Institute (@GenomeInstitute) described how her team is using PacBio sequencing to produce reference-grade human genome assemblies. Adam Ameur (@_adameur) from Uppsala University discussed the use of long-read sequencing to detect off-target results from CRISPR-Cas9 gene editing studies. And Brenda Oppert from the USDA made a convincing argument for developing insect-based food sources for people.
Our first all-day event dedicated to neuroscience, #PBNeuroDay showcased a lot of emerging rare disease research. From unravelling repeat expansions to creating new methods of carrier screening, the 25 sessions tackled a wide array of topics and diseases, from ALS, Alzheimer’s and ataxia to muscular dystrophy, Parkinson’s, progressive supranuclear palsy and schizophrenia.
Register here to watch these presentations on-demand
ASHG featured a wide variety of talks and video poster presentations covering a range of applications using PacBio long-read sequencing technology, from single-cell isoform analysis of the nervous system (by Hagen Tilgner @hagentilgner of Weill Cornell) to solving rare disease cases in children (by Emily Farrow of Children’s Mercy). After hearing from our users, be sure to check out the handy overviews by PacBio experts Aaron Wenger and Liz Tseng (@magdoll).
This Spanish-language webinar was a huge hit. Carmen Guarco, Senior Field Application Scientist specializing in bioinformatics, was joined by Álvaro G. Hernandez (@UIUC_DNAseq), Director of DNA services at the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign, to offer an overview of PacBio technology and tips for getting the most out of HiFi reads.
As it becomes increasingly clear that single reference genomes for each species are not enough, many scientists are interested in creating pangenome collections. So we brought together two experts, Kevin Fengler of Corteva and Matthias H. Weissensteiner (@MWeissensteiner) of Penn State, to discuss the advantages of sequencing multiple individuals to gain comprehensive views of genetic variation, and the speed, cost, and accuracy benefits of using HiFi reads to sequence species of interest.
How can the Sequel II System help with complex metagenomics projects? Meredith Ashby (@AsbhyMere), Director of Microbial Genomics at PacBio, was joined by Bing Ma of the Institute for Genome Sciences at the University of Maryland, who discussed her work using long-read sequencing to identify high-resolution microbial biomarkers associated with leaky gut syndrome in premature infants. George Weinstock (@geowei) of The Jackson Laboratory talked about the potential of highly accurate long reads to enable strain-level resolution of the human gut microbiome by resolving intraspecies variation in multiple copies of the 16S gene.
Long-Read Sequencing in COVID-19 Research
We’d be remiss not to mention our talks covering COVID-19 itself. In the Labroots webinar “Opportunities for using PacBio Long-read Sequencing for COVID-19 Research,” Meredith Ashby, Director of Microbial Genomics at PacBio, described how HiFi sequencing could be used for mutation phasing and rare variant detection to understand viral stability and mutation rates, as well as to provide insights into viral population structure for monitoring viral evolution. In “Understanding SARS-CoV-2 and Host Immune Response to COVID-19 with PacBio Sequencing,” Melissa Laird-Smith (@SmithLab_UofL) discussed her work evaluating the impact of host immune restriction in health and disease with high-resolution HLA typing, and Corey Watson (@ctwatson29) of the University of Louisville School of Medicine talked about overcoming complexity to elucidate the role of IGH haplotype diversity in antibody-mediated immunity.
We look forward to plenty more new discoveries in 2021 – Happy New Year!
Celiac disease happens in the gut, but scientists still don’t fully understand the complex interplay between host genetics and the environmental factors that lead to the development of the autoimmune digestive disease.
Researchers at the Mucosal Immunology and Biology Research Center of MassGeneral Hospital for Children and Harvard Medical School are hoping to shed light on the ‘microbial dark matter’ in the breastmilk of mothers with celiac disease and in the intestine of celiac children using full-length 16S rRNA and metagenome sequencing — they will be supported in their efforts by the 2020 Microbial Genomics SMRT Grant.
Ali R. Zomorrodi (@arzomorrodi), the research center’s computational and systems biology lead and an Instructor of Pediatrics at Harvard Medical School, has teamed up with Alessio Fasano, a renowned celiac disease specialist, chief of Pediatric Gastroenterology and Nutrition at Massachusetts General Hospital and a Professor of Pediatrics at Harvard Medical School, to tackle these questions.
By leveraging a large-scale prospective birth cohort study referred to as the Celiac Disease Genomic, Environmental, Microbiome, and Metabolomic (CDGEMM) study, led by Dr. Fasano, they have a unique opportunity to delve more deeply into the role of microbiota in the etiology of the disease. The CDGEMM study has been banking stool, breastmilk and other biospecimens, as well as clinical metadata from ~500 infants with high risk of celiac disease from birth through childhood.
HiFi Sequencing for Strain-Level Resolution
In celiac disease, one of the most common forms of food intolerance worldwide, the ingestion of gluten-containing grains triggers an immune response that attacks and progressively damages the small intestine. It is a unique model of autoimmune diseases since it is the only such disease for which the environmental trigger (exposure to gluten) and genetic risk factors are well-characterized.
Research shows that 30% of the population are genetically susceptible to celiac disease and are exposed to gluten, yet only 2-3% of them develop the disease. Recent studies suggest a critical role for the gut microbiota in celiac disease pathogenesis, but how exposure to environmental risk factors other than gluten early in life may alter the engraftment of the gut microbiota in infants at risk of the disease is still poorly understood. So, one aspect of Zomorrodi’s project involves investigating the role breast milk may play in the engraftment of the gut microbiota in CDGEMM infants.
Zomorrodi and colleagues will use full-length 16S rRNA HiFi sequencing to see whether the composition of breast milk microbiota in mothers of CDGEMM infants with a history of celiac disease is different from that of mothers without the disease. They will also use this technology to study fecal microbiota from infants of these mothers and fecal microbiota of at-risk infants who consume formula, to explore whether the breast milk microbiota has any effect on the composition of the babies’ intestinal microbiota. While conventional short-read 16S sequencing can identify microbes at the genus or sometimes species level, 16S HiFi sequencing would allow the team to profile the microbiota at strain-level resolution. This will enable researchers to gain deeper insights into the role of breast milk microbiota in shaping the intestinal microbiota of infants at risk of celiac disease.
A Deeper Understanding of the Intestinal Microbiome
In another project, Zomorrodi and his colleagues will be applying HiFi metagenomic sequencing to fecal samples from CDGEMM children who developed celiac disease and matched controls who did not. They hope to comprehensively characterize the intestinal microbiome composition of these subjects at strain-level resolution and to identify celiac-specific biomarkers in the microbiome.
“This is important, because we know that many diseases are driven by specific strains within the same species. Healthy people may carry the same species, but do not become ill. We want to know why,” Zomorrodi said.
Zomorrodi also wants to capture as much of the microbiome as possible using this innovative technology.
“A good proportion (up to 50%) of the short-read metagenomic data we collect cannot be mapped to any database during the taxonomic or functional profiling processes. This means that we are losing a significant portion of the data that could contain a lot of valuable information about the microbiome,” he said.
In addition to increasing the chance of identifying low-abundance microbes that may not be otherwise identified using short-read methods, the HiFi reads will enable a finer-level functional characterization of the microbiota and the assembly of closed genomes for novel microbial strains that do not exist in databases. These closed genomes can serve as a basis for downstream computational investigation of the microbiota function, such as constructing computational genome-scale models of metabolism.
“This study could go a long way towards finding celiac-specific biomarkers and designing targeted microbiota intervention strategies to treat celiac disease and other autoimmune diseases.”
Zomorrodi said he is looking forward to his first experience using PacBio long-read sequencing. Once a skeptic, he was sold on the value of full-length 16S rRNA and metagenomic HiFi sequencing after seeing data presented at a Cold Spring Harbor microbiome conference.
“It is really amazing,” he said. “I believe the field of the microbiome will be moving forward into using long-read technology. There are really lots of exciting opportunities that didn’t exist before at this level of resolution.”
We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, Maryland Genomics, for supporting the 2020 Microbial Genomics SMRT Grant Program. Explore the 2020 HiFi For All – Collaborations SMRT Grant Program to apply to have your project funded.
Scientists at Stanford University and the Icahn School of Medicine at Mount Sinai have made impressive strides in resolving variants in the SLC6A4 promoter associated with susceptibility to psychiatric disorders and response to antidepressants. This progress was made possible with highly accurate, long-read sequencing, known as HiFi sequencing.
Published in the journal Genes, the paper comes from lead author Mariana Botton, senior author Stuart Scott, and collaborators. It describes a SMRT Sequencing-based approach to analyzing amplicons of the SLC6A4 promoter region, which is noted for “a variable number of homologous 20–24 bp repeats,” the authors write, as well as long, extra-long, short, and extra-short alleles with differing expression. The gene itself is important for pharmacodynamics of antidepressants, one of the most frequently prescribed classes of drugs.
As Botton et al. note, identifying key variants within the promoter region is most valuable in the context of haplotypes showing whether variants share an allele. Unfortunately, that information is not easy to access. “Short-read sequencing is not effective at accurately interrogating the SLC6A4 promoter, particularly across the VNTR that includes the 5-HTTLPR insertion/deletion (L>S) polymorphism,” the scientists report. “This overarching limitation of short-read sequencing has previously been acknowledged, as low complexity regions and tandem repeats in the human genome are notoriously challenging for short-read platforms.”
With that in mind, the team turned to long-read sequencing data from PacBio. They designed four overlapping primer sets to span the SLC6A4 promoter region and the necessary oligo tags for barcoding, and then sequenced the resulting amplicons for 120 samples with SMRT Sequencing. They also performed Sanger sequencing and gathered publicly available short-read data for many of the samples and compared genotype results across platforms.
The scientists found that three of the key variants “were either not detectable or incorrectly genotyped among the [various short-read data sets] in 32/32 (100%), 60/68 (88%) and 17/21 (81%) samples and 87/96 (91%), 85/204 (42%) and 34/63 (54%) variant sites, respectively.” PacBio sequencing, on the other hand, allowed for the detection of all variants, including rare extra-long alleles. “In addition to being more accurate at this locus than short-read sequencing, long-read SMRT sequencing also unambiguously phased the polymorphic SLC6A4 promoter in all samples, including complex compound heterozygous diplotypes,” the team adds. Sanger sequencing results for six samples confirmed the variants identified by SMRT Sequencing.
To assess the reproducibility of the SMRT Sequencing workflow, the team evaluated reference samples with known SLC6A4 variants in triplicate. “The intra- and inter-run genotype and diplotype concordances for the 15 control samples were both 100%,” the researchers write.
“Our innovative method enabled the phased resolution of complex SLC6A4 promoter diplotypes, which was not possible using short-read WGS data (~5X and ~45X) or high-depth capture-based short-read sequencing data (~330X),” the team notes. “SLC6A4 long-read SMRT sequencing is a reliable and validated third-generation sequencing technique that can accurately interrogate the low-complexity homologous SLC6A4 promoter region.”
Learn more about best practices and workflows for targeted sequencing in human biomedical research.
Scientists in China have used SMRT Sequencing to demonstrate the value of highly accurate long reads for identifying, linking, and phasing variants associated with a group of blood disorders known collectively as thalassemia. Ultimately, they predict in the Journal of Molecular Diagnostics, long-read sequencing could support a new carrier screening approach for prospective parents interested in knowing their risk of passing these diseases on to their children.
“Long molecule sequencing: a new approach for identification of clinically significant DNA variants in alpha and beta thalassemia carriers” comes from lead authors Liangpu Xu, Aiping Mao, and Hui Liu and collaborators at Fujian Provincial Maternity and Children’s Hospital, Berry Genomics, and other institutions.
The scientists undertook this project because of the need for improved carrier screening tools for ethnicities where thalassemia is prevalent, such as in Southeast Asian, Southern Chinese, Middle Eastern, Mediterranean and North African populations. While current screening methods have been fairly effective, the authors write, they “can be laborious and difficult to perform well at the laboratory level.”
Current screening techniques are unable to detect the broad range of variants and variant types associated with thalassemia. The usually fatal disease type known as α-thalassemia has been linked to nearly 130 pathogenic variants across the HBA1 and HBA2 genes, while its milder β-thalassemia counterpart can be caused by more than 200 pathogenic variants in the HBB gene. Seeking a potential approach that would provide more information about the entire spectrum of variants associated with thalassemia, the scientists turned to long-read SMRT Sequencing.
The team produced amplicons for all relevant genes. They then sequenced those regions on a Sequel System in a blinded study of 74 samples: 64 from known carriers, and 10 non-carrier controls. Results showed that “all HBA1/2 and HBB variants detected by [long-read sequencing] were concordant with those independently assigned by targeted PCR assays,” the authors report. SMRT Sequencing data correctly called the 20 known pathogenic variants in these samples. Importantly, the long-read sequencing technology “was able to discriminate compound heterozygous SNVs (trans configuration) and identify variants linked to benign SNPs (cis configuration),” the scientists add, noting that it also pinpointed linked variants which may increase accuracy of interpretation.
Based on these results, the team highlights some advantages offered by SMRT Sequencing as a possible carrier screening tool. “Since the entire gene regions are analyzed, the test has the potential to detect other HBA1/2 and HBB variants that may be outside the scope or difficult to accurately detect by traditional tests,” Xu et al. write. Also, the technology “should be capable of detecting other mild and silent HBB variants located in regulatory regions as well as HBB gene deletions that occur in approximately 1% of carriers.”
Finally, they note, being able to detect all of these variants in a single workflow with barcoded libraries makes the approach more scalable. SMRT Sequencing “displayed the hallmarks of a scalable, accurate and cost-effective genotyping methodology” that could “eventually serve as a comprehensive method for large-scale thalassemia carrier screening,” the team concludes. They also report that such an approach will become even more cost-effective when performed on the Sequel II System.
In a presentation at the PacBio Global Summit, paper co-author David Cram of Berry Genomics stated: “This type of technology would be very useful for carrier testing, not only thalassemia, but of other genetic conditions that involve complex genomic regions or copy number variants.”
Watch Cram’s full presentation: Comprehensive Analysis of Thalassemia Alleles (CATSA): A Universal Approach to Thalassemia Carrier Testing
Geneticists often point out that a human does not have “a” genome but rather two genomes, one inherited from the mother and another from the father. The number of complete sets of chromosomes in each cell, or haplotypes, is referred to as ploidy. Humans and most other animals are diploid (2N), having two sets. Many plants have higher ploidy, for example, the hexaploid (6N) California Redwood has 6 copies of each chromosome.
The number of chromosome sets not only increases the total amount of DNA in a genome, but it also increases the complexity of the genome – by increasing the number of alleles, or alternate forms of genes. Although the majority of the sequence between paired chromosomes is identical, it’s the differences that provide the breadth of biological variation within a species.
Phasing Haplotypes to Get a Complete Picture of Genetic Variation
Whether sequencing a giant polyploid or diploid, the goal remains the same – to get a complete and accurate representation of each copy of the genome or region of interest. This is often achieved by assembling a haploid (single copy) genome and then identifying variants, locations where the alleles differ. Many well-studied organisms, like humans, have standard haploid references against which other individuals are compared.
But identifying variants does not provide the complete sequence of the genome. That requires phasing, or determining which variants are from the same copy of a chromosome (in “cis”) and which are from different copies (in “trans”). One approach to phasing is to use mother-father-child trios: variants in the child’s genome that are only present in one parent must be on the same chromosome. A second approach is population inference, which deduces that variants often seen in the same people are likely in phase. Both trio and population phasing are imperfect, as they require additional information and are only able to phase some variants.
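As a rough illustration of the trio idea, the sketch below assigns a child's heterozygous variants to the maternal or paternal chromosome when the alternate allele is seen in only one parent. The function name, the input format (sets of variant identifiers), and the variant names are hypothetical and chosen only for this example; real trio phasing works on genotype calls and has to handle missing data and genotyping errors.

```python
def phase_by_trio(child_het, mother_alt, father_alt):
    """Assign each heterozygous child variant to a parental haplotype.

    child_het: variant IDs that are heterozygous in the child.
    mother_alt / father_alt: variant IDs whose alternate allele is
    present in that parent.
    Returns a dict: variant ID -> "maternal", "paternal", or "unphased".
    """
    mother_alt, father_alt = set(mother_alt), set(father_alt)
    phase = {}
    for variant in child_het:
        in_mother = variant in mother_alt
        in_father = variant in father_alt
        if in_mother and not in_father:
            phase[variant] = "maternal"
        elif in_father and not in_mother:
            phase[variant] = "paternal"
        else:
            # Seen in both parents (ambiguous) or in neither (possibly de novo):
            # the trio alone cannot place this variant.
            phase[variant] = "unphased"
    return phase


# Hypothetical variant names, for illustration only.
example = phase_by_trio(
    child_het=["varA", "varB", "varC", "varD"],
    mother_alt={"varA", "varC"},
    father_alt={"varB", "varC"},
)
print(example)
```

The "unphased" cases show why this approach can only phase some variants.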
Recent advances in DNA sequencing technology and the tools used to assemble and phase genomes allow large blocks of the sequence to be phased directly from DNA sequencing reads of one individual. Highly accurate long reads, known as HiFi reads, are uniquely suited to phasing haplotypes as they provide the high accuracy needed to detect single nucleotide variants (SNVs) and the read length to connect these variants over a long range.
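To see why read length matters here, consider a toy version of read-based phasing: every read long enough to span two heterozygous sites reports which alleles it carries at both, and a simple vote decides whether the alternate alleles sit in cis or in trans. This is only a sketch under simplifying assumptions (two sites, error-free allele labels, a hypothetical `min_reads` threshold), not how any particular phasing tool is implemented.

```python
from collections import Counter


def cis_or_trans(spanning_reads, min_reads=2):
    """Vote on the configuration of two heterozygous sites.

    spanning_reads: one (allele_at_site1, allele_at_site2) tuple per read
    that covers both sites; each allele is "ref" or "alt".
    Returns "cis", "trans", or "undetermined".
    """
    counts = Counter(spanning_reads)
    # Reads carrying alt/alt or ref/ref support the two alts being on one copy.
    cis = counts[("alt", "alt")] + counts[("ref", "ref")]
    # Reads carrying one alt and one ref support the alts being on different copies.
    trans = counts[("alt", "ref")] + counts[("ref", "alt")]
    if max(cis, trans) < min_reads or cis == trans:
        return "undetermined"
    return "cis" if cis > trans else "trans"


reads = [("alt", "alt")] * 5 + [("ref", "ref")] * 4 + [("alt", "ref")]
print(cis_or_trans(reads))
```

If no read spans both sites, the list is empty and the result is "undetermined", which is the situation short reads often face when variants are far apart.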
Using HiFi reads, either alone or in combination with other technologies like Hi-C and Strand-seq, scientists have been able to produce phased genome assemblies of the rose, a complex tetraploid; the California redwood; and humans, including one of Puerto Rican descent, one of Korean descent, and a cognitively healthy supercentenarian. The phased genomes have each provided novel insights into functionally important variants.
Phasing Genes to Identify Allelic Configuration of Variants
Scientists analyzing variants in the PIK3CA oncogene found a compound mutation — a double mutation that appears to give breast cancer patients an overwhelmingly positive response to the targeted PI3Kα inhibitor alpelisib. By sequencing and phasing the entire gene, the researchers were able to show that having both variants on the same allele (cis) led to a super-responder phenotype; when those variants were on separate alleles (trans), that was not the case. This information will have clinical relevance for many cancer patients and would never have been known without the ability to phase sequence data.
For recessive disease genes, it also is critical to know whether two variants seen in a gene are in trans (thus breaking both copies of a gene) or cis (thus leaving one copy intact). For example, in the case of a 9-year-old boy with multiple types of cancer, phasing of the MSH6 gene revealed that both maternal and paternal alleles carried mutations resulting in constitutive mismatch repair deficiency syndrome.
Haplotype Phasing to Explore the Genetic Origins of Species
Researchers exploring apple domestication used haplotype-resolved assemblies of cultivated and wild species to better understand the genetic history of the crop. They were able to sequence and assemble full “haplomes” (haploid genomes) and showed high levels of heterozygosity with more than 20% of the Gala apple genome containing alleles derived from different wild progenitors, showing the Gala was hybrid in origin. Further, they found that introgression of new genes and alleles was a critical component to the domestication of the cultivar. This information provides better understanding of trait variability and will assist in efforts to breed for desirable traits like fruit weight and sweetness.
Allele Phasing to Resolve Variants Missed by Short Reads
Scientists assessing the promoter of the SLC6A4 gene, which is thought to play a role in psychiatric disorder susceptibility, found long-read sequencing critical for interrogating a low-complexity repeat region. The length of a repeat in the gene’s promoter affects gene expression levels. Phasing the repeat length with variants in the coding region of the gene indicates whether a coding variant will have high or low expression. The authors found the repeat region was missed by short-read approaches; long-read sequencing both characterized the repeat and unambiguously phased clinically significant variants that may improve pharmacogenetic testing.
How to Obtain Phasing Information with HiFi Reads?
Now that you’ve seen how phasing can provide valuable insights, here is how to obtain phasing information:
- Sequence an individual with HiFi reads, which have the accuracy needed to resolve differences and the long read length to phase large haplotype blocks.
- Use a diploid-aware assembler like IPA, hifiasm, or HiCanu for genome assembly.
- Detect variants with an accurate variant caller like Google DeepVariant and then phase haplotypes with WhatsHap.
- Combine HiFi data with additional technologies to extend haplotype phasing to the chromosome scale. HiFi data in combination with Hi-C or Strand-seq can phase entire genomes. If a family trio sample is available, short-read data from the parents can be used to separate HiFi reads into parental bins before genome assembly (HiCanu) or during genome assembly.
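The steps above might translate into command lines along the following lines. This is a hedged sketch rather than a recommended pipeline: the Python helper only builds and prints the commands, the file names and the `phasing_commands` function are hypothetical, and it assumes the HiFi reads have already been aligned to the reference in a separate step. The options shown reflect our understanding of each tool's documented usage and should be checked against the version you have installed.

```python
def phasing_commands(reads, reference, aligned_bam, prefix="sample", threads=16):
    """Build (but do not run) example commands for assembly, calling, and phasing."""
    # Diploid-aware assembly of the HiFi reads with hifiasm.
    assemble = ["hifiasm", "-o", f"{prefix}.asm", "-t", str(threads), reads]
    # Small-variant calling with DeepVariant's PacBio model
    # (commonly run from the project's container image).
    call_variants = [
        "run_deepvariant",
        "--model_type=PACBIO",
        f"--ref={reference}",
        f"--reads={aligned_bam}",
        f"--output_vcf={prefix}.vcf.gz",
        f"--num_shards={threads}",
    ]
    # Read-based phasing of the called variants with WhatsHap.
    phase = [
        "whatshap", "phase",
        "--reference", reference,
        "-o", f"{prefix}.phased.vcf",
        f"{prefix}.vcf.gz", aligned_bam,
    ]
    return {"assemble": assemble, "call_variants": call_variants, "phase": phase}


# Hypothetical input file names.
commands = phasing_commands("reads.fastq.gz", "ref.fa", "aligned.bam")
for step, command in commands.items():
    print(step, "->", " ".join(command))
```

Keeping the commands as lists makes it straightforward to hand them to a workflow manager or to `subprocess.run` once the paths are real.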
To learn more about how phasing could make a difference for your research, contact a PacBio scientist to get started with your next sequencing project.
Scientists at the Boyce Thompson Institute, Cornell University and the USDA Agricultural Research Service have reported significant progress in understanding the genomic features of domestic and wild apples. They used HiFi reads, highly accurate long reads, generated by the Sequel II System to build phased, diploid genome assemblies, as well as apple pangenomes to represent more of the remarkable genetic diversity in this lineage and better characterize its historic domestication.
We asked Fei about the highlights of the team’s efforts, and here’s how he summed it up: “We assembled phased diploid genomes of modern apple (Malus domestica) and the two major wild progenitors, M. sieversii and M. sylvestris using PacBio HiFi reads and Illumina short reads, and constructed pan-genomes of the three species. We inferred the genetic contributions of the two wild progenitors to the cultivated apple, and identified genome regions under selection during apple domestication and associated with important traits such as fruit size, texture and taste.”
The team focused on the tasty Gala apple, knowing that producing an accurate genome assembly would require more than short-read sequencing data.
“Most crop plants have complex genomes characterized by large size, high heterozygosity level and polyploidy,” they write in the paper. “The apple genome is highly heterozygous, posing a major challenge for earlier genome assemblies.”
To address those challenges, the scientists incorporated HiFi reads into their strategy, sequencing the Gala apple and its wild progenitors at coverages ranging from 37-fold to 81-fold. These HiFi reads were then assembled using hifiasm and HiCanu, respectively (read more about these and other options for HiFi assemblers in this blog post). Those results were merged with orthogonal data sets to create diploid genomes for each of the three apples, with final assemblies reaching about 1.3 Gb.
“Despite high heterozygosity rates (0.85–1.28%), all assemblies showed high contiguity, with the scaffold N50 of 3.3–4.3 Mb in diploid assemblies and 16.8–35.7 Mb in haploid consensuses,” they add.
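For readers unfamiliar with the N50 statistic quoted above, it is the sequence length at which sequences of that length or longer account for at least half of the total assembly. A minimal sketch of the calculation follows; the scaffold lengths are made up for illustration and are not from the apple assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort from longest to shortest and walk down the list until the
    running total reaches half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical scaffold lengths (arbitrary units).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 means more of the assembly sits in long, contiguous pieces.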
The extremely high quality of the final assemblies allowed the scientists to identify an error in previously published apple genomes associated with a 5 Mb inversion on Chromosome 1.
But the team also wanted to go beyond just one high-quality assembly for the Gala apple, pointing out that “a single reference genome can by no means represent a whole population.” To that end, they constructed a pangenome for each of the three apple types, using 91 accessions to capture natural genetic diversity. Through this work, they added between 89 Mb and 212 Mb of novel sequence data to each genome, covering thousands of new genes.
“Unlike annual crops such as the tomato, the pan-genome size of the cultivated apple is larger than that of wild progenitors, possibly due to the outcrossing nature and extensive introgression from wild species,” Sun et al. write. “This distinctive feature suggests that introgression of new genes/alleles is possibly a hallmark of crops domesticated through hybridization.”
One of the most important motivations for this study was to support apple breeding programs through a deeper understanding of trait variability.
“Traits introgressed in the hybrid are often not fixed and could be lost when propagated by seeds,” the authors note. “Understanding of the molecular basis of trait variability, which requires the knowledge of the diploid alleles, is critical for fixation of desirable traits in apple breeding.”
See additional examples of the use of SMRT Sequencing for the generation of pangenomes:
- Webinar Summary: Crops and Corvids get the Pangenome Treatment with HiFi Sequencing
- Pangenome of Soybean Generated to Capture Genomic Diversity
- Project to Rapidly Sequence Maize Pangenome Delivers Publicly Available Resource
- Sequencing 101: Looking Beyond the Single Reference Genome to a Pangenome for Every Species
- Case Study: Pioneering a Pan-Genome Reference Collection
Scientists at Yokohama City University Graduate School of Medicine and Osaka Women’s and Children’s Hospital have discovered a novel pathogenic variant associated with intellectual disability. They made the discovery using HiFi sequencing after previous short-read investigations failed to produce an answer.
In the journal Genomics, the team reports the case of 12-year-old monozygotic twin girls who exhibited developmental delays, severely drooping eyelids, and seizures since the age of 5 months. Clinical symptoms matched Dravet syndrome, but no molecular evidence was available to confirm that diagnosis. Their case had previously been analyzed with short-read exome sequencing, but no pathogenic variants were uncovered. Lead author Takeshi Mizuguchi, senior author Naomichi Matsumoto, and collaborators turned to HiFi sequencing and the Sequel II System “to search for variants that are unrecognized by exome sequencing,” they write.
While intellectual disability (ID) has been linked to variants in more than 500 genes, even the best analytical methods have a diagnostic success rate of less than 30%. “There are still many cases for which no molecular diagnosis has been possible,” the authors note. “Therefore, it is important to determine the molecular genetics of unsolved ID cases using new technologies.”
The scientists sequenced 15 kb size-selected libraries for one of the twins and both parents to generate highly accurate (>99% or Q20) long reads, known as HiFi reads. Next, the team used pbsv to call structural variants, and Google’s DeepVariant to call small variants and indels. This process highlighted hundreds of deletions, insertions, and duplications, plus seven inversions, in the twin’s genome that were potential de novo structural variants. “A 12-kb inversion disrupting the coding sequence of Bromodomain and PHD Finger containing 1 … immediately drew our attention,” the authors report, because variants in this region had been linked to an intellectual disability syndrome consistent with the twins’ symptoms. “Among the 16 possible de novo [structural variant] calls affecting RefSeq gene exons, no other genes were linked to an OMIM autosomal dominant disease entry,” they add.
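The “>99% or Q20” shorthand for read accuracy comes from the Phred scale, where Q = -10 × log10(error probability). The short sketch below converts between the two; the function names are ours, chosen for illustration.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (e.g. 0.99) to a Phred quality score."""
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred quality score (e.g. 20) to a per-base accuracy."""
    return 1 - 10 ** (-q / 10)


print(round(accuracy_to_phred(0.99), 2))   # 99% accuracy corresponds to Q20
print(round(phred_to_accuracy(30), 4))     # Q30 corresponds to 99.9% accuracy
```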
The 12 kb copy-neutral inversion was confirmed with Southern blot, which also showed that both parents and an unaffected older brother lacked the inversion. A breakpoint analysis found that “the two breakpoint junctions identified by Sanger sequencing and the pbsv inversion call were identical,” the team notes, “demonstrating the accuracy of HiFi long-read analysis.” The scientists also point out that not only was the inversion missed by exome sequencing due to its copy-neutral status and repetitive flanking sequence, but it also would have been missed by traditional chromosomal analysis, which has a lower limit of detection of 10 Mb. Finally, using the trio data with haplotype phasing, the team discovered that the inversion was a de novo variant on the maternally transmitted chromosome.
“Importantly, the current study demonstrates that inversions can now be accessed using an ‘unbiased-genomic’ strategy with no prior knowledge,” the authors write. “This state-of-the-art technology is advantageous for elucidating hitherto inaccessible genomic changes.”
A new publication from scientists in The Netherlands and Belgium offers tantalizing insights that may shed light on age-related neurodegenerative disorders. The team used SMRT Sequencing to produce a de novo diploid assembly of the genome of a Dutch woman named Hendrikje van Andel-Schipper, who died at the age of 115 with no signs of cognitive decline, and then performed a detailed analysis of variants detected. The data are publicly available to the scientific community.
The paper, released in Translational Psychiatry, comes from lead author Jasper Linthorst and senior author Henne Holstege (@HolstegeHenne) at Amsterdam Neuroscience and their collaborators. They aimed to identify structural variants (SVs) that could be associated with the onset of neurological disorders; for this, they performed a comparison between several previously available human genome assemblies which included the centenarian genome assembly.
The team chose long-read PacBio sequencing technology because they determined that “due to their repetitive nature, [SVs] are currently underexplored in short-read whole genome sequencing approaches,” they write. Repetitive regions, particularly repeat expansions that tend to grow larger over generations, have been shown to be pathogenic for a number of neurological diseases. “Using common sequencing approaches, the assessment of large repetitive regions is difficult because short 100-150 bp sequence-reads do not span the entire structural variant,” the authors report. “The solution to this problem is to generate longer sequencing reads.”
For this project, the scientists generated a de novo, phased genome assembly for the 115-year-old woman, which they refer to as W115. This was based on sequencing genomic DNA from three tissues and relied on FALCON-Unzip to create the diploid assembly of about 2.82 Gb. This information was compared to two haploid assemblies and the latest human reference genome to search for SVs of 50 bp or longer.
The scientists used a graph-based multi-genome aligner called REVEAL and found a total of 31,680 SVs. Nearly 70% were classified as variable number tandem repeats (VNTRs). “Interestingly, we observed that VNTRs in the subtelomeric regions were composed of longer repeat subunits than VNTRs outside the subtelomeric regions, and that they had a higher GC-content,” they report. Expanded VNTRs have been linked to faulty gene transcription. “The genes that contained most VNTRs, of which PTPRN2 and DLGAP2 are the most prominent examples, were found to be predominantly expressed in the brain and associated with a wide variety of neurological disorders,” the scientists add.
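To make the idea of a variable number tandem repeat concrete, the toy sketch below counts the longest run of back-to-back copies of a repeat unit in a sequence. It is not how REVEAL or any other tool detects VNTRs: it assumes exact copies of a known unit, whereas real repeats are usually imperfect, and the sequences here are invented.

```python
def longest_tandem_run(sequence, unit):
    """Return the largest number of consecutive exact copies of unit in sequence."""
    if not unit:
        raise ValueError("unit must not be empty")
    best = 0
    for start in range(len(sequence) - len(unit) + 1):
        copies = 0
        position = start
        # Count how many copies of the unit follow one another from this start.
        while sequence.startswith(unit, position):
            copies += 1
            position += len(unit)
        best = max(best, copies)
    return best


# Two invented alleles that differ only in repeat copy number.
allele_one = "TT" + "ACGT" * 3 + "GG"
allele_two = "TT" + "ACGT" * 7 + "GG"
print(longest_tandem_run(allele_one, "ACGT"), longest_tandem_run(allele_two, "ACGT"))
```

A read has to cover the whole run plus some flanking sequence to measure the copy number directly, which is why longer reads help with these variants.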
In addition, the team analyzed the list of structural variants to see how SMRT Sequencing had made a difference in detection. Using short-read data for the W115 genome only, they found just 5,826 SVs. About 83% of the SVs — that’s more than 18,000 variants — found in the PacBio assembly “were uniquely identified through long-read sequencing,” the scientists note.
The sequence data for this project was produced on a PacBio RS II system, but Holstege and her team have already acquired a Sequel II System for the next phase of this effort. That will involve a large study encompassing at least 150 cognitively healthy centenarians and 150 individuals with Alzheimer’s disease, with the goal of identifying VNTRs that have significantly different lengths between the two groups. Holstege and her team will be generating HiFi reads and they expect to cover each genome in the study with a single SMRT Cell. “We want to know what about these individuals makes them so special,” she told us.
Interested in learning more about this research? Watch Holstege’s presentation “Uncovering Neurological Disorders Through an Examination of VNTRs” now available on demand. Explore workflows and additional resources on comprehensive variant detection or structural variant detection.
It’s Breast Cancer Awareness Month, and we can’t think of a better way to celebrate than to honor the passionate scientist who has perhaps single-handedly done more to advance breast cancer research than anyone else alive: Mary-Claire King, discoverer of the BRCA1 and BRCA2 genes. In recognition of her lifelong contributions, King was just awarded the prestigious William Allan Award, the top prize presented annually by the American Society of Human Genetics to recognize substantial and far-reaching scientific contributions to human genetics, carried out over a sustained period of scientific inquiry and productivity.
In a recent publication in the Journal of Medical Genetics, King and her collaborators at the University of Washington combined CRISPR-Cas9 targeting with HiFi sequencing to reveal novel and biologically relevant mutations in the BRCA1 gene.
The effort was driven by a need to better characterize the well-known BRCA1 and BRCA2 genes in families with hereditary breast cancer. Short-read sequencing “is of limited use for identifying complex insertions and deletions and other structural rearrangements,” the scientists note. “The BRCA1 genomic region is particularly challenging for short-read sequencing. It is composed of 42% Alu repeats, the second highest proportion in the genome, and a 30 kb tandem segmental duplication spanning its promoter and first two exons.” To expand the clinical utility of information about these genes in the future, much research remains to be done to characterize the many variants missed by short reads.
For this study, scientists aimed to sequence the BRCA1 and BRCA2 genes from individuals representing 19 families with a history of early-onset breast cancer. All of these individuals had previously had these genes analyzed with gene panels and whole exome sequencing, but no pathogenic mutations were found that explained the early onset breast cancer susceptibility.
To target the two genes of interest, the team used the HLS-CATCH CRISPR-based targeting method from Sage Science, extracting 200 kb of high molecular weight libraries ideal for use with PacBio sequencing. HiFi sequencing was performed on the Sequel System, with average genomic fragment length of about 10,000 bases to fully cover the two BRCA loci, including non-coding elements.
In one case, this approach unlocked a novel variant to explain the family’s history of cancer. “We identified an intronic SINE-VNTR-Alu retrotransposon insertion that led to the creation of a pseudoexon in the BRCA1 message and introduced a premature truncation,” the scientists report. The retrotransposon was nearly 3 kb long. “Multiple long reads included all elements of the mutation and of wild-type flanking BRCA1 intronic sequence, so that the mutation’s position and the sequence were clear,” the authors note, adding that the variant segregated with breast cancer throughout the family. After identifying this tough-to-find type of variant, the authors confirmed that the intronic repeat element can affect the final BRCA1 message by sequencing cDNAs from matching patient cells.
Based on these findings, the team suggests that there may be many other pathogenic complex structural variants. “It is possible, even likely, that complex mutations are common at tumour suppressor genes,” they write. “We suggest that complex mutations have thus far been rarely encountered, because they are difficult to detect with existing approaches.”
King and her collaborators believe the approach they used will be important for continuing to uncover these variants. “The genomic approach described here, integrating CRISPR–Cas9 excision of critical loci with long-read sequencing, yields complete sequence of targeted loci and thus can detect all classes of complex non-coding structural variants,” they report. “This combination of CRISPR–Cas9 excision and long-read sequencing reveals a class of complex, damaging and otherwise cryptic mutations that may be particularly frequent in tumour suppressor genes replete with intronic repeats.”
Listen to King share the emotional and humorous story of the events leading to the funding of the project that resulted in the discovery of the BRCA1 gene – a true testament to her persistence and the constant challenge of balancing career and family, with a cameo from Joe DiMaggio!