This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Seven years after the ALS Ice Bucket Challenge soaked the world, the pace of discovery in sporadic amyotrophic lateral sclerosis has increased tremendously, with more than $115 million in donations funding research that has led to the identification of several genes implicated in both familial and sporadic cases of the neurodegenerative disease.
While the social campaigns have generated much-needed awareness around the disease, there are other challenges – one of which can be addressed with long-read sequencing.
As detailed in a new, interactive case study, PacBio SMRT Sequencing is helping researchers at the University of Washington unravel repeat regions of key genes linked to ALS and other disorders.
“We’ve made great strides over the past two decades in identifying genes and loci involved in ALS, mainly via GWAS studies,” said Paul Valdmanis (@pvaldmanis), assistant professor of medical genetics.
“We’re now at an exciting time where we have these new technologies available to allow us to identify novel risk factors. Through single cell sequencing and long read sequencing, in particular, we can ID some of these regions that were previously hidden or intractable to short read sequencing.”
Starting the search
Tandem repeats (and variable number tandem repeats, VNTRs) are snippets of DNA that are repeated multiple times within a gene, anywhere from a handful of times to more than a hundred. Sometimes these repeat sequences expand to long stretches, and these expanded repeats have been implicated in many diseases, including 40 linked to neurological disease.
Some of the open questions about these repeats include:
● How do they expand from a short repeat copy with four or five CAGs to an allele of over 60 base pairs?
● Does it make a difference where the repeat occurs–the beginning, middle, or end of the gene?
● What role does the internal sequence of these repeats play in disease pathogenesis?
In their search for answers, Valdmanis and postdoctoral fellow Meredith Course started with a multigenerational family that had several cases of ALS. Many of the family members had variants of a gene (FUS) linked to the disease, but not all of them exhibited symptoms. Why? Is there an additional genetic modifier that influences pathogenesis?
Using old-school linkage study approaches, the Valdmanis Lab identified a region on chromosome 18 that also seemed to play a role. All members of the family affected by ALS shared a 4-megabase segment of DNA within this region. So, they took a look inside this region to see if there were other risk factors. Then, they zoomed in further, using HiFi sequencing as their magnifying glass.
Homing in on the culprit
Through HiFi sequencing, researchers pinpointed a 69-bp VNTR in the WDR7 gene that was found to be enriched in individuals with ALS. The reference genome has about 6 copies of this repeat, but each individual in the ALS family had more than 30 copies.
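To make the copy-number comparison concrete, here is a minimal Python sketch of how one might count back-to-back copies of a repeat unit in a read. It is an illustration only, not the Valdmanis Lab's analysis method: the function name, the 8-bp motif, and the flanking sequence are made up for the example. The real WDR7 unit is 69 bp and its copies vary internally, so exact matching like this would undercount in practice.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back exact copies of `motif`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    best = 0
    for start in range(len(sequence) - len(motif) + 1):
        copies = 0
        position = start
        # Walk forward one motif length at a time while the motif keeps matching.
        while sequence.startswith(motif, position):
            copies += 1
            position += len(motif)
        best = max(best, copies)
    return best


# Hypothetical stand-ins: a short motif and flanks, not the real WDR7 sequence.
MOTIF = "ACGTTGCA"
FLANK = "GGGG"
reference_like = FLANK + MOTIF * 6 + FLANK   # 6 copies, similar to the reference
expanded = FLANK + MOTIF * 30 + FLANK        # 30 copies, the level the family exceeded

reference_copies = longest_tandem_run(reference_like, MOTIF)
expanded_copies = longest_tandem_run(expanded, MOTIF)

# With the real 69-bp unit, copy number translates directly into array length.
UNIT_BP = 69
reference_array_bp = 6 * UNIT_BP
expanded_array_bp = 30 * UNIT_BP
```

At 69 bp per unit, 6 copies span 414 bp while 30 copies span 2,070 bp, which is one reason a read has to be a few kilobases long to cover an expanded array in one piece.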
They also performed multiplexed barcoded sequencing to resolve the complete internal structure of the WDR7 repeat in 288 geographically diverse individuals, and found striking variability in both repeat length and internal nucleotide composition. Some of the 69-bp repeat motifs were specifically present or absent in certain geographic populations.
They created maps to help visualize the data, and started to notice patterns emerging.
“Every time we looked at this repeat, we learned more,” Valdmanis said.
They were able to identify features associated with repeat expansion dynamics, the mechanistic consequences of repeat expansions to ALS susceptibility, and the structure of repeats in geographically diverse populations.
Further investigation of 15 samples from the Human Genome Project that had undergone long read phased sequencing suggested that the WDR7 gene was not alone in terms of the extreme variability in the length of its tandem repeats. They explored many other genes, including NWD2, VPS53, SLC22A1 and ART, and discovered various categories of repeats.
“We truly believe that long-read sequencing of tandem repeats can provide a lot of information about both human evolutionary events as well as risk factors for neurodegenerative disease. And we believe that VNTR expansions can represent novel disease risk factors not only in ALS but in other neurodegenerative diseases as well.”
Read our newest groundbreaking case study to learn more: ‘How SMRT Sequencing Helped Researchers at University of Washington Uncover a Tandem Repeat Linked to ALS’
Interested in learning more about our technology and ALS research?
Watch Marka van Blitterswijk from the Mayo Clinic present, ‘Applying Targeted Long-read Sequencing to Assess an Expanded Repeat in C9orf72’
See Meredith Course from the University of Washington present, ‘The Evolution and Function of a Large Tandem Repeat Associated with ALS’.
Visit our Neuroscience Research page to learn how PacBio sequencing provides a comprehensive understanding of the genetic basis of neurological disease.
It’s a challenge that has haunted rare disease researchers for years: how to increase solve rates in rare and Mendelian disease. Currently, the genetic cause of more than half of rare disease cases worldwide remains unexplained.
In a series of talks and posters presented at the 2021 annual meeting of The European Society of Human Genetics (#ESHG21), PacBio experts and users described how HiFi sequencing could help close the gap by providing more comprehensive, accurate and high-definition coverage of the gaps in the human genome.
Here is a summary of the discussions that took place and the posters that were presented:
In the first workshop, Susan Hiatt (@suzieqhiatt) from HudsonAlpha Institute for Biotechnology discussed how she and her team used HiFi sequencing in their rare disease research to discover genomic variation missed by whole-exome or genome sequencing studies using short reads.
PacBio’s Chief Scientific Officer Jonas Korlach also provided a summary about how advancements in sequencing technology, coupled with improved analytical tools and databases, have made it possible to finally examine hard-to-reach regions of the human genome, with huge implications for rare disease research.
“PacBio continues this march of technology evolution, and soon more than half, and perhaps up to two-thirds of unexplained rare disease cases can be explained through HiFi sequencing,” – Jonas Korlach
He went on to say: “We believe that high quality whole genome sequencing is the future of medicine and that this will be the first of many impactful demonstrations.”
In a poster presentation, researchers at PacBio showed how there are many clinically important genes in “dark” regions of the human genome. They explained how these often exist because of a paucity of NGS coverage or mapping difficulties, often complicated by the presence of various repeat elements or segmental duplications.
View the poster to see how long-read sequencing coupled with a long-PCR targeted enrichment method has the potential to illuminate these dark regions, using examples from CYP21A2, responsible for congenital adrenal hyperplasia, and GBA, responsible for Gaucher’s disease.
In another poster presentation, experts outlined how many genetic diseases are mapped to structurally complex regions containing highly similar paralogous alleles (>99% identity) that span kilobases and how, as a result, comprehensive screening for pathogenic variants is incomplete and labor intensive using short reads or optical mapping.
View this poster to find out how long-range amplification and PacBio HiFi sequencing can fully resolve and phase a wide range of pathogenic variants with the help of a new amplicon analysis tool, pbAA.
And, in the final poster presentation, there was an insightful deep dive into personalized medicine. The poster reviews new discoveries about how the highly polymorphic CYP2D6 gene affects the metabolism of 25% of the most prescribed drugs on the market, and was presented to prompt discussion of why accurate identification of variant CYP2D6 alleles in individuals is necessary for personalized medicine.
View the poster to see how HiFi sequencing coupled with long-PCR targeted enrichment has successfully characterized 22 samples from a pharmacogenomics reference panel.
Comprehensive Detection of Variants in Unsolved Rare Disease Studies with PacBio HiFi Reads
Lastly, experts discussed how PacBio HiFi reads (99.9% accuracy, 15-25 kb) enable comprehensive variant detection in human genomes, extending to repetitive regions of the genome not accessible with short-read WGS (srWGS). They showed that HiFi reads match or surpass srWGS for single nucleotide variant and small indel detection while also improving detection of structural variants (SVs), with recall far exceeding that of srWGS.
This talk showcased how HiFi can be applied in a large-scale, reproducible way using an automated workflow, and how it performed on 80 rare disease cases unexplained by srWGS.
All in all, ESHG 2021 was full of insight, discovery and hope. With HiFi sequencing, we are excited about what the future holds.
If you would like to watch the workshop recording on demand, go here.
Or, if you’d like to learn more about highly accurate long-read sequencing and how it plays a role in human biomedical research, visit us here.
A new Nature Communications paper shows how scientists continue to make progress elucidating some of the most complex regions of the human genome by deploying long-read PacBio sequencing technology. In this case, lead author PingHsun Hsieh (@phhBenson), senior author Evan Eichler, and collaborators at the University of Washington resolved the TCAF gene locus and identified more than 100 kb that had been missing in the human reference genome.
Since the publication comes from the Eichler lab, it’s no surprise that the target genes in this project emerged in a segmental duplication (SD) region. The TCAF genes — which encode TRP channel-associated factors related to thermal sensing in a type of neuron — “originated from an ancient gene duplication event at the basal of mammalian phylogeny and remained single-copy genes throughout much of their evolution,” the scientists report. In humans, duplications of this region in the past 1.7 million years have led to more copies of TCAF1 and TCAF2.
Until now, this locus of the genome has remained intractable. “In the human reference genome GRCh38, TCAF1 and TCAF2 are embedded and span within a complex region of large, highly identical SDs (>99.5%) consisting of >250 thousand base pairs (kbp) in sequence and an annotated gap at chromosome 7q35,” the authors note.
Filling A Pesky Gap
In this project, the team paired PacBio sequencing with large-insert bacterial artificial chromosome clones to resolve the entire locus in eight humans as well as in chimpanzee, gorilla, and rhesus macaque, generating 15 haplotypes of the region and even comparing their results to those seen in ancient human genomes. They also used the Iso-Seq method to analyze gene expression in seven different tissues.
“We systematically explore the haplotype structure of the TCAF locus in order to study its diversity, annotate the genes, and infer its evolutionary history in the context of selection,” the scientists report.
“This study is one of the detailed genetic investigations of human-specific SDs shedding potential new insights into structural adaptations important in thermal regulation.”
Sequencing results revealed that TCAF paralogs were more than 99.7% identical, with sizes ranging from 10 kb to 60 kb. They also filled that pesky gap in the human reference genome by identifying the missing 103,616 bp. The team focused on haplotypes of the region. While the non-human primates had just one copy each of the TCAF1 and TCAF2 genes, the 12 resolved human haplotypes were quite different. “We identify five distinct haplogroups that carry one to three copies for the SD cassette, which range from 145–406 kbp in length,” they write.
Isoform Diversity Sheds Light on Ancestral Diversity
These haplotypes allowed the team to dive into annotation and analyze isoform diversity. Using the Iso-Seq method, they produced more than 480,000 full-length, non-chimeric transcripts from analyses of six human tissues and more than 50,000 from a chimpanzee cell line. In humans they found considerably more isoform diversity for TCAF2 than for TCAF1.
Perhaps most strikingly, though, the scientists found evidence of contrary patterns of selection.
“Our data support a model of two distinct forces of natural selection possibly operating on the same locus over the last half million years of hominin evolution,” they report.
“We propose that diversifying or balancing selection is likely acting in at least some human populations, particularly out-of-African populations such as Native Americans, to maintain and expand haplotype and structural diversity.”
The ancient human samples told a different story.
“In contrast, Neanderthal and Denisovan show a paucity of genetic variation, and while the sample size is still limited, this observation is unlikely to change with the sequencing of additional archaic genomes,” the scientists add. “We hypothesize that positive selection has reduced genetic diversity at the TCAF locus in these archaic hominin lineages.”
Interested in learning more about Iso-Seq Analysis? Go here.
At PacBio, we are passionate about accuracy in sequencing data. Our commitment to ensuring reliable results is why our HiFi reads are better than 99.9% accurate. Combine that with the length of those reads (up to 25 kb), and it’s no wonder that our sequencing data generates complete, contiguous, and correct assemblies for even the most complex genomes.
While we’re proud of these technical accomplishments, our favorite thing is seeing how HiFi reads empower scientists to make new discoveries and reach novel insights. To that end, we launched our HiFi for Accuracy SMRT Grant program earlier this year and now we’re pleased to announce the winners.
From a pool of hundreds of submissions, we selected three outstanding winners from around the world who will use HiFi sequencing provided by local service providers to advance their fields of study. Congratulations to these intrepid scientists and recipients of our HiFi for Accuracy SMRT Grant:
HiFi Sequencing to Understand Rare Diseases
Winner: Claudia Gonzaga-Jauregui (@cgonzagaj)
Institution: International Laboratory for Human Genome Research, National Autonomous University of Mexico
Project goal: Understand genetic disorders that remain unanswered even after deep characterization with other molecular tools and sequencing platforms.
“Many people’s rare genetic diseases remain unsolved even after applying molecular technologies like chromosomal microarray and exome sequencing. Implementing innovative technologies like PacBio long-read sequencing to look at variation genome-wide in rare disease research offers the opportunity to look at variants beyond the constraints of other technologies and help shed light on these medical mysteries. I am grateful to PacBio for awarding my new laboratory a SMRT Grant to use their technology to help enable rare disease diagnostics in Mexico.”
– Claudia Gonzaga-Jauregui
Sequencing for this project will be provided by the PacBio Certified Service Provider DNA Sequencing Center at Brigham Young University.
HiFi Sequencing to Understand the Evolution of an Iconic Species
Winner: Sven Winter (@zoologysven)
Institution: Senckenberg Biodiversity and Climate Research Centre
Project goal: Generate high-quality assemblies of two giraffe species to facilitate analysis of structural differences
“PacBio HiFi sequencing will facilitate the detection of structural variance among giraffe genomes with high accuracy.”
– Sven Winter
Sequencing for this project will be provided by the PacBio Certified Service Provider CCGA.
HiFi Sequencing to Explore how Metagenomes Impact Infectious Disease
Winner: Charlene Kahler (@charlene_kahler)
Institution: University of Western Australia
Project goal: Conduct metagenomic analysis of oropharyngeal samples collected from individuals carrying meningococcal disease to identify factors involved in infection
“HiFi sequencing is the perfect technology for undertaking 16S microbiome survey to find signatures in the oropharyngeal microbiome that aid or prevent meningococcal colonization.”
– Charlene Kahler
Sequencing for this project will be provided by the PacBio Service Provider BIOTOOLS.
Congratulations to our HiFi for Accuracy SMRT Grant winners! And thank you to our co-sponsors for teaming up with PacBio to make these SMRT Grants possible. Explore the 2021 SMRT Grant Programs for future opportunities to have your project funded.
They spoke about omentum, chemosynthesis, chromothripsis, and… Tasmanian devils? This year’s virtual two-day SMRT Leiden Scientific Symposium and Informatics Developers Meeting was certainly educational.
With the pandemic making it harder to connect in person, we wanted to provide a forum for young investigators, postdocs, and faculty to come together and share their research experiences during these abnormal times. The result? 27 speakers, the majority of whom were young investigators, sharing data and discoveries, and their advice for early-career scientists.
There was a great spectrum of presentations. The first keynote featured fun facts about Dominette (first Bos taurus to have her genome sequenced) and efforts to create a cow pangenome. Hubert Pausch of ETH Zürich discussed the pitfalls of reference-guided variant discovery and the benefits of genome graphs to overcome some of the biases of linear mapping.
The second keynote by 2019 Human Genetics SMRT Grant winner Tychele Turner (@tycheleturner) highlighted the value of HiFi reads in investigating neurodevelopmental disorders, including 9p minus disorder and autism. Turner said HiFi reads should become the new paradigm because of their ability to detect more variants, reveal novel variation, and phase even the most complex genes.
“PacBio long-read sequencing is ushering in a new era in human genomics.” – Tychele Turner, PhD, Washington University Genetics
In other sessions, early data from another SMRT Grant-winning project about neurological diseases with complex structural rearrangements was reviewed. Matthew Hestand of Cincinnati Children’s Hospital Medical Center was joined by University of Louisville researcher Corey Watson (@ctwatson29), who spoke about characterizing immunoglobulin haplotype diversity and its influence on the antibody repertoire, and by Gloria Sheynkman (@GSheynkman) of the University of Virginia, who discussed the integration of long-read RNA-Seq and mass spectrometry.
In another session, Harald Gruber-Vodicka (@GruberVodicka) of the Max Planck Institute for Marine Microbiology imparted some useful advice about sample preparation and assembly of the tiniest samples, and Jannat Ijaz (@sciencejannat) of the Wellcome Sanger Institute described catastrophic genome fragmenting events that can occur in cancer, and how she pieced together 900 fragments from esophageal organoids.
Lastly, in view of the pandemic and the challenges associated with it, we invited Melissa Smith (@SmithLab_UofL) to speak about work in her new lab at the University of Louisville, covering both COVID-19 surveillance and research into one of the biggest hurdles in HIV therapy, HIV “reservoirs.”
Pursue Your Passions
In addition to getting a look at the research being done around the world, we understand that the pandemic has been an unprecedented time for early-stage researchers. In view of these challenges, we decided to host a session dedicated to career advice, guidance, and open discussion. This ended up being one of the event’s most popular sessions, featuring valuable learnings from speakers.
Here are a few examples of the advice they shared:
On the art and craft of being a scientist, finding a mentor, and how to lead:
Choose your mentor over your scientific subject. Learning the craft and being the best scientist you can be is most important, and can then be applied to the subject you love. Also, don’t stymie your creativity. Practice it. Try to have one new idea every day; it doesn’t even matter if it involves science. – Jonas Korlach, PacBio CSO
Think about what kind of leader you want to be. You learn how to be a good scientist or bioinformatician, but no one tells you how to lead a group. Consider taking a course, or learn from your favorite leaders. — Susan Kloet, Leiden University Medical Center
On recognizing the talents we each possess, and using those talents to succeed in science:
Drive yourself. Don’t compare yourself to anyone else. Don’t rely on anyone else to set expectations and hold yourself accountable. Remove the word ‘should’ from your career vocabulary. — Melissa Smith
Be open to trying something new. And don’t be cowed by others’ successes. They’ve encountered challenges too. – Tychele Turner, PhD, Washington University Genetics
On being open to revision and excited by the chance to iterate:
Be happy if you see your papers returned covered in red (edits) – it means your supervisor cares. Don’t be frustrated. Take their advice and use it to improve. – Hubert Pausch, Prof. Dr., ETH Zürich
And, last but not least – on passion:
You can still make great contributions to science outside of academia. Even extracurricular activities like roller derby can teach you key soft skills like leadership and organization. Pursue all your passions. — Sarah Kingan, Senior Product Manager at PacBio
And, don’t forget to leave the lab every once in a while.
Bioinformatics: Top 10 Tools
Another exciting part of the event concentrated on bioinformatics. Serendipitously, these bioinformatics sessions coincided with the Telomere-to-Telomere Consortium’s release of the first ‘complete’ human genome. In her keynote address at SMRT Leiden 2021, consortium member Arang Rhie (@ArangRhie) gave a behind-the-scenes look at how the team sequenced a human genome in its entirety for the first time. This included characterizing the final unresolved 8% of the genome.
The other keynote by Tobias Marschall (@tobiasmarschal) of Heinrich Heine University Düsseldorf provided an interesting review of haplotype-resolved assemblies and the Human Genome Structural Variation Consortium.
In addition, speakers presented tools they have developed to get the most out of HiFi reads, including:
● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● jumboDBG – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)
By and large, SMRT Leiden 2021 was packed full of valuable discussion, behind-the-scenes info, and exciting revelations about the research being done via HiFi sequencing. Each session is available to watch on demand. Register for free now, before it’s too late.
As technology developers, one of our greatest joys is seeing how customers take our sequencing tools and deploy them for innovative and compelling new projects. Metagenomics has been one of those areas: our customers have recently been demonstrating the significant performance improvements enabled by our HiFi metagenome sequencing data and analysis pipelines.
But since much of that work is protected by HIPAA regulations or has not yet been published, we are now releasing a metagenomic data set to help scientists see how HiFi data can make a difference for these types of studies. This information is now available for review and analysis and can be used with existing tools or to help develop new ones.
The data set was generated from four fully consented, pooled human fecal microbiome samples made available through The BioCollective. Two samples came from vegan donors and two from omnivore donors, allowing us to see how diet influences gut microbiota. The pooling process, which creates a reference material by pooling samples from multiple donors (in this case four adults), leads to a more complex sample and a richer data set than can be obtained through mock community approaches. It also gives a more consistent composition than samples from an individual donation.
Long-Read Sequencing Produces Rich Profiling Information
HiFi sequencing gave us nearly 2 million reads per sample, with mean read length close to 10 kb for each. Median quality for the sequencing data was Q39 for two samples and Q40 for two samples. We found that species composition was consistent within diets and different between diets. Of the 76 bacterial species detected, 14 were exclusive to the omnivore samples and 21 were only found in the vegan samples.
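For readers less familiar with Q scores, the Phred scale relates a quality value Q to an error probability p by Q = -10·log10(p). The sketch below applies that standard formula to the figures quoted above. The function names are ours, and the 10 kb read length is the rounded mean reported for this data set.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (e.g. 0.999) to a Phred quality score."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)


for q in (39, 40):
    p = phred_to_error_probability(q)
    # Expected errors in a read of roughly the mean length reported here.
    print(f"Q{q}: error probability {p:.2e}, "
          f"about {p * 10_000:.1f} expected errors per 10 kb read")

print(f"99.9% accuracy is about Q{accuracy_to_phred(0.999):.0f}")
```

Q40 corresponds to one expected error in 10,000 bases, so a typical 10 kb read at that quality would carry about one error. The 99.9% accuracy figure often quoted for HiFi reads corresponds to Q30.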
There are a lot of exciting things to unpack in this data set. First, it demonstrates that our data analysis pipelines produce rich functional profiling information. Unlike analyses of short-read data, about 90% of HiFi reads have at least one functional annotation, with reads typically having two to five annotations. For each sample run on a single SMRT Cell 8M, we generated more than 8 million total annotations.
In addition, the data set highlights the advantage of high accuracy when assembling long-read data from metagenomes. These samples often contain closely related strains. A common cutoff for defining a distinct species is just 3%; if the difference between strains is less than the error rate, then the error correction process can erase the real differences needed to resolve and distinguish those strains.
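A back-of-the-envelope calculation shows why this matters. The sketch below compares the expected number of real strain differences in a read with the expected number of sequencing errors. All inputs are illustrative assumptions, not values from this data set: a 10 kb read, strains that differ at 1% of positions, and per-base error rates of 10% (a stand-in for noisy long reads) and 0.1% (Q30).

```python
def differences_vs_errors(read_length: int, strain_divergence: float,
                          error_rate: float) -> tuple[float, float]:
    """Return (expected true differences, expected sequencing errors) for one read."""
    true_differences = read_length * strain_divergence
    sequencing_errors = read_length * error_rate
    return true_differences, sequencing_errors


READ_LENGTH = 10_000       # assumed read length in bases
STRAIN_DIVERGENCE = 0.01   # assumed: strains differ at 1% of positions

for label, error_rate in (("noisy long read", 0.10), ("Q30 read", 0.001)):
    signal, noise = differences_vs_errors(READ_LENGTH, STRAIN_DIVERGENCE, error_rate)
    print(f"{label}: ~{signal:.0f} real differences vs ~{noise:.0f} errors "
          f"(signal-to-noise {signal / noise:.1f})")
```

Under these assumptions the noisy read carries roughly ten errors for every real difference, while the Q30 read carries roughly ten real differences for every error, so the strain signal survives without aggressive correction.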
This heightened ability to resolve strains is what drives the large number of high-quality metagenome-assembled genomes (MAGs) that can be recovered from a relatively small amount of HiFi data. For each sample, our assembly evaluation pipeline identified at least 56 — and as many as 69 — MAGs. The unique combination of high accuracy and long reads means that high-quality MAGs can be generated with less than 20-fold coverage, and many of those MAGs are represented in a single contig.
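As a rough guide to what 20-fold coverage means in a mixed sample, the expected depth for one organism is the total sequenced bases multiplied by that organism's share of the DNA, divided by its genome size. The sketch below uses the approximate yield reported above (about 2 million reads averaging about 10 kb). The 4 Mb genome size and the abundance values are hypothetical, chosen only to illustrate the arithmetic.

```python
def expected_coverage(num_reads: int, mean_read_length: int,
                      genome_size: int, abundance_fraction: float) -> float:
    """Expected fold coverage of one genome in a metagenome.

    abundance_fraction is the share of sequenced bases that come from this genome.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    total_bases = num_reads * mean_read_length
    return total_bases * abundance_fraction / genome_size


NUM_READS = 2_000_000       # approximate reads per sample, from the text
MEAN_READ_LENGTH = 10_000   # approximate mean read length, from the text
GENOME_SIZE = 4_000_000     # hypothetical bacterial genome size (4 Mb)

for abundance in (0.01, 0.004, 0.001):   # hypothetical abundances
    depth = expected_coverage(NUM_READS, MEAN_READ_LENGTH, GENOME_SIZE, abundance)
    print(f"abundance {abundance:.1%}: about {depth:.0f}-fold coverage")
```

With these made-up inputs, an organism making up 0.4% of the sequenced bases would sit at about 20-fold coverage.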
Listen to Daniel Portik talk about this new dataset in the first episode of our Metagenomics Webinar Series on demand here. We hope you get the chance to download the data and experience it for yourself.
Want to talk to us about this data set or have project ideas where you think HiFi data can make a difference? Hit us up on Twitter or reach out directly to our metagenomic specialists Meredith Ashby or Daniel Portik.
If you are interested in additional Metagenomics Webinars, register for upcoming episodes to learn about:
● Resolving viral evolution and quasispecies diversity,
● Identifying key players in host-microbiome interactions with high-resolution 16S sequencing, and
● Revealing mechanisms of bacterial virulence and adaptation
Here at PacBio, we have had the privilege of awarding many SMRT Grants to intrepid scientists who believe that HiFi sequencing data can help them achieve their goals. Recently, we invited people to apply for our Clinical Research SMRT Grant for projects with a link to potential clinical utility. We believe these projects could benefit tremendously from the value of HiFi reads, which offer both high accuracy and long reads to reveal genomic insights often missed by short-read sequencing.
Narrowing these applications down to just one winner is always challenging, but this time we found it to be impossible. So, for the first time, we set out to give one award and wound up making two. We are thrilled to announce the winners: one scientist at the start of her career and one well established in hers. We couldn’t be prouder to support the work of these two outstanding women and the questions they seek to answer.
Please join us in congratulating Danielle Brandes and Jenny Taylor on becoming our latest SMRT Grant winners! Here’s a look at what they plan to do with their awards.
Danielle Brandes, PhD Student
Institution: Pediatric Oncology, Medical Faculty, Heinrich Heine University Düsseldorf
Project Goal: Discover structural variants related to pediatric acute lymphoblastic leukemia that have been missed by other technologies.
Danielle’s proposal piqued our interest for many reasons. Acute lymphoblastic leukemia (ALL) is the most common childhood cancer. Scientists understand that cancer predisposition genes (CPGs) are part of the puzzle when it comes to genetic predisposition in leukemic patients. However, CPGs are just a piece of the story. A large portion of the genome is affected by structural variants. Unfortunately, when it comes to leukemia, little is known about the part structural variations play.
“Despite technical and analytical progress in the field of NGS, the landscape of structural variations remains largely unresolved. In this context, we are excited to see how PacBio HiFi long-read sequencing will complement our whole-genome optical mapping data set to elucidate potentially pathogenic SVs in our studies of acute lymphoblastic leukemia. This approach will give new insights on mechanisms of leukemic predisposition as well as to the spectrum of somatic structural variation in leukemia.”
— Danielle Brandes
Danielle’s team has performed whole-genome optical mapping (WGOM) to identify SVs in pediatric patient studies of high hyperdiploid or ETV6-RUNX1-translocated ALL. But there is more to be done. Through HiFi sequencing, Danielle hopes to detect additional SVs that might have been missed by, or could complement, the WGOM data she has been gathering.
With the help of the SMRT Grant, we are excited to see how Danielle will be able to use HiFi sequencing to generate an individual comprehensive germline/leukemia genome in the pursuit of pathogenic SVs in CPGs and somatically acquired events in ALL studies.
Jenny Taylor, Associate Professor
Institution: University of Oxford
Project Goal: Use HiFi sequencing to resolve structural variants and phase variants for a few participants in the UK’s 100,000 Genomes Project as a demonstration of how this approach could potentially help address unsolved disease cases.
Our second winner, Jenny Taylor, is a seasoned scientist with years of experience in her field. Now, she is hoping to use PacBio’s technology to add value to specific samples collected as part of the 100,000 Genomes Project to further understand rare diseases.
“I am delighted to be awarded this grant from PacBio that may help our lab support Genomics England to increase understanding of undiagnosed diseases for some of those who have been referred to the 100,000 Genomes Project.”
— Jenny Taylor
The UK’s 100,000 Genomes Project completed whole genome sequencing for 73,880 genomes from rare disease patients. Jenny is hopeful that further research can be done to investigate the pathogenesis of some of the unsolved rare disease research cases to which they have access. With HiFi sequencing, she hopes to undertake comprehensive variant detection in 3 genomes to provide proof-of-principle for the PacBio platform.
Congratulations to these two outstanding scientists! We couldn’t be more excited to see what comes of these projects and are honored to sponsor each of these scientists in their pursuit of discovery.
And thank you to our co-sponsor, Icahn Institute for Data Science and Genomic Technology, for teaming up with PacBio to make this SMRT Grant possible. Explore upcoming SMRT Grant Programs to apply to have your project funded.
A new paper from scientists at the Max Planck Institute offers a great look at how HiFi sequencing delivers significantly improved results for metagenome studies compared to short-read data. In this project, HiFi reads led to higher-quality assemblies with less coverage and gave more insight into these complex microbial communities.
In the PeerJ publication, lead author Taylor Priest (@taylorpriest2), senior author Rudolf Amann, and collaborators report the analysis of 11 seawater samples collected from the Fram Strait, which connects the Arctic and Atlantic oceans and offers a unique view of how climate change is affecting marine ecosystems.
Long-Read Sequencing in Marine Ecosystems Impacted By Climate Change
They performed metagenome sequencing of all samples with short-read technology and analyzed three of them with HiFi reads using the ultra-low library protocol on the Sequel II System. For the PacBio sequencing, all three samples were pooled and sequenced together on a single SMRT Cell, leading to 4-6 Gb of HiFi data per sample.
The PacBio data set yielded 128 metagenome assembled genomes (MAGs) from about 15 Gb of total data collected. In contrast, the short-read data set was about 10 times that size, but produced just 218 MAGs, or fewer than twice as many.
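To put those yields on a common footing, a quick back-of-the-envelope calculation helps. The sketch below uses only the approximate figures quoted above. The 150 Gb short-read total is an assumption derived from "about 10 times" the roughly 15 Gb of HiFi data, and the function name is ours, for illustration only.

```python
def mags_per_gb(mag_count: int, data_gb: float) -> float:
    """Return metagenome-assembled genomes (MAGs) recovered per gigabase sequenced."""
    if data_gb <= 0:
        raise ValueError("data_gb must be positive")
    return mag_count / data_gb


hifi_gb = 15.0                 # "about 15 Gb" of HiFi data (approximate)
short_read_gb = hifi_gb * 10   # assumption: "about 10 times that size"

hifi_yield = mags_per_gb(128, hifi_gb)
short_yield = mags_per_gb(218, short_read_gb)

print(f"HiFi:        {hifi_yield:.1f} MAGs per Gb")
print(f"Short reads: {short_yield:.1f} MAGs per Gb")
print(f"Ratio:       {hifi_yield / short_yield:.1f}x")
```

On these rough numbers, the HiFi data recovered close to six times as many MAGs per gigabase sequenced.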
“Of the species-representative MAGs recovered, those generated from the PacBio metagenomes had, on average, larger genome sizes, higher N50 values, and were less fragmented compared to those retrieved from Illumina metagenomes,” Priest et al. report.
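For readers less familiar with N50, it is the contig length at which contigs of that size or longer hold at least half of the assembled bases, so a higher value indicates a less fragmented assembly. Here is a minimal sketch of the calculation. The contig lengths are invented toy values, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Two toy assemblies with the same total size (800 units) but different contiguity.
fragmented = [100] * 8
contiguous = [500, 200, 100]

print(n50(fragmented), n50(contiguous))
```

Both toy assemblies contain the same number of bases, but the one built from fewer, longer contigs has the higher N50.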
The quality of the assemblies also allowed the researchers to simplify their metagenome assembly pipeline. Importantly, the authors note, “taxonomic reassembly was not performed for the PacBio dataset due to the high quality of generated MAGs from single metagenome assemblies.”
HiFi assemblies also showed strength for community composition analysis. In metagenomics, researchers often pull out the 16S rRNA gene sequences to identify the microbial members of a community. Unfortunately, short-read data is not well suited to this task.
“In this study, 84% of MAGs retrieved from the PacBio metagenomes contained at least one complete 16S rRNA gene sequence, highlighting another key advantage of using long Hifi reads.”
This advantage also extended to unassembled data. “A major restriction with short-read metagenomic sequencing is the limited capacity to accurately reassemble full length 16S rRNA genes,” the team notes. “With the advent of highly accurate long read sequences generated from PacBio sequel II (>99% accuracy), full length 16S rRNA genes can be retrieved from single reads without a need for assembly, thus circumventing previous limitations.”
HiFi Reads for Marine Metagenomes Reveal Phylogenetic Diversity
An analysis of these results gave the scientists an interesting look at microbial populations in an area where warming sea water is already having an impact. The recovered diversity encompassed 9 phyla, 11 classes, 27 orders, ∼51 families and ∼54 genera. The most species-rich taxa were the Flavobacteriales (41 species), Pseudomonadales (18 species) and Rhodobacterales (17 species).
This paper describes the team’s first use of HiFi sequencing for marine metagenomes, but it likely won’t be the last.
“We can conclude that HiFi read metagenomes derived from the PacBio Sequel II platform can greatly improve the number and quality of MAGs recovered, which will allow for further advancement in our understanding of the ecology of marine microbial communities,” they report.
Want to learn more about how HiFi Reads allow researchers to see metagenomes in high resolution? Visit our Microbial Applications page.
Have you heard about our Metagenomics Webinar Series?
Register now for the first episode in our series, highlighting what richer data and better assemblies reveal about metagenome structure and function. Or, stay tuned for the additional webinars in the series showcasing:
● How to resolve viral evolution and quasispecies diversity,
● Identifying key players in host-microbiome interactions with high resolution 16S sequencing, and
● Revealing mechanisms of bacterial virulence and adaptation
Hibernating bears have heart rates of 10-15 beats per minute, yet they do not develop congestive heart failure. Despite accumulating enormous amounts of fat and acquiring insulin resistance, they do not suffer metabolic diseases. And they maintain muscle strength in the near absence of weight-bearing activity.
If we could crack these feats of physiology, perhaps we could apply the knowledge towards therapeutic targets for the prevention and treatment of numerous human diseases.
The Project that Shed Light on the Metabolic Mystery of Brown Bears
Washington State University researchers have come several steps closer to characterizing the hibernation phenotype by analyzing differential gene expression and tissue-specific isoform changes between active, hyperphagic, and mid-hibernation physiological states in the brown bear (Ursus arctos).
Led by the lab of Joanna L. Kelley (@joannalkelley), the team first identified more than 10,000 genes differentially regulated in adipose (fat), liver and muscle tissues between active and hibernating states.
The project, which was supported by a PacBio SMRT Grant, involved sequencing and analysis of tissues from three bears using the Iso-Seq method for full length RNA transcripts. The Sequencing & Genotyping Center at the University of Delaware sponsored the grant and processed the RNA with the help of the center’s director, Bruce Kingham (@bkingham).
“Single Molecule, Real-Time Sequencing Iso-Seq is ideal for identifying the full-length isoforms that are differentially expressed between seasons,” the authors wrote.
By combining the Iso-Seq data across samples and replicates, they obtained a total of 6.1 million full-length HiFi reads. After running the long reads through analysis, mapping to the reference genome, and filtering for library artifacts, they obtained 76,071 unique, full-length isoforms ranging from 150 bp to 16.5 kilobases (kb).
They merged these isoforms with the existing reference transcriptome (which contained 30,263 genes encompassing 58,335 transcripts) and found a total of 31,829 genes encompassing 107,649 transcripts, thus greatly increasing the number of known transcripts.
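For a sense of scale, the before-and-after counts can be compared directly. This short sketch uses only the four numbers quoted above. The variable names are ours, and the per-gene averages are simple ratios rather than figures reported by the authors.

```python
# Counts quoted in the text: reference transcriptome vs. merged transcriptome.
ref_genes, ref_transcripts = 30_263, 58_335
merged_genes, merged_transcripts = 31_829, 107_649

new_genes = merged_genes - ref_genes
new_transcripts = merged_transcripts - ref_transcripts
pct_more_transcripts = 100 * new_transcripts / ref_transcripts

# Average transcripts per gene, before and after merging.
per_gene_before = ref_transcripts / ref_genes
per_gene_after = merged_transcripts / merged_genes

print(f"Added genes:       {new_genes:,}")
print(f"Added transcripts: {new_transcripts:,} (+{pct_more_transcripts:.1f}%)")
print(f"Transcripts per gene: {per_gene_before:.1f} -> {per_gene_after:.1f}")
```

In other words, the merge added 1,566 genes and 49,314 transcripts, an increase of roughly 85% in annotated transcripts.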
“Importantly, this merging of the reference transcriptome with the full-length transcriptome originating from samples of interest improves the reference and could lead to the discovery of differential isoform usage (DIU) that would otherwise be missed,” the authors note.
Tying Hibernation Biology to Human Health
Analysis of the data showed that metabolically active tissues vary dramatically in their isoform usage and underscored the complexity and importance of adipose as a dynamic tissue during hibernation. It demonstrated that both transcription and RNA processing play concerted roles.
“While differentially expressed genes have shed light on hibernation biology, determining genes where functionally distinct isoforms change between seasons is the next essential biology to uncover,” the authors wrote.
“Our study provides an unprecedented view into hibernation biology through the lens of RNA processing by producing a dataset that improved the annotation of the brown bear genome and reinforced the important role adipose plays in hibernation.”
Researchers have suspected that aspects of hibernation physiology might be applicable to solving certain types of human disease. Until now, however, little has been known about the role of differential isoform expression between a bear’s hibernation and active states.
SMRT Sequencing has allowed researchers at WSU to generate the most comprehensive analysis of isoform usage. This opens doors for further research into how traits such as seasonal insulin resistance and sensitivity, obesity, and urine production in bears during hibernation and activity can inform therapeutic targets for human disease.
See what happened when PacBio scientist (and now bear enthusiast) Michelle Vierra (@the_mvierra) joined the WSU team as they collected samples for the project from bears at the WSU Bear Center. She wrote an account of the experience on Medium, and filmed her meeting with Willow the bear in the video below.
For more information about the SMRT Grant and how you can apply – go here.
If you’re interested in how to identify novel gene isoforms or have questions about Iso-Seq analysis, visit our RNA Sequencing page.
Been itching to talk about your latest single-cell experiments, your favorite differentially expressed isoforms, or your latest and greatest software for visualizing alternative splicing, but thwarted by a worldwide pandemic preventing in-person scientific events?
We were too, so we organized a virtual social club to easily enable scientists to geek out together. And we weren’t disappointed by our first event, which attracted dozens of self-proclaimed Iso-Seq analysis geeks and other curious researchers to share their work (published, unpublished and in progress) and discuss the benefits and challenges of incorporating long-read transcript sequencing into their research.
Welcome to the Iso-Seq Analysis Universe
PacBio’s own Iso-Seq analysis expert, Elizabeth Tseng (@magdoll), kicked off the Iso-Seq Social Club with an introduction to the method, which uses PacBio’s HiFi reads to characterize full-length transcript isoforms. The Iso-Seq method has been used to identify aberrant splicing in genetic diseases and characterize alternative promoter usage in cancer, and it is making its way into the single-cell space for studying subregions in postnatal mouse brains and even ant brains!
But none of these studies is possible without proper tools, and as attendees learned, developing bioinformatics tools specifically for long-read transcriptome data is a bustling field.
Francisco Pardo-Palacios (@FJPardoPalacios) and Ángeles Arzalluz Luque (@aarzalluz_), both from the Ana Conesa lab at Universitat Politècnica de València, presented the trilogy of SQANTI, IsoAnnot, and tappAS, which takes the output from the PacBio Iso-Seq analysis through classification, functional annotation, and differential analysis. Many of these tools are now becoming the standard workflow for Iso-Seq studies.
Fairlie Reese (@FairlieReese), a PhD candidate from UC Irvine, presented her tool, Swan. It provides a graphical representation of alternative splicing events, but can also be used to detect differential isoform usage and isoform switching events.
The Hunt For Differentially Expressed Isoforms In Bears… and Brains
Using Iso-Seq data on brown bears during hibernation and active seasons, Joanna Kelley (@joannalkelley), associate professor at Washington State University, discovered that fat tissue had higher levels of differential isoform usage (DIU) than liver and muscle tissues.
“Genes that show no change in expression levels but show major isoform switching and differential isoform usage are the ones we’re most interested in, because those are isoforms that we can’t quantify in any other way,” Kelley said.
Jack Humphrey (@JackHumphrey_), a postdoc in the Towfique Raj lab at Mount Sinai, is using Iso-Seq analysis to study complex splicing in genes associated with Alzheimer’s disease risk. Humphrey shared data from 30 post-mortem isolated microglia they collected. He also presented the processing pipelines for annotating and classifying the Iso-Seq transcripts, with an emphasis on filtering potential library artifacts – an often neglected but critical aspect of any bioinformatics work. Using a combination of existing tools and custom filtering, Humphrey showed that the curated transcriptome is high-quality and has already revealed interesting splicing events not observed with short-read data.
Single-Cell Iso-Seq Method for Precision Oncology and Hematopoietic Lineages
Arthur Dondi (@ArthurDondi), a PhD candidate from ETH Zurich, is using single-cell Iso-Seq (scIso-Seq) to study ovarian cancer. Specifically, by characterizing full-length isoforms in the omentum (fatty tissue covering the abdomen), there’s a potential for discovering neoepitopes and therapeutic targets.
Dondi and collaborators used the HIT-scIso-Seq technique, which applies TSO artifact removal and concatenation to cDNA molecules coming out of the 10X single-cell platform, and increased the number of reads per SMRT Cell 8M sixfold. They plan to query this rich dataset for differential isoform expression, novel isoforms and fusion discovery.
Vladimir Souza from University of Zurich is working on calling variants from Iso-Seq data, showing that using DeepVariant or GATK with specific parameters achieved the highest precision-recall. The goal of his project is to eventually link the variations to changes in ORF predictions.
Anita Scoones (@AnitaScoonesPGR), a PhD candidate from the Earlham Institute, is studying lineage bias during hematopoietic stem cell differentiation. She wants to use single-cell Iso-Seq analysis on their plate-based single-cell libraries, similar to how her lab mate Laura Mincarelli had used long reads to look at isoform differences in aging mice.
Anne Deslattes Mays (@adeslat) and Marcel Schmidt of Georgetown University had previously used bulk Iso-Seq analysis to show that lineage-negative cells in bone marrow have higher isoform complexity than lineage-positive cells. They are now pushing the question into the single-cell space: is isoform diversity uniform at the single-cell level in lineage-negative cells? Applying the scIso-Seq method, they found striking differences between the total and lineage-negative bone marrow subpopulations: lineage-negative cells had an overwhelmingly high number of novel isoforms and were enriched in spliceosome-associated genes. This suggests that alternative splicing in lineage-negative cells is tied to the cell-fate decisions of each cell subpopulation.
What’s Next For Iso-Seq Analysis?
The event ended with a lively discussion in which attendees discussed the need for bioinformatics tools that can handle large amounts of Iso-Seq data and create reproducible workflows that others can easily adapt. They also addressed the one-size-fits-all approach of using a single reference annotation and said a re-think may be in order.
“Maybe references should be qualified by the tissues or cell types of interest,” suggested Ana Conesa (@anaconesa). “How do we use all these novel isoforms to annotate the transcriptome?”
Mays agreed that “the best reference is self.”
In neuroscience, scientists have a poor idea of what makes a cell type-specific isoform, Humphrey said. The challenge is agreeing on what a definitive reference for each cell type would be, he added.
“We’re not done at just references,” Schmidt suggested. “We need to assign a function to these isoforms, even if it’s a regulatory one.” And Conesa said a system level of analysis is necessary.
Overall, enthusiasm for Iso-Seq analysis was consistent throughout the event. The promise of a properly defined transcriptome summed up the conversation and paves the way for future discussion.
Want to learn more? Register to watch an on-demand recording of the event, or check out these resources:
PacBio Applications and Workflows
RNA Sequencing with Iso-Seq Analysis
Procedure & Checklist
The new kid on the PacBio block — The Sequel IIe System — has been receiving high marks from universities and sequencing centers around the world.
What’s it like using the instrument, which was introduced in October 2020? Several users have spoken about their experiences in a series of recent online events.
Launching PacBio Sequencing Services in a New Lab
Melissa L. Smith (@SmithLab_UofL) spoke about her experience transferring her lab from New York City to the “PacBio naive” Bluegrass State in the Unleashing the Power of HiFi webinar.
Smith admitted she faced some initial challenges in establishing her lab. Chief among them were compute capacity, data storage, ancillary equipment and staff expertise. Luckily, she was able to leverage existing campus resources to overcome many of those hurdles.
As for the computing needs, “The Sequel IIe changed everything,” she said.
Her favorite feature? On-instrument data processing, which has solved many of her compute capacity and data storage challenges. Plus, it has eliminated the need to queue or compete for compute resources with others across campus.
“The data coming off it is already collapsed, error corrected, and just 50-100 Gb, compared to ~1 TB from Sequel II,” she said.
The Sequel IIe System is not only supporting her research into immunology and infectious disease, it’s also part of a sequencing core lab, and one of PacBio’s newest Certified Service Providers. In addition to the standard sequencing pipelines, the lab will be doing assay development, SARS-CoV-2 sequencing with the new HiFiViral protocol, and other customized sequencing solutions.
Powering a Wide Range of Sequencing Applications
At the SciLifeLab in Uppsala, Sweden, the Sequel Systems are used for a whole spectrum of applications, from de novo genome assembly to BACs, YACs and filling gaps, Olga V. Pettersson told webinar attendees. Her team has been working with PacBio sequencing since 2013, initially with the PacBio RS II, and they recently upgraded to a Sequel IIe System.
In 2020, they sequenced more than 200 non-model eukaryotic genomes (around 700 individuals total), with many reaching the high quality standards of the Earth BioGenome Project.
Pettersson is also a fan of the Sequel IIe System’s advanced computing capabilities, saying it has led to a 20-fold reduction in data storage needs. HiFi reads have also helped shed light on hard-to-access “dark” regions of the human genome, she added.
Q&A with Genomics Core Facility Directors
Pettersson also appeared on a panel of expert users in SMRT Sequencing as a Service – How to Bring Long-Read Technology to Your Core Lab.
She shared tips for sample prep, instrument handling, and business planning, as well as some of the advantages of the Sequel IIe System.
“When the DNA is sufficient, we always prefer to go with PacBio because it’s so much easier, with bioinformatics off the shelf, reads of higher quality, and no need for additional polishing,” Pettersson said.
Other panelists, including Bruce Kingham of the University of Delaware Sequencing and Genotyping Center, said the Sequel System and its HiFi reads have become “the platinum standard for long read sequencing,” with extremely high demand among their users.
“There’s really no other data type like HiFi,” added Charlotte Harris, research lab supervisor at Corteva Agriscience. “Throughput has been a huge win for us. It’s allowed us to take on these much larger and more complex projects, and really benefit our profit margins.”
Want to learn more? Watch the on-demand webinar to hear from Melissa L. Smith and Olga Pettersson how the Sequel IIe System is making it easier than ever before to get started with HiFi reads or add capacity.
Want to discuss the benefits of HiFi sequencing and the Sequel IIe System for your research? Connect with a PacBio Scientist.
Interested in becoming a service provider? Visit the Sequencing for Service Provider page.
Rice was the first crop to have its genome sequenced, almost two decades ago. However, the rice reference has never been truly complete. Even improved versions of the reference for Oryza sativa, a major food staple and breeding model system, have contained gaps and missing sequences.
An international team of scientists from China, the United States and Saudi Arabia has finally closed those gaps to produce two gap-free reference genome sequences of the elite O. sativa xian/indica rice varieties Zhenshan 97 (ZS97) and Minghui 63 (MH63).
How Long-Read Sequencing Fills the Gaps
As reported in Molecular Plant, Jianwei Zhang (Huazhong Agricultural University, Wuhan), Jesse Poland (Kansas State University), Rod Wing (Arizona Genomics Institute and KAUST), et al. were able to drill down to the centromere level, discovering more than 395 non-TE genes located in centromere regions, of which ~41% are actively transcribed.
Previous references released in 2016 saw 10% of the genome still unassembled/unplaced, and an update in 2018 left eight and seven gaps in the ZS97 and MH63 genomes, respectively.
“To bridge all remaining assembly gaps across each genome, we incorporated high-coverage and accurate long-read sequence data and multiple assembly strategies,” the authors wrote. These strategies included both CLR and HiFi sequencing modes.
Hi-C and Bionano maps were used to validate the quality of the assemblies, and FISH and ChIP-Seq assays were utilized to discover and characterize the location and primary structure of centromeres.
The new assemblies achieved a 99.88% BUSCO score and LTR assembly index (LAI) values that meet the standard of gold/platinum reference genomes. In addition, more than 1,500 rRNAs were identified, compared to tens in the original assemblies.
The last closed gaps in the assemblies were all in centromere regions. Centromeric regions, while critical for fidelity and segregation of chromosomes, are largely inaccessible to breeding due to greatly reduced recombination, particularly in larger genomes, the authors noted.
“The detailed understanding of centromere architecture and gene content, therefore, affords insight into the challenge of developing favorable allele combinations in the absence of natural recombination, using hybrid complementation, gene editing, or even precisely inducing recombination,” they wrote.
With its high accuracy and repeat-spanning reads, PacBio HiFi long-read sequencing was a “great resource for the assembly of complex heterozygous regions and centromeres,” the authors stated.
What the Rice Reference Genome Means for the Future
The large 10-fold variation in the number and distribution of centromeric repeats across the different chromosomes and between the genomes gives a detailed picture of the large amount of centromeric diversity both within and among plant genomes.
The new references provide a clear picture of the primary sequence architecture of the xian/indica rice genomes that feed the world, and could help in the breeding of climate resilient varieties, the authors concluded.
“Such resources will serve to develop a fundamental and comprehensive model for the study of heterosis, and other basic and applied research, and leads the path forward to a new standard for reference genomes in plant biology,” they wrote.
Interested in reading more about long-read sequencing and HiFi reads? Check out our Plant and Animal page to learn more about how they empower insect biology, crop improvements, animal health and breeding and more.
Rare diseases are defined as diseases that affect a small number of people: fewer than 1 in 2,000 in the European Union and fewer than 200,000 total people (about 1 in 1,500) in the United States. For example, Tay-Sachs disease affects about 1 in 300,000 people, while cystic fibrosis is more common and affects about 1 in 10,000. Though individual rare diseases affect very few people, collectively they are common and affect over 300 million people worldwide.
Advances in Sequencing Technology for Improved Understanding of Rare Diseases
With more than 70% of rare diseases being genetic in origin, scientists around the world have deployed genomic technologies to identify their causal mechanisms. Improvements in the technologies for identifying genetic variation have increased scientists’ ability to understand rare diseases. Learn more about the evolution of DNA sequencing tools.
Karyotyping was the first technology to provide a view of the genome, revealing diseases due to chromosomal abnormalities such as Turner Syndrome (1 chromosome X instead of 2 in a female). Later, microarray provided a higher-resolution view, identifying large copy number variants, as in DiGeorge Syndrome (caused by a deletion of around 2.5 Mb on Chromosome 22). Exome or whole genome sequencing based on short-read sequencing platforms enabled even more progress by detecting single nucleotide variants (SNVs), insertions and deletions, and some larger variants.
But even whole genome sequencing with short reads finds a genetic cause in less than half of all instances of rare disease, leaving the causes of many rare diseases unknown. This is partly because short reads do not provide a comprehensive view of variation.
Fortunately, more recent advancements have led to the introduction of long-read sequencing, which has enabled sequencing of the whole human genome – every single base – so that all types of variants can be detected from SNV up to large structural variants (SVs). Ultimately, by detecting more variants, long-read sequencing provides a more complete picture of the genome and any abnormalities that may exist.
In case you missed it: Reaching a Genomics Milestone – The First Complete Human Genome.
What’s the Difference between Short-Read Sequencing and Long-Read Sequencing?
As their names suggest, short-read sequencing looks at DNA in short snippets (100-350 base pairs) while long-read sequencing measures long fragments of DNA (tens of thousands of base pairs). Why does that matter? Well, when trying to characterize a human genome that has two copies (one maternal and one paternal), each 3.2 billion base pairs in length, having longer snippets of DNA means you:
- Need fewer snippets to make up the length of the whole genome and have no gaps where the sequence is unknown
- Can more easily map how one region of the genome is connected to another region
- Have the ability to phase or determine which copy of a gene, maternal or paternal, a mutation occurs in
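The first point in the list above is easy to quantify. The sketch below estimates how many reads it would take to tile one 3.2-billion-base copy of the genome exactly once, end to end. The read lengths of 150 and 15,000 bases are illustrative assumptions chosen from within the ranges mentioned above, and the calculation is idealized: real projects sequence each position many times over, and reads land at random rather than tiling neatly.

```python
GENOME_BP = 3_200_000_000  # one copy of the human genome, per the text


def reads_for_single_pass(read_length_bp: int, genome_bp: int = GENOME_BP) -> int:
    """Minimum number of non-overlapping reads needed to cover the genome once."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    # Integer ceiling division avoids floating-point rounding on large numbers.
    return -(-genome_bp // read_length_bp)


short_reads = reads_for_single_pass(150)      # assumed short-read length
long_reads = reads_for_single_pass(15_000)    # assumed long-read length

print(f"Short reads needed: {short_reads:,}")
print(f"Long reads needed:  {long_reads:,}")
```

With these assumed lengths, the long reads need about one hundredth as many pieces to cover the same genome, which is part of why assembling them leaves fewer gaps.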
As it turns out, the genetic variants underlying many of these diseases are exactly the types that short-read sequencers are least able to detect. From repeat expansions to large deletions or insertions, pathogenic variants are often large and complex structural elements that cannot be spanned by short reads of just a few hundred bases. Representing these variants accurately — and capturing all types of variants — requires much longer sequence reads that cover the entire variant in a single stretch.
HiFi Sequencing – the Key to Seeing All Variant Types Involved in Rare Disease
Unlike the data produced by short-read sequencing platforms, highly accurate long-read sequencing, known as HiFi sequencing, generates long reads (up to about 25 kb) that can span many large structural variants in a single read. HiFi sequencing provides the most comprehensive view of variation in a genome, identifying the variation found with short reads and detecting the larger and more complex variants that short reads miss.
The long reads and high accuracy (>99.9%) of HiFi sequencing provide very complete genome assemblies, comprehensive variant detection with base-pair resolution, and phasing to represent maternal and paternal haplotypes.
Unlocking the Secrets of Rare Diseases with HiFi Sequencing
HiFi sequencing has already made a substantial difference in rare disease research by identifying variants that were missed by short-read sequencing and other technologies. For more detail, check out these research studies of undiagnosed rare diseases and the types of pathogenic variants underlying them.
Structural Variant Calling in Rare Disease Studies
One of the earliest examples of how PacBio sequencing technology could play a role in rare disease research came from the Stanford lab of cardiologist Euan Ashley (@euanashley) and a young man who had suffered a series of tumors in his heart and glands. Eight years of genetic analyses had produced no firm answers. Ashley’s team used a new PacBio whole genome sequencing method to find a previously unreported structural variant in a gene associated with Carney syndrome, which was later validated as the correct finding.
More recently, a group at HudsonAlpha found new evidence in the study of a young girl with intellectual disabilities, seizures, and speech delay. With HiFi sequencing, the scientists at HudsonAlpha identified a de novo heterozygous insertion of nearly 7,000 bases in an intron of the CDKL5 gene that they deemed likely pathogenic. Since CDKL5 has been associated with early infantile epileptic encephalopathy 2, a condition characterized by many symptoms experienced by the proband, “we prioritized this event as the most interesting candidate variant,” the authors reported.
Structural variants are generally classified as being >50 bp in length and include insertions, deletions, duplications, copy-number variants, inversions, and translocations. Learn more.
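As a rough illustration of that size-based definition, the sketch below sorts a variant into a category by comparing the lengths of its reference and alternate alleles. It is a simplified, hypothetical helper: it handles only sequence-resolved insertions and deletions, takes the >50 bp cutoff from the text as an assumption (exact thresholds vary between tools), and does not attempt inversions, translocations, or other rearrangements.

```python
SV_MIN_BP = 50  # size cutoff quoted in the text; an assumption, tools differ


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant by allele lengths (insertions and deletions only)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "substitution"  # multi-base change with no net length difference
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "structural" if size > SV_MIN_BP else "small"
    return f"{scale} {kind}"


print(classify_variant("A", "G"))                # single-base change
print(classify_variant("A", "ATTT"))             # 3 bp insertion
print(classify_variant("A", "A" + "T" * 7000))   # ~7 kb insertion (toy sequence)
print(classify_variant("A" + "C" * 120, "A"))    # 120 bp deletion (toy sequence)
```

The third call mimics the scale of an event such as a roughly 7,000-base insertion, far beyond what a read of a few hundred bases could contain.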
In Japan, researchers deployed HiFi sequencing to find the cause of an undiagnosed syndrome in twin 12-year-old girls. Clinical symptoms matched Dravet syndrome, but no molecular evidence was available to confirm that finding. They sequenced one of the twins and both parents, identifying a novel 12 kb inversion in a region that had previously been associated with the same symptoms affecting the girls.
In one last structural variant example, Kristen Sund (@kristen_sund) from Cincinnati Children’s Hospital identified a 13 Mb complex rearrangement that appears to be responsible for a movement disorder in a 17-year-old with chorea, myoclonus, anxiety, and hypothyroidism. The variant was found in the NKX2-1 gene.
Small Variants in Challenging Regions of the Genome
For an individual with lissencephaly (lack of folds in the brain), developmental delay, and seizures, scientists at Children’s Mercy Kansas City used HiFi sequencing to reveal a pathogenic variant in a region that proved difficult for short reads to represent accurately. Unlike the short-read data, which showed coverage dropout in that region, HiFi sequencing provided even coverage, which made it possible to spot the key variant.
Capturing the Full Length and Sequence of Repeat Expansions
Repeat expansions have previously been shown to cause a range of diseases and can be tough to characterize accurately with short-read sequencing tools. HiFi sequencing can get through even very long expansions. Recently, scientists from Adelaide Medical School and the Robinson Research Institute linked the expansion of an ATTTC repeat in the first intron of STARD7 with familial adult myoclonic epilepsy.
Repeat expansions are mutations that result in repeating sequence that may extend for hundreds to thousands of bases. For example, the trinucleotide repeat expansion that causes Huntington’s disease consists of dozens of CAG repeats, generally 36 or more.
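When a single accurate read spans an entire expansion, estimating the repeat count can be as simple as counting back-to-back copies of the motif within that read. The sketch below shows the idea on an invented toy sequence. It is not how any particular analysis tool works, and it ignores real-world complications such as interrupted repeats and sequencing errors.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Invented toy read: flanking sequence, 42 CAG copies, then more flanking sequence.
toy_read = "GGTC" + "CAG" * 42 + "CAACAGCCGCCA"

print(longest_repeat_run(toy_read, "CAG"))
```

A short read that ends partway through the repeat cannot give this answer, because it never sees both flanks of the expansion.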
Phasing Rare Disease Variants Across Alleles
Phasing involves separating maternally and paternally inherited copies of each chromosome into haplotypes to get a complete picture of genetic variation. Learn more.
Back at Children’s Mercy Kansas City, researchers analyzed the genome of a four-year-old girl with hepatosplenomegaly whose parental genomes were not available. The individual was believed to have Niemann-Pick disease Type C, but more data was needed to support the theory. HiFi reads showed two key variants located on different alleles of the relevant gene; with the phased variants, scientists were able to confirm the original finding.
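The logic behind read-based phasing of two variants can be sketched in a few lines. If reads are long enough to cover both variant sites, each read reports whether it carries the first variant, the second, both, or neither. Reads carrying exactly one of the two point to the variants sitting on different copies of the gene (in trans), while reads carrying both or neither point to the same copy (in cis). The example below is a simplified illustration with invented read counts, not the method used in the study, and real phasing tools account for sequencing errors and many sites at once.

```python
from collections import Counter


def phase_two_variants(reads):
    """Decide whether two heterozygous variants are in cis or in trans.

    `reads` is an iterable of (has_variant_a, has_variant_b) booleans,
    one tuple per read that spans both variant sites.
    """
    counts = Counter(reads)
    # Both-or-neither reads support the variants sharing one haplotype.
    cis_support = counts[(True, True)] + counts[(False, False)]
    # Exactly-one reads support the variants lying on opposite haplotypes.
    trans_support = counts[(True, False)] + counts[(False, True)]
    if cis_support == trans_support:
        return "ambiguous"
    return "cis" if cis_support > trans_support else "trans"


# Invented toy data: 17 reads carry exactly one variant, 1 read carries both.
toy_reads = [(True, False)] * 9 + [(False, True)] * 8 + [(True, True)]

print(phase_two_variants(toy_reads))
```

For a recessive condition, the trans arrangement is the one that matters, since it means both copies of the gene are affected. Long reads can make this call without parental samples.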
The Future of Rare Disease Research is Bright
Scientists around the world are striving to improve the lives of those affected by rare diseases, translating the latest research approaches and high-quality genomic data into insights that could enable the development of improved diagnostics for rare diseases. As HiFi sequencing continues to shed light on more areas of the genome, it should have a profound effect on our ability to diagnose, understand and ultimately improve treatment for the rare disease community.
To learn more about how PacBio HiFi sequencing is helping advance our understanding of rare disease, watch on-demand presentations from our Virtual Rare Disease Week event or visit our rare disease resource page.
Explore Other Posts in the Sequencing 101 Series
- The Evolution of DNA Sequencing Tools
- Understanding Accuracy in DNA Sequencing
- Webinar: How Long-read Sequencing Improves Access to Genetic Information
- Introduction to PacBio Sequencing and the Sequel II System
- From DNA to Discovery – The Steps of SMRT Sequencing
- DNA Extraction – Tips, Kits, & Protocols
- Sequencing 101: Ploidy, Haplotypes, and Phasing – How to Get More from Your Sequencing Data
- Looking Beyond the Single Reference Genome to a Pangenome for Every Species
- Why Are Long Reads Important for Studying Viral Genomes?
- What’s the Value of Sequencing Full-length RNA Transcripts?
An exciting new paper from scientists at the National Institute of Allergy and Infectious Diseases and the NIH Clinical Center reports on the evolution of the SARS-CoV-2 virus within individuals. The team used HiFi sequencing to make this work possible.
The paper, which was published in PLoS Pathogens, comes from lead authors Sung Hee Ko, Elham Bayat Mokhtari, and Prakriti Mudvari, senior author Eli Boritz, and collaborators. They conceived the project to overcome a key challenge in tracking viral adaptation. “An important obstacle to understanding intra-individual evolution of SARS-CoV-2 is that standard sequencing and analytical procedures yield a single consensus sequence for each sample, rather than multiple sequences representing virus quasispecies diversity,” they write.
To address the issue, they developed a new method based on HiFi sequencing to focus on the 6.1 kb region of the SARS-CoV-2 genome encoding its surface proteins. They then conducted deep sequencing of eight individuals, yielding large numbers of fully phased S, E, and M gene sequences from each person. In one individual, the availability of four samples collected over time allowed for a longitudinal analysis of viral response to host immune pressure. The scientists had previously used HiFi sequencing to study the intra-individual evolution of HIV, and believed that the same approach could be useful during the COVID-19 pandemic.
The choice of HiFi sequencing, which builds a highly accurate sequence based on consensus calls from covering the same molecule over and over, gave the team an excellent view of viral evolution. When we asked senior author Eli Boritz about his choice of technology, he said, “By early 2020, we had been working for several years to use HiFi sequencing for high-throughput, single-copy, long-read HIV genetic analysis. Our approach in the HIV studies used unique molecular identifiers (UMIs) for error correction and drew on a short-read approach from Ron Swanstrom’s group and a PacBio approach from Jim Mullins’s group. As the pandemic took off around the world, we decided to adapt our approach to SARS-CoV-2. We didn’t know if this new virus would generate enough diversity to warrant our detailed sequence analysis, but we decided that it would be important to look.”
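The intuition behind consensus calling can be shown with a deliberately simplified sketch: take several noisy passes over the same molecule and keep the most common base at each position. The function `majority_consensus` and the example passes are hypothetical, and the sketch assumes equal-length, already aligned passes containing only substitution errors. Actual HiFi consensus generation, and the UMI-based error correction the team describes, are considerably more sophisticated.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the per-position majority base across equal-length passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("this toy example assumes passes of equal length")
    # zip(*passes) yields one tuple of bases per position (a 'column').
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Three hypothetical passes over the same molecule, each with at most one error.
passes = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(majority_consensus(passes))
```

Because random errors rarely hit the same position in most passes, the errors are outvoted and the consensus is more accurate than any single pass.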
The longitudinal analysis yielded results highly suggestive of natural selection, revealing four viral haplotypes harboring three mutations that arose independently in a single epitope. “These mutations arose coincident with a 6.2-fold rise in serum binding to spike and a transient increase in virus burden,” the scientists note. “We conclude that SARS-CoV-2 exhibits a capacity for rapid genetic adaptation that becomes detectable in vivo with the onset of humoral immunity, with the potential to contribute to delayed virologic clearance in the acute setting.”
In the other study participants for whom repeated sampling was not possible, the team found lower genetic diversity in the viruses sequenced. They hypothesize that this is likely the result of analyzing samples collected early in the infection process rather than after the host’s immune response has had time to select variants with mutated spike proteins.
We asked Eli Boritz about what’s next for his team. For future longitudinal studies, he told us, “it will be important … to sequence additional regions of the virus and to perform a comprehensive analysis of antiviral host responses, including neutralizing antibodies, T cells, and other mechanisms.” He also hopes to analyze viral samples from more complex cases, such as reinfections. “We hope these studies can teach us about the virus’s capacity for additional waves of escape variants in the future,” he said.
The team’s insights into viral evolution in a single person have important implications for COVID-19 treatment. “Our results also emphasize that early antiviral therapy or combinations of antivirals with distinct targets could have markedly higher virologic efficacy than monotherapy administered later in the disease course,” the scientists conclude.
It’s a moment three decades in the making: the first complete human genome assembly is here!
Reading this, you will no doubt feel some sense of déjà vu. After all, the human genome reference was pronounced “done” in 2000, 2001, and again in 2003. But any scientist who has used the reference since then knows that there has never been a single fully sequenced human genome. Until now.
HiFi Sequencing Enables the First Complete Sequence of a Human Genome
The Telomere-to-Telomere (T2T) Consortium, a large team of scientists from the National Human Genome Research Institute and dozens of other institutions, released a new preprint titled “The complete sequence of a human genome.” Lead authors Sergey Nurk, Sergey Koren, Arang Rhie, and Mikko Rautiainen, along with corresponding authors Evan Eichler, Karen Miga, and Adam Phillippy, as well as many collaborators, have now vanquished gaps and errors to deliver what they call “the first truly complete human reference genome.”
This tremendous effort incorporated several cutting-edge technologies, including HiFi sequencing from PacBio, to produce a gap-free, complete haploid human genome assembly based on a complete hydatidiform mole (CHM13). The goal was to create a novel resource with comprehensive, reliable genome data that avoids the gaps and errors that still mark the latest GRCh38 reference assembly. “The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, including all centromeric regions and the entire short arms of five human chromosomes,” Nurk et al. report.
This new reference “includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding,” the authors add. This represents “the largest improvement to the human reference genome since its initial release.”
HiFi sequencing was pivotal to this achievement. The scientists note that HiFi sequencing features “20 kbp read lengths and a median accuracy of 99.9%, which has resulted in unprecedented assembly accuracy with relatively minor adjustments to standard assembly approaches. …HiFi sequencing excels at differentiating subtly diverged repeat copies or haplotypes.”
HiFi Sequencing Removes Technological Barriers
The team had initially started with a strategy of using noisy ultralong nanopore-based reads to build an assembly backbone, which was then polished with other platforms. But they subsequently switched to accurate and long HiFi reads. “We shifted to a new strategy that leverages the combined accuracy and length of HiFi reads to enable assembly of highly repetitive centromeric satellite arrays and closely related segmental duplications,” they report. The assembly is based on a string graph built from HiFi reads and has an average consensus accuracy between Q67 and Q73, “far exceed[ing] the original Q40 definition of ‘finished’ sequence,” the authors add.
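For readers less familiar with Q scores: on the Phred scale, a quality value Q corresponds to an expected error probability of 10^(-Q/10). The short sketch below converts between the two so the figures quoted above are easier to compare. The function names are our own and chosen for illustration.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Expected per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality score for a per-base accuracy between 0 and 1."""
    return -10 * math.log10(1 - accuracy)


# Compare the 'finished' threshold with the assembly accuracy range quoted above.
for q in (40, 67, 73):
    bases_per_error = 1 / phred_to_error_rate(q)
    print(f"Q{q}: about one expected error per {bases_per_error:,.0f} bases")

# The 99.9% median read accuracy quoted earlier, expressed on the same scale.
print(f"99.9% accuracy is roughly Q{accuracy_to_phred(0.999):.0f}")
```

On this scale, Q40 means one expected error per 10,000 bases, while Q67 works out to roughly one per 5 million bases and Q73 to roughly one per 20 million.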
The new assembly, to which a Y chromosome sequence will be added in the near future, should be used in place of the GRCh38 reference for “all studies requiring a linear reference sequence,” the scientists suggest, noting that it is “more complete, representative, and accurate” than its predecessor and “substantially increases the number of known genes and repeats in the human genome.”
The team also notes that reanalysis of short-read public data sets such as the 1000 Genomes Project using the new reference already shows improvement compared to the GRCh38 reference, and that new phenotypic associations should be expected given the more complete reference genome.
HiFi Sequencing Powers the Next Phase of Genomic Discovery
“The complete, telomere-to-telomere assembly of a human genome marks a new era of genomics where no region of the genome is beyond reach,” the authors write.
“Highly accurate, long-read sequencing, combined with tailored algorithms, promises the de novo assembly of individual haplotypes and sequence-level resolution of complex structural variation. This will require the routine and complete de novo assembly of diploid human genomes, as planned by the Human Pangenome Reference Consortium.”
Ultimately, they anticipate that highly accurate long-read sequencing will lead to a “collection of high-quality, complete reference haplotypes [that] will transition the field away from a single linear reference and towards a reference pangenome that captures the full diversity of human genetic variation,” the team reports. “Ideally, every genome could be assembled at the quality achieved here, since the small variants recovered by short-read resequencing approaches represent only a fraction of total genomic variation.”
How to Get Started with HiFi Sequencing for Any Genome
Learn more about our whole genome sequencing application.
Have your questions about HiFi sequencing answered by a PacBio scientist.
2021 HiFi for Accuracy SMRT Grant Program – Apply between June 7-25 for your chance to win free HiFi sequencing.