This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
An exciting new paper from scientists at the National Institute of Allergy and Infectious Diseases and the NIH Clinical Center reports on the evolution of the SARS-CoV-2 virus within individuals. The team used HiFi sequencing to make this work possible.
The paper, which was published in PLoS Pathogens, comes from lead authors Sung Hee Ko, Elham Bayat Mokhtari, and Prakriti Mudvari; senior author Eli Boritz; and collaborators. They conceived the project to overcome a key challenge in tracking viral adaptation. “An important obstacle to understanding intra-individual evolution of SARS-CoV-2 is that standard sequencing and analytical procedures yield a single consensus sequence for each sample, rather than multiple sequences representing virus quasispecies diversity,” they write.
To address the issue, they developed a new method based on HiFi sequencing to focus on the 6.1 kb region of the SARS-CoV-2 genome encoding its surface proteins. They then conducted deep sequencing of eight individuals, yielding large numbers of fully phased S, E, and M gene sequences from each person. In one individual, the availability of four samples collected over time allowed for a longitudinal analysis of viral response to host immune pressure. The scientists had previously used HiFi sequencing to study the intra-individual evolution of HIV, and believed that the same approach could be useful during the COVID-19 pandemic.
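To make the consensus-versus-quasispecies distinction concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names and the toy reads are invented for illustration, and it assumes the reads are already error-corrected, fully phased, and of equal length. It shows how a per-position majority consensus can hide minor variants that remain visible when each full-length read is counted as its own haplotype.

```python
from collections import Counter


def consensus(reads):
    """Return the per-position majority base across equal-length reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


def haplotype_frequencies(reads):
    """Return the fraction of reads carrying each distinct full-length sequence."""
    counts = Counter(reads)
    total = len(reads)
    return {haplotype: n / total for haplotype, n in counts.items()}


# Toy data (invented): one dominant sequence plus two minor variants.
reads = ["ACGTAC"] * 6 + ["ACGTTC"] * 2 + ["GCGTAC"] * 2

print(consensus(reads))              # the single sequence a consensus approach reports
print(haplotype_frequencies(reads))  # the diversity that phased reads preserve
```

In this toy sample the consensus matches the dominant sequence, while the haplotype table still shows the two minor variants, each carried by a fifth of the reads.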
The choice of HiFi sequencing, which builds a highly accurate sequence from consensus calls made over repeated passes of the same molecule, gave the team an excellent view of viral evolution. When we asked senior author Eli Boritz about his choice of technology, he shared: “By early 2020, we had been working for several years to use HiFi sequencing for high-throughput, single-copy, long-read HIV genetic analysis. Our approach in the HIV studies used unique molecular identifiers (UMIs) for error correction and drew on a short-read approach from Ron Swanstrom’s group and a PacBio approach from Jim Mullins’s group. As the pandemic took off around the world, we decided to adapt our approach to SARS-CoV-2. We didn’t know if this new virus would generate enough diversity to warrant our detailed sequence analysis, but we decided that it would be important to look.”
The longitudinal analysis yielded results highly suggestive of natural selection, revealing four viral haplotypes harboring three mutations that arose independently in a single epitope. “These mutations arose coincident with a 6.2-fold rise in serum binding to spike and a transient increase in virus burden,” the scientists note. “We conclude that SARS-CoV-2 exhibits a capacity for rapid genetic adaptation that becomes detectable in vivo with the onset of humoral immunity, with the potential to contribute to delayed virologic clearance in the acute setting.”
In the other study participants for whom repeated sampling was not possible, the team found lower genetic diversity in the viruses sequenced. They hypothesize that this is likely the result of analyzing samples collected early in the infection process rather than after the host’s immune response has had time to select variants with mutated spike proteins.
We asked Eli Boritz about what’s next for his team. For future longitudinal studies, he told us, “it will be important … to sequence additional regions of the virus and to perform a comprehensive analysis of antiviral host responses, including neutralizing antibodies, T cells, and other mechanisms.” He also hopes to analyze viral samples from more complex cases, such as reinfections. “We hope these studies can teach us about the virus’s capacity for additional waves of escape variants in the future,” he said.
The team’s insights into viral evolution in a single person have important implications for COVID-19 treatment. “Our results also emphasize that early antiviral therapy or combinations of antivirals with distinct targets could have markedly higher virologic efficacy than monotherapy administered later in the disease course,” the scientists conclude.
It’s a moment three decades in the making: the first complete human genome assembly is here!
Reading this, you will no doubt feel some sense of déjà vu. After all, the human genome reference was pronounced “done” in 2000, 2001, and again in 2003. But any scientist who has used the reference since then knows that there has never been a single fully sequenced human genome. Until now.
HiFi Sequencing Enables the First Complete Sequence of a Human Genome
The Telomere-to-Telomere (T2T) Consortium, a large team of scientists from the National Human Genome Research Institute and dozens of other institutions, released a new preprint titled “The complete sequence of a human genome.” Lead authors Sergey Nurk, Sergey Koren, Arang Rhie, and Mikko Rautiainen, along with corresponding authors Evan Eichler, Karen Miga, and Adam Phillippy as well as many collaborators have now vanquished gaps and errors to deliver what they call “the first truly complete human reference genome.”
This tremendous effort incorporated several cutting-edge technologies, including HiFi sequencing from PacBio, to produce a gap-free, complete haploid human genome assembly based on a complete hydatidiform mole (CHM13). The goal was to create a novel resource with comprehensive, reliable genome data that avoids the gaps and errors that still mark the latest GRCh38 reference assembly. “The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, including all centromeric regions and the entire short arms of five human chromosomes,” Nurk et al. report.
This new reference “includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding,” the authors add. This represents “the largest improvement to the human reference genome since its initial release.”
HiFi sequencing was pivotal to this achievement. The scientists note that HiFi sequencing features “20 kbp read lengths and a median accuracy of 99.9%, which has resulted in unprecedented assembly accuracy with relatively minor adjustments to standard assembly approaches. …HiFi sequencing excels at differentiating subtly diverged repeat copies or haplotypes.”
HiFi Sequencing Removes Technological Barriers
The team had initially started with a strategy of using noisy ultralong nanopore-based reads to build an assembly backbone, which was then polished with other platforms. But they subsequently switched to accurate and long HiFi reads. “We shifted to a new strategy that leverages the combined accuracy and length of HiFi reads to enable assembly of highly repetitive centromeric satellite arrays and closely related segmental duplications,” they report. The assembly is based on a string graph built from HiFi reads and has an average consensus accuracy between Q67 and Q73, “far exceed[ing] the original Q40 definition of ‘finished’ sequence,” the authors add.
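For readers less familiar with quality values, the standard Phred convention relates a quality score Q to an error probability p by Q = -10 · log10(p). The short Python sketch below applies that convention to the figures quoted above. The function names are ours, chosen for illustration, and the formula is the general Phred definition rather than anything specific to the T2T analysis.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (for example 0.999) to a Phred quality score."""
    return -10 * math.log10(1 - accuracy)


# Quality values mentioned in the post: Q40 ("finished" sequence) and Q67 to Q73.
for q in (40, 67, 73):
    bases_per_error = 1 / phred_to_error(q)
    print(f"Q{q}: about 1 error per {bases_per_error:,.0f} bases")
```

Under this convention Q40 means one expected error per 10,000 bases, while Q67 to Q73 corresponds to roughly one error per 5 million to 20 million bases.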
The new assembly, to which a Y chromosome sequence will be added in the near future, should be used in place of the GRCh38 reference for “all studies requiring a linear reference sequence,” the scientists suggest, noting that it is “more complete, representative, and accurate” than its predecessor and “substantially increases the number of known genes and repeats in the human genome.”
The team also notes that reanalysis of short-read public data sets such as the 1000 Genomes Project using the new reference already shows improvement compared to the GRCh38 reference, and that new phenotypic associations should be expected given the more complete reference genome.
HiFi Sequencing Powers the Next Phase of Genomic Discovery
“The complete, telomere-to-telomere assembly of a human genome marks a new era of genomics where no region of the genome is beyond reach,” the authors write.
“Highly accurate, long-read sequencing, combined with tailored algorithms, promises the de novo assembly of individual haplotypes and sequence-level resolution of complex structural variation. This will require the routine and complete de novo assembly of diploid human genomes, as planned by the Human Pangenome Reference Consortium.”
Ultimately, they anticipate that highly accurate long-read sequencing will lead to a “collection of high-quality, complete reference haplotypes [that] will transition the field away from a single linear reference and towards a reference pangenome that captures the full diversity of human genetic variation,” the team reports. “Ideally, every genome could be assembled at the quality achieved here, since the small variants recovered by short-read resequencing approaches represent only a fraction of total genomic variation.”
How to Get Started with HiFi Sequencing for Any Genome
Learn more about our whole genome sequencing application.
Have your questions about HiFi sequencing answered by a PacBio scientist.
2021 HiFi for Accuracy SMRT Grant Program – Apply between June 7-25 for your chance to win free HiFi sequencing.
May is ALS Awareness Month, and we’re hoping to help raise awareness by shining a spotlight on two deserving publications from scientists at the University of Washington and at the Mayo Clinic. In both studies, researchers used PacBio technology to sequence targeted genomic regions and to discover and characterize complex pathogenic variants associated with ALS.
ALS, short for amyotrophic lateral sclerosis and commonly known as Lou Gehrig’s disease, is a progressive disease of the nervous system that causes loss of muscle control. The mean survival time for patients with ALS is just three to five years from diagnosis, and there is no cure for the disease.
At the University of Washington, scientists studied members of a large multigenerational family in which several people developed spontaneous ALS. By using barcoded amplicons and a multiplex strategy, the team analyzed the target WDR7 gene in nearly 300 samples and identified a variable number tandem repeat (VNTR) that is expanded in individuals with ALS.
Their paper, published in the American Journal of Human Genetics, details single-base resolution for a repeat that would have been virtually impossible to spot — let alone resolve with such granularity — using any other sequencing technology. “Our detailed interrogation of this VNTR demonstrates the value of high-depth, long-read sequencing of human-specific repetitive regions that expand in the genome,” the team reports.
Meanwhile, at the Mayo Clinic, scientists used PacBio’s CRISPR/Cas9-based No-Amp method to target, capture, and sequence a repeat expansion region associated with neurological diseases. The targeted element is the most common genetic cause of ALS, but the biological mechanism of how it causes disease has not been well understood. The team used PacBio sequencing to characterize the complete region, finding that individuals with fewer repeat copies had longer survival time than those with larger expansions. Results were recently published in the journal Brain.
“Our findings demonstrate that No-Amp sequencing is a powerful tool that enables the discovery of relevant clinicopathological associations, highlighting the important role played by the cerebellar size of the expanded repeat in C9orf72-linked diseases,” the scientists write.
These discoveries could open up new paths to identify people at risk of developing ALS. Congratulations to both of these teams, as well as other PacBio users who are making a difference for families affected by ALS!
Interested in learning more about this research?
- Watch Marka van Blitterswijk from the Mayo Clinic present, ‘Applying Targeted Long-read Sequencing to Assess an Expanded Repeat in C9orf72’
- See Meredith Course from the University of Washington present, ‘The Evolution and Function of a Large Tandem Repeat Associated with ALS’.
Visit our Neuroscience Research page to learn how PacBio sequencing provides a comprehensive understanding of the genetic basis of neurological disease.
If you are like most of us at PacBio, you likely learned how to extract DNA in a high school or college biology class, or maybe even in your kitchen. But as you moved on to more high-stakes experiments, you may have found that extracting DNA for sequencing in your lab isn’t always as straightforward as lyse, precipitate, wash, suspend. In this introduction to DNA extraction, we will share tips, tricks, and protocols to help make your DNA isolation easier!
For optimal results to power biological discovery, sample prep is a critical step in any sequencing project. And with long-read sequencing technologies, including HiFi sequencing, you not only want DNA free from nicks and degradation, but you also want long fragments (tens of kilobases) to achieve those coveted long reads.
Long-read sequencing expert and sample wrangler Olga Pettersson (@OlgaVPettersson) of SciLifeLab at Uppsala University, advises: “Aim for getting molecules as long as you can, as pure as you can, as fresh as you can.”
So, what are the factors that go into obtaining high-molecular-weight (HMW) DNA for sequencing? Jennifer Balacco (@JenBalacco) of the Vertebrate Genome Lab, which aims to sequence the genomes of all living vertebrate species and therefore has a wealth of experience with DNA extraction, points to sample type, the prep and storage of samples, and individualized extraction methods as the key components of successful DNA extraction. You can follow along as she shares her experience with many sample types in the video below and then explore our additional resources and considerations.
Watch this PacBio Virtual Global Summit presentation from Jennifer Balacco of the Vertebrate Genome Lab on DNA extraction approaches to achieve error-free genomes.
DNA Extraction Challenge #1: Sample Type
Both within the same organism and between species, there is variability in how readily HMW DNA can be extracted and how stable it is once extracted. DNA from liver, for example, is known to degrade quickly due to the enzymes that make a functioning liver, while DNA extracted from blood is typically more stable. Some plant species have phenolics and polysaccharides that interfere with extraction, and mollusks have high DNase activity that makes it difficult to store DNA for any amount of time.
If you have a choice on the type of sample you use, a cell-dense tissue with minimal potential contaminants is your best bet. For vertebrates, this means using tissues like blood, brain, kidney, or muscle. For some invertebrates there may be a mucous membrane that inhibits the ability to obtain high-quality DNA, and you might want to consider an additional DNA cleanup step to rid the extraction of contaminants.
When working with small arthropods, you can use an adult individual but may find that pupae or larvae are an easier DNA source than a tough, exoskeleton-covered adult. When planning a fungus sequencing project, consider culturing the sample in order to acquire a single isolate/individual in the case of macroscopic organisms or an isogenic population in the case of microorganisms. And finally, for plants, it is recommended to obtain the youngest leaf/shoot tissue from an individual plant that has been dark treated (kept out of light) for 24-72 hours.
DNA Extraction Challenge #2: Sample Prep and Storage
The second consideration for HMW DNA extraction, after you’ve decided what sample type to use, is how you will treat that sample. Sequencing adheres to the “garbage in, garbage out” rule, so it’s prudent to take care when prepping your samples. In most cases, the freshest sample will work best, followed by samples flash frozen with liquid nitrogen and stored at -80°C. This is because as soon as a tissue is taken from its living organism, it begins to release factors that degrade both DNA and RNA, making it a race against the clock to get the genetic material out intact.
Of course, we can’t always control how a sample is prepped or stored, and in those cases it’s generally worth trying to get the best DNA you can from any given sample. There are examples of ethanol-stored samples providing DNA of sufficient quality, as well as museum specimens being used for amplicon sequencing. However you decide to prep and store your sample prior to DNA extraction, the main aim is to minimize the time between sampling and stable storage in order to reduce enzymatic degradation of the genetic material within.
DNA Extraction Challenge #3: Choosing the Right Method
The final piece of the puzzle when it comes to obtaining HMW DNA for a sequencing project is the method used for extraction. There is no shortage of kits, protocols, and tutorials for DNA extraction, and after spending years trying to find the best one-size-fits-all extraction method for various sample types, we are fairly confident one doesn’t exist! However, there are some approaches that consistently produce plentiful HMW DNA that can be binned by sample type.
In general, “old school” methods using chemicals commonly found in molecular biology labs perform fairly well. For example, phenol and chloroform extractions work well for many tissues, though the chemicals used are dangerous. The cetyl trimethylammonium bromide (CTAB) method for extraction of DNA from plants is also a fairly robust way to yield good DNA. And once you understand the chemistry of how DNA is liberated from cells via these methods, you can tailor the protocols to meet the needs of individual species.
If you’re in the market for a tailored protocol, we encourage you to check out Extract DNA for PacBio, where we have collected many protocols from published projects, organized by organism type. However, if you’re looking for an easy, all-in-one DNA extraction kit to get you started on your sequencing journey, there are a few out there that have produced great DNA for HiFi sequencing; they are summarized in our DNA extraction technical note. If you are hoping to outsource this step to a DNA extraction lab, explore our Certified Service Providers, many of which offer DNA extraction as a service.
While there might not be a one-size-fits-all solution for extracting DNA, we hope our experience and those of our customers can help point you in the right direction for a successful HiFi sequencing project!
If you are ready to get started with sequencing or simply need help with choosing the best DNA extraction approach, connect with a PacBio Scientist.
If only we could track COVID-19 like we track the weather, with satellites and weather stations placed around the globe monitoring and sounding the alarm about potential storms, floods, droughts and other severe weather events.
A global pathogen surveillance network would save countless lives, and lessons learned from the current coronavirus pandemic could help make it possible, PacBio Chief Scientific Officer Jonas Korlach told Mendelspod host Theral Timpson (@theraltweet).
Korlach joined Brian Caveney, President and Chief Medical Officer of Labcorp, in a recent podcast to discuss SARS-CoV-2 viral surveillance and the trajectory of COVID research, vaccination and treatment.
PacBio has partnered with the national diagnostic testing company to support its large-scale SARS-CoV-2 testing, which has become part of the US Centers for Disease Control’s COVID-19 genomic surveillance effort. Labcorp has sequenced thousands of samples from around the country on its fleet of Sequel II Systems, and worked closely with PacBio to develop a new HiFiViral SARS-CoV-2 Workflow protocol to enable any laboratory to rapidly and efficiently power viral mutation surveillance using PacBio’s HiFi sequencing.
While rapid COVID-19 diagnostic testing is generally being done via PCR methods, there is still an important role for viral sequencing, Korlach said. PCR tests provide very limited information about genomic mutations and might not be able to identify which variant of the virus a person is infected with. HiFi sequencing on the PacBio systems can provide a highly detailed profile of the roughly 30,000-base SARS-CoV-2 genome, including specific mutations and whether multiple subtypes of the virus are present in an individual patient, which has been observed.
Labcorp is using both methods, Caveney said. The company has performed more than 38 million COVID-19 PCR tests, and sequenced more than 20,000 genomes with PacBio technology. Not only is the whole-genome sequencing useful in accelerating scientists’ understanding of the virus as it evolves, but it has helped Labcorp ensure its PCR tests continue to be sensitive to emerging variants and mutations.
“Our research and development team loves working with the PacBio equipment. They like the incredible ability to have high specificity with the long reads that we’re getting,” Caveney said. “It’s going to continue to be a very important research tool for both sides of the house — the diagnostic side, as well as the clinical research side — to make sure that the best medications, therapeutics and vaccines are coming to market.”
Another benefit of sequencing technology in the realm of infectious diseases is that it is a “universal measuring device,” Korlach said. Whether the pathogen is a virus or a bacterium, a sequencer can detect it.
“It’s COVID today and the variants of COVID tomorrow, but what about all the other infectious disease agents that for many decades have cost millions of lives?” Korlach said. “We now have opportunities to tackle them a lot better than we have in the past, using the COVID pandemic as a blueprint.”
Caveney agreed. “We’re so focused on COVID, but 30, 40, 50,000 Americans die every year from influenza. And we now have learning from COVID that might help us bring that number down in the future. That would be a great win, in spite of the tragedy we just went through.”
So what would it take to create a global pan-pathogen surveillance network?
Collaboration, between and among scientific communities, public health agencies, and private companies, Caveney said. An international standardization of nomenclature is also high on his wishlist, “so that regardless of the instruments or the technology used to do the sequencing, it results in information that can be compared and assimilated in a way that all scientists and doctors know what to do with it.”
Continued investment, Korlach said. “Are we willing to keep investing and focusing on making that change permanent and applying it to other infectious diseases, of really building out a permanent and stable network where the routine medical care is going to shift from measuring a temperature and looking in your mouth to getting samples genomically tested within days? That is a future that I think is possible, and that I would like to be part of trying to do our little part to make that happen.”
To learn more about genome surveillance and the benefits of PacBio sequencing, explore our COVID-19 sequencing tools and resources.
Today we’re pleased to announce the launch of a new HiFi Sequencing workflow along with a software update for the Sequel II and Sequel IIe Systems that will increase the number of HiFi reads at or above 99.9% accuracy (QV30) for whole genome sequencing-based applications. Together, these advances will improve the quality of HiFi Sequencing while providing an efficient and scalable workflow for sequencing hundreds to thousands of whole human genomes per year on Sequel Systems.
This high-throughput sequencing and analysis workflow release includes a new HiFi library prep protocol offering a three-fold reduction in DNA input, enabling HiFi sequencing with limited sample quantities (neonatal blood, tissue biopsies, and cell lines).
Developed in collaboration with Children’s Mercy Kansas City, the release supports the adoption of HiFi reads for comprehensive variant detection to better understand the genetic causes of rare and inherited diseases. In a statement announcing the release, Emily Farrow, Director of Lab Operations at Children’s Mercy Research Institute, said: “This new workflow provides efficiency in our lab where now two research scientists can comfortably produce one thousand HiFi libraries a year, with the hope of doubling the throughput for library prep by automated liquid handling currently tested in the laboratory.”
The release also features new enabling workflows for variant calling and analysis of the SARS-CoV-2 genome in combination with the recently released high-throughput COVID sequencing protocol developed in partnership with Labcorp.
Jasmine Pritchard, our Vice President of Product Marketing, said, “We see building enthusiasm in the market for HiFi sequencing and this new release demonstrates our commitment to continuously improving our already industry-leading accuracy and key aspects of the workflow. Our team is focused on delivering advancements across the full spectrum of our portfolio, from sample preparation to downstream analysis.”
The HiFi Sequencing and Software v10.1 Release is available to order today and includes the following features:
- New Consumables: SMRTbell Enzyme Clean Up Kit 2.0, Sequel II Primer v5, Polymerase Binding Kit 2.2
- HiFi Protocol: Updated HiFi Express protocol enabling reduced DNA input
- Sequel II ICS v10.1: On-instrument workflow improvements that simplify run set up, especially for multiplexed applications
- SMRT Link v10.1: Updates for Adaptive Loading, our new HiFiViral for SARS-CoV-2 analysis application, and improved Iso-Seq Analysis for multiplexed samples
We also invite you to watch our on-demand Rare Disease Week event to hear how scientists are using HiFi sequencing to help identify causative variants and increase solve rates in rare disease research.
Ready to get started with HiFi sequencing? Connect with a PacBio scientist for a free project or instrument consultation.
By Jonas Korlach, Chief Scientific Officer
Grapy dusks over tangerine fields. Potato-patch fog over beds of coral. Mountains, glaciers, forests, deserts, fertile farmland and seas with both Arctic and tropical biomes.
One of the most geographically and biologically diverse states, California is home to both the highest (Mount Whitney) and lowest (Death Valley) points in the 48 contiguous states, as well as to some of the world’s most exceptional trees — the tallest (coast redwood), most massive (Giant Sequoia), and oldest (bristlecone pine).
At PacBio, we are extremely fortunate to have this biodiversity in our back yard — almost literally. We didn’t have to travel far to take samples of the giant California redwood as part of a personal project to sequence its gigantic genome and transcriptome.
It’s one of the reasons we are excited to work with the California Conservation Genomics Project, a collaboration of scientists across the state that has selected more than 100 threatened, endangered or otherwise valuable species sampled from the full array of California ecosystems for HiFi sequencing and assembly.
The purpose of the $10 million state-funded project is to capture the genetic variation that exists across each species’ habitat, with the ultimate objective of informing smarter development and more effective conservation.
How can genetics inform conservation? More biodiversity means more resilient ecosystems, and conservationists have long focused on preserving habitats and studying the roles of species within ecosystems. But they are now recognizing the important role genetic variation can play in the long-term survival of a species.
Populations with high genetic diversity are more likely to contain individuals with a genetic makeup that allows them to survive new environmental pressures. Populations with low genetic diversity might not even survive the next big threat, so it is crucial to identify individuals with genetic variation in order to conserve the species’ ability to survive and evolve.
Threats to one population can threaten others, including ours. A collapsing ecosystem affects all those species who rely on it. So preserving biodiversity is also an exercise in self-preservation.
California will not be the only ecosystem to benefit from the CCGP research. In many ways, the state is a microcosm of what’s happening to biodiversity around the world. It faces threats similar to those faced by habitats on other continents: climate change, wildfires, droughts, and an ever-expanding population that encroaches onto formerly wild lands.
Its efforts will be boosted by other international initiatives, such as the United Kingdom’s Darwin Tree of Life Project, Australia’s Oz Mammals Genomics Initiative, the Vertebrate Genomes Project and The Earth BioGenome Project, whose ambitious goal is to sequence the DNA of 1.5 million species by 2030. We’re proud that PacBio technology is being used in all of these projects. You can learn more about the biodiversity initiatives PacBio sequencing is supporting in my recent presentation at the Senckenberg Biodiversity Genomics Symposium.
While COVID-19 has focused the attention of the scientific community — including our own — on pathogen detection, surveillance and drug development, lockdown has also spurred a renewed appreciation of nature. How many of us have sought solace in a temple of trees — in some cases, amongst towering columns of sequoias older than the Parthenon?
On this Earth Day, I urge all of you to do your part to “Restore our Earth,” whether that be committing to a home conservation project, or supporting an international one. At PacBio, we will be participating in public awareness campaigns and contributing our time and expertise in support of these important biodiversity initiatives. Let’s make every day Earth Day.
When size matters and you need to be able to detect both single nucleotide changes and large repeated sequences, SMRT Sequencing on the Sequel II System is the way to go, concluded rare disease researchers at Centre de Recherche en Myologie at Sorbonne Université/INSERM.
Stéphanie Tomé (@TomeStephanie) and colleagues used the highly sensitive, comprehensive long-read sequencing to investigate myotonic dystrophy type 1 (DM1), the most complex and variable trinucleotide repeat disorder, caused by an unstable CTG repeat expansion that can reach up to 4,000 triplets in those affected most severely with the disease.
As reported previously, the length of these repeated CTG sections and any interruptions in the sequences have been found to correlate with the severity and onset of symptoms of the neuromuscular autosomal disorder, which is the most common form of inherited muscular dystrophy in adults.
The highly variable clinical presentation of DM1 and current limitations in methods to determine the size and variant repeat interruptions of the large CTG repeat expansions make genetic counseling for the condition very complex, so Tomé turned to PacBio sequencing to better understand this mutation. She successfully applied for a 2019 Targeted Sequencing SMRT Grant, and the results of her work were recently published in the International Journal of Molecular Sciences.
“Better characterization of expanded alleles in DM1 patients can significantly improve prognosis and genetic counseling, not only in DM1 but also for other tandem DNA repeat disorders,” Tomé said.
Inherited CTG repeat expansion size and the level of somatic mosaicism are traditionally evaluated by Southern blot and polymerase chain reaction (PCR), which do not provide any information on the sequence of CTG repeat expansion. Triplet-primed PCR testing may detect the presence of interruptions at the 5’ and 3’ ends of the CTG repeat expansion, and short-read sequencing can help identify further interruptions, but the methods give no information about the middle of the sequence.
Using the Sequel II System, Tomé’s team was able to sequence repeat expansions 1,000 CTG triplets long, detect a single CAG and multiple CCG interruptions, and estimate somatic mosaicism (the occurrence of two genetically distinct populations of cells within an individual derived from a postzygotic mutation) within two DM1 families, with more accuracy than conventional PCR.
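To make the idea of "interruptions" concrete, here is a minimal Python sketch of how non-CTG triplets could be located in a repeat tract. It is an illustration only, not the analysis used in the study: the function name `find_interruptions` and the toy sequence are hypothetical, and the sketch assumes the tract has already been extracted from a read, is in frame, and contains no insertion or deletion errors.

```python
def find_interruptions(tract, unit="CTG"):
    """Return (triplet_index, triplet) for every triplet that differs from the repeat unit.

    Assumes the tract is in frame with the repeat unit and has no indels.
    """
    tract = tract.upper()
    step = len(unit)
    if len(tract) % step != 0:
        raise ValueError("tract length must be a multiple of the repeat unit length")
    triplets = [tract[i:i + step] for i in range(0, len(tract), step)]
    return [(idx, t) for idx, t in enumerate(triplets) if t != unit]


# Hypothetical toy tract: 5 CTG, one CCG, 3 CTG, one CAG, 2 CTG (12 triplets in total)
toy_tract = "CTG" * 5 + "CCG" + "CTG" * 3 + "CAG" + "CTG" * 2
interruptions = find_interruptions(toy_tract)
print(len(toy_tract) // 3, "triplets;", interruptions)
# 12 triplets; [(5, 'CCG'), (9, 'CAG')]
```

Because a long read spans the whole tract, both the repeat length and the position of each interruption come from the same molecule, which is the information the text says Southern blot and PCR sizing cannot provide.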
The data enabled them to gain insights into the genetic changes within the families in the study, as well as some observations applicable to the nature of DM1. They revealed the existence of de novo CCG interruptions associated with CTG stabilization/contraction across generations in one of the families. And the heterogeneity of the number and type of interruptions observed in the interrupted expanded alleles suggested new mechanisms leading to base substitution in the sequence and/or duplication of existing interruptions in the repeated sequence. These could be caused by multiple processes, including spontaneous DNA damage, DNA repair and DNA polymerase errors occurring in germ cells and somatic cells throughout embryogenesis and the lifetime of those affected by DM1.
“Our study reinforced the idea that interrupted alleles do not originate from an ancestral/normal allele, but from unknown mechanisms occurring both in the germline and in somatic cells,” the study concluded.
“SMRT Sequencing opens new avenues for DM1 disease and will provide a better understanding of the clinical and genetic variability observed in DM1 through global analysis,” Tomé added. “This new technology is a straightforward way to detect clinically significant repeat changes and estimate the size of the repeat in blood using targeted sequencing.”
To learn more about how scientists are using highly accurate long-read sequencing in large-scale studies to help identify causative variants, increase solve rates in rare disease research, and support the development of diagnostics for rare and undiagnosed diseases, watch on-demand presentations from PacBio Neuroscience Day, and register for the Rare Disease Week virtual event, April 27-29.
Vaccine safety is of the utmost importance. Respiratory syncytial virus (RSV) is the most common cause of severe lower respiratory tract illness in infants and young children. Much like the flu, it can also cause severe disease in the elderly or immunocompromised adults, making it an important target for vaccine development. In a recently published study, researchers used PacBio long-read sequencing to evaluate the genetic stability of a live-attenuated RSV vaccine candidate and observed a previously unknown adaptation mechanism that was missed by short-read sequencing.
Codon-pair deoptimization involves recoding of open reading frames (ORFs) to reduce protein expression and is used as a mechanism for creating live-attenuated vaccine candidates. It is important, however, to understand whether deoptimized viruses could accumulate mutations under selective pressure that might lead to de-attenuation.
To this end, researchers of the Laboratory of Infectious Diseases at the National Institute of Allergy and Infectious Diseases used a combination of sequencing technologies to examine codon-pair deoptimization in human RSV under selective pressure.
In their PNAS paper “Rescue of codon-pair deoptimized respiratory syncytial virus by the emergence of genomes with very large internal deletions that complemented replication,” Cyril Le Nouën et al. tested the genetic stability of a live-RSV vaccine candidate that had been attenuated by codon-pair deoptimization of its glycoprotein genes. As one of the hallmarks of attenuation, the replication of this RSV strain was reduced at higher temperatures. This feature was used to test stability of the attenuation: the virus was passaged in cell culture over several months at continuously increasing temperatures. After each passage, the entire virus genome (about 15 kb) was sequenced to identify mutations.
The researchers discovered that, while the RSV strain accumulated point mutations, these had minimal effect on viral replication. Unexpectedly, however, using a combination of long-range PCR and PacBio HiFi sequencing, the scientists identified genomes with large deletions that appeared early in the serial passage and became the dominant species. These large-deletion genomes rescued RSV glycoprotein expression, thereby restoring replication of the deoptimized virus. The authors hypothesized that these large-deletion genomes arose through polymerase jumping.
“Under selective pressure, Large Deletion (LD) genomes were selected to restore rather than to inhibit the replication of a single-stranded RNA virus, attenuated by [codon-pair deoptimization] of two ORFs,” the authors report. Such a mechanism of compensation was previously unknown for RNA viruses and suggests that the accumulation of DI (defective interfering) genomes has to be carefully investigated during the generation and evaluation of live-attenuated vaccine candidates.
With growing interest in viral sequencing and vaccine development, understanding how viruses adapt under selective pressure is essential. PacBio’s long-read sequencing played an important role in this study by identifying large deletions that would have otherwise been missed had only short-read methods been used. As first author Le Nouën notes, “Long-range deep sequencing is a useful method to understand the virus population dynamics and specifically how mutations co-evolve over time on a viral genome.”
Today we’re pleased to announce the three winners of our latest SMRT Grant, which called for teams of researchers and collaborative projects that could be addressed using the power of HiFi sequencing. The winners are seeking to answer a diverse set of questions, from mussel-hopping transmissible cancer to the power of pistachios to help tackle climate change to sex determination in bearded dragons.
The 2020 HiFi for All – Collaborations SMRT Grant Program was open to scientists worldwide and offered three winning projects awards of up to 10 SMRT Cells 8M and sequencing on the Sequel II or IIe System by one of our service providers and co-sponsors. We received many truly compelling proposals, featuring teams from across the life sciences and beyond, and selecting three winners was quite a challenge. Here is a glimpse into how the winning teams will use HiFi sequencing to advance their science.
Cancer or Infectious Disease? Unblurring the Line in Bivalves
Cancers may evolve as they mutate and divide, but they almost always lead to an evolutionary dead end with the death or remission of their hosts. However, in a handful of cases, cancer cells have been shown to spread beyond their original hosts. In these transmissible cancers, cells themselves jump from individual to individual, spreading through the environment.
An inter-continental team of biologists and geneticists, led by Michael Metzger (Pacific Northwest Research Institute), will investigate this type of cancer in marine mussels, where molecular analysis has shown that a bivalve transmissible neoplasia (BTN) that arose in a single Mytilus trossulus individual now infects four different Mytilus species around the world.
Among the collaborators, Nicolas Bierne (CNRS – University of Montpellier, France) will provide samples from M. edulis in Europe; Petr Strelkov and Maria Skazina (St. Petersburg State University, Russia) and Nelly Odintsova (Far Eastern Branch of the Russian Academy of Sciences) will provide M. trossulus samples from their studies in the Sea of Japan; Artur Burzynski (IO PAN, Poland) will aid in analysis of a sample from Northern Europe; and Gloria Arriagada (Universidad Andrés Bello, Chile) will provide samples of M. chilensis from South America.
Together, this team has assembled the most widespread marine transmissible cancer lineage known and aims to use HiFi sequencing to detect and phase somatic variants to understand the selective pressures underlying this fatal and unexplained phenomenon.
“We are really excited about what the HiFi data and this collaboration will allow us to see. This single lineage of transmissible cancer that began in a single animal has spread into populations of marine mussels around the world, and by working together, we will be able to untangle the genetic changes that have shaped its evolution.”
– Michael Metzger, Pacific Northwest Research Institute
Sequencing for this project will be provided by the PacBio Certified Service Provider University of Louisville’s Department of Biochemistry and Molecular Genetics.
Tackling a Challenging Genome That Could Help Address Climate Change
Winner: Esaú Martínez, CIAG-IRIAF, Spain
As a highly nutritious crop adapted to arid conditions, pistachio has become popular for crop replacement in regions affected by climate change. However, even this resilient species is starting to suffer the effects of warmer winters, which cause a lack of chilling accumulation that affects bud dormancy breaking, bud burst, and flowering.
To make the most of the crop’s tolerance to drought and warm climates, a team of scientists from across Europe and the United States is working to find resilience mechanisms by sequencing the large genetic variation of cultivated pistachio.
Led by Martínez, the team will create a pangenome collection of six highly heterozygous cultivars, with extensive haplotype diversity that can be exploited for breeding. The team hopes the pangenomes will serve as a critical starting point toward the long-term goal of using HiFi sequencing to characterize genomic diversity in global pistachio collections. Additionally, the team will generate a pantranscriptome using the PacBio full-length RNA sequencing (Iso-Seq) method to precisely annotate the pangenome and identify transcript variants related to climate adaptation.
Collaborators include Adela Mena (IVICAM-IRIAF, Spain); Antonio Giovino and Luigi Cattivelli (@luigicattivelli, Council for Agricultural Research and Economics, Italy); Annalisa Marchese and Francesco Paolo Marra (University of Palermo, Italy); and Pablo Carbonell (@pcarbonellb, Max Planck Institute for Developmental Biology, Germany). Grey Monroe (@grey_monroe, University of California-Davis) will also take part to identify functional variants contributing to production under warmer climates.
“We are extremely delighted for the opportunity to work together with PacBio in this project. Pistachio is a highly nutritious crop adapted to arid conditions. However, pistachio farmers are already starting to experience the negative effects of climate change. We believe HiFi is the perfect technology for the development of functional genomic resources for pistachio breeding. Combining pangenome and pantranscriptome approaches we will identify functional variants enabling the sustainability of the crop under warmer conditions.”
– Esaú Martínez, CIAG-IRIAF
Sequencing for this project will be provided by the PacBio Certified Service Provider GENTYANE, part of the Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE).
Unraveling the Mechanisms of Sex Determination in Reptiles
In vertebrates, sex is determined either by genetic factors on sex chromosomes (genetic sex determination; GSD) or by environmental cues, such as egg incubation temperature (temperature-dependent sex determination; TSD).
In reptiles, both modes are common, but the mechanisms underpinning reptile GSD and TSD remain mysterious. High repeat content and close homology between the sex chromosomes in reptiles have thwarted previous attempts at their assembly and phasing to identify GSD candidates, and the gene/s controlling TSD may act at any level in the complex regulatory cascade governing sexual differentiation, potentially implicating a multitude of genes across all chromosomes.
A team of researchers from across Australia, led by Ira Deveson (Garvan Institute), will attempt to unravel the mystery by sequencing the genome and transcriptome of the bearded dragon lizard, Pogona vitticeps, a unique model organism in which sex is shaped by both genotype and temperature. Dragons have a GSD system wherein chrZW embryos develop as female and chrZZ as male. However, at high egg-incubation temperatures, this is overridden so that both chrZZ and chrZW embryos develop as female.
Additional RNA sequencing on adult tissues and embryonic developmental stages will allow the team to generate high-quality, allele-specific transcriptome annotations and uncover the primary transcriptional signatures of both GSD and TSD.
Collaborators include Arthur Georges and Sarah Whiteley (University of Canberra); Hardip Patel and Yu Lin (Australian National University); Parwinder Kaur (UWA, Australia); and Andre Reis (Garvan Institute).
“The daunting task of complete assembly and phasing of reptilian ZW sex chromosomes requires long reads with very high per-base accuracy. I believe that PacBio HiFi sequencing is the only technology that can deliver this.”
– Ira Deveson, Garvan Institute
Sequencing for this project will be provided by Nucleome Informatics.
Congratulations to all our HiFi for All – Collaborations SMRT Grant winners! And thank you to our co-sponsors for teaming up with PacBio to make these SMRT Grants possible. Explore the 2021 SMRT Grant Programs to apply to have your project funded and learn more about HiFi sequencing.
It’s been a year since we took a little field trip to Stanford to collect samples from the giant California redwood (Sequoia sempervirens) with the goal of assembling its ginormous 27 Gb genome.
What would have been considered a herculean effort not that many years ago was accomplished in only a few weeks by a handful of personnel —Emily Hatas (@EmilyHatas), Greg Young (@PacbioGreg), Michelle Vierra (@the_mvierra), and Greg Concepcion (@phototrophic) — in their spare time.
As detailed in this blog post, the crew put together an assembly with 22-fold coverage in just 17 days: 4 days of sample prep, 7 days of sequencing, and 6 days for assembly. With a little more time on their hands, and enough HiFi library to go around, the team embarked on more sequencing to create an even better assembly, with 33-fold coverage and a contig N50 of 3.8 Mb.
Wanting to take things even further, Iso-Seq analysis expert Elizabeth Tseng (@magdoll) delved deep into sequences of transcripts from the redwood’s needles.
The results from two Sequel II SMRT Cells (a total of 5.3 million full-length reads) were mapped to the hifiasm v12 assembly of the PacBio redwood genome, yielding 336,853 high-quality Iso-Seq transcripts, with 69,198 mapped loci and 205,792 unique, full-length transcripts.
The mapped transcripts ranged from 50 bp to 14.2 kb with a mean length of 2.9 kb. While most of the loci had 1–5 isoforms, there were many that displayed complex alternative splicing patterns, highlighting the power of full-length transcript sequencing.
“I found several aspects of the Iso-Seq data exciting,” Tseng noted in a Medium post about the work. “One was the ability to see alternative splicing. Another was the ability to predict ORFs directly from the sequences.”
The exercise was also a good test of the IsoPhase isoform phasing method that Tseng initially developed for maize, a diploid genome. Would it work for a hexaploid genome?
By combining phased genome information with phased transcriptome data, Tseng was able to identify five distinct alleles, as well as genes that were likely to be homologous.
Lastly, Tseng used the Iso-Seq data — and another tool, Cogent — to assess the quality of the redwood genome assembly.
“The high mappability of the Iso-Seq data to the PacBio genome has shown that the genome assembly is quite complete in terms of coding regions,” she said. “Missing genes or difficult-to-assemble gene regions can be assessed using Iso-Seq transcripts.”
Want To See More Redwood Iso-Seq Analysis? Dig In! We’ve released the Iso-Seq dataset, including the transcript sequences, GFF files, BLASTN hits, and IsoPhase and Cogent results. We welcome the community to use this dataset for research and tool development, and to give us feedback.
Data for the redwood genome has also been made publicly available along with the updated genome assembly and can be found here.
As new strains of the SARS-CoV-2 virus emerge, including variants that appear to make the virus more contagious or potentially more likely to affect the efficacy of the new vaccines, it’s clear that continued genomic surveillance will be essential as we try to rein in the COVID-19 pandemic.
Fortunately, scientists have been using all the tools at their disposal, including SMRT Sequencing platforms from PacBio. In a talk presented at a virtual American Society for Microbiology event in December 2020, Labcorp’s Michael Levandoski spoke about using the Sequel II System for mapping COVID-19 outbreaks by location and over time.
As a reference lab taking samples from all over the U.S., Labcorp has unique insight into the genetic path of the virus in the world’s largest outbreak. “The samples we collect represent circulating virus in the population,” Levandoski noted, adding that the company has tested more than 3.4 million positive samples as of February 2021. For each sample collected, metadata about timing, location, and patient demographics are recorded to create a truly valuable data set. The company’s large-scale SARS-CoV-2 sequencing project was built on a Sequel II System workflow, analyzing remnants of samples that have tested positive with diagnostic assays.
Of course, that means Labcorp scientists are working with low-input samples, for which they use two pools of overlapping 1.2 kb amplicons to sequence the whole viral genome at a pace of 600 to 1,000 genomes per SMRT Cell using HiFi reads. Levandoski’s team has sequenced more than 17,000 genomes as of February 26, 2021 and reports that HiFi assemblies offer very high resolution without missing any regions, so scientists can identify new mutations with confidence. “We’re able to detect new mutations and variants in the population as they appear,” he said. The team has nearly 20,000 archived samples collected from before March 15th that are in the pipeline for sequencing, and new samples will continue to be sequenced for pathogen surveillance purposes.
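As a rough illustration of the two-pool design and the throughput figures above, the Python sketch below tiles a genome with overlapping 1.2 kb amplicons, alternates them between two pools so that overlapping neighbours are never amplified together, and estimates how many SMRT Cells 17,000 genomes would need at 600 to 1,000 genomes per cell. The 100 bp overlap, the function names, and the truncated final amplicon are assumptions made for illustration; this is not Labcorp's primer design. The genome length is that of the Wuhan-Hu-1 reference (NC_045512.2).

```python
import math

SARS_COV_2_REF_LEN = 29_903  # length of the Wuhan-Hu-1 reference genome (NC_045512.2)


def tile_amplicons(genome_len, amplicon_len=1200, overlap=100):
    """Tile a genome with overlapping amplicons, alternating between pool 1 and pool 2.

    Returns (start, end, pool) tuples with 0-based, end-exclusive coordinates.
    The overlap is an illustrative assumption, not a published primer scheme.
    """
    if overlap < 0 or overlap * 2 > amplicon_len:
        raise ValueError("overlap must be between 0 and half the amplicon length")
    step = amplicon_len - overlap
    amplicons = []
    start = 0
    while True:
        end = min(start + amplicon_len, genome_len)  # last amplicon is truncated
        pool = 1 + len(amplicons) % 2  # neighbours land in different pools
        amplicons.append((start, end, pool))
        if end == genome_len:
            break
        start += step
    return amplicons


def smrt_cells_needed(n_genomes, genomes_per_cell):
    """Number of SMRT Cells needed at a given multiplexing level."""
    return math.ceil(n_genomes / genomes_per_cell)


amplicons = tile_amplicons(SARS_COV_2_REF_LEN)
print(len(amplicons), "amplicons across two pools")
print(smrt_cells_needed(17_000, 1_000), "to", smrt_cells_needed(17_000, 600),
      "SMRT Cells for 17,000 genomes")
```

Under these assumptions the sketch yields 28 amplicons and an estimate of 17 to 29 SMRT Cells; a real scheme would place primers according to sequence constraints rather than at fixed intervals.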
With such a small genome, though, is long-read sequencing really necessary? Levandoski thinks so: he told conference attendees that short-read amplicon sequencing could miss new mutations that are key to monitoring transmission and the outcome of vaccinations. By ensuring that all mutations are detected with highly accurate HiFi reads, his team can check each new genetic change to determine whether it’s in a clinically relevant region of the genome.
As a result of this ongoing surveillance work, Labcorp was awarded a sequencing surveillance contract with the US Centers for Disease Control as a part of their efforts to track and learn more about SARS-CoV-2 as it evolves and spreads throughout the country.
Download the Labcorp protocol and learn more about the benefits of using PacBio sequencing for SARS-CoV-2 surveillance.
Sunday is Rare Disease Day – a time to honor the patients, families, caregivers, and healthcare professionals who are part of the rare disease community.
At PacBio, we are passionate about supporting this community and providing tools that help improve the ability of scientists and clinicians to deliver valuable answers to families and reduce what can be a years-long diagnostic odyssey. And while each ‘rare’ disease may affect a limited number of people, collectively these diseases affect hundreds of millions of people around the world.
Since we last celebrated this special day, we’ve been particularly excited by the progress made by cutting-edge scientists and clinicians who are applying new technologies to find the genetic root causes of these diseases. Leading into Rare Disease Day, we’d like to highlight and acknowledge the work of these scientists who are striving to improve the lives of those affected by rare diseases.
In Missouri, the team at Children’s Mercy Kansas City recently announced the opening of a massive new pediatric research facility housing the Children’s Mercy Research Institute (CMRI). The institute, established in 2015 to accelerate precise diagnoses and treatments for complex childhood diseases, is built on a translational approach that brings science and medicine together seamlessly.
One of the institute’s most important research projects is Genomic Answers for Kids (GA4K), a first-of-its-kind pediatric data repository that is collecting genomic data and health information from 30,000 children and their families during the next seven years to create a database of 100,000 genomes. More than 2,230 families with rare disease have enrolled in the program to date, which has resulted in more than 10,200 new genomic analyses and more than 250 genetic diagnoses, and has already contributed to the reporting of 10 new disease genes.
GA4K focuses on rare diseases and has been solving previously unsolvable cases by implementing highly accurate long-read sequencing, known as HiFi sequencing. Based on early successes, the team has scaled up its capacity with additional Sequel IIe Systems and aims to use HiFi whole genome sequencing for approximately 1,000 cases that went unsolved after the preliminary short-read exome analysis.
Meanwhile, in Alabama, scientists at the HudsonAlpha Institute for Biotechnology recently announced that they found likely pathogenic variants in two pediatric rare disease cases that had remained unsolved using short-read sequencing. In both cases, the patients suffered from neurodevelopmental disorders. The scientists were able to pinpoint the disease-causing genetic variants through whole genome sequencing of parent-proband trios. One of the pathogenic variants was a 7 kb insertion in the CDKL5 gene, while in the other instance an extensive structural variation was highlighted. Both variant types are known to be challenging for short-read sequencing technologies and were therefore not discovered in the preliminary analysis.
“The ability to find so many variants that were previously missed is exciting, and holds great promise for diagnostic testing in the future,” says HudsonAlpha Faculty Investigator Greg Cooper, PhD. “Long-read genome sequencing will become a powerful tool for research and clinical testing over the next few years.”
One of the earliest examples of how PacBio sequencing technology could make a difference for rare disease cases came from the Stanford lab of Euan Ashley, a noted cardiologist who just released a new book, The Genome Odyssey: Medical Mysteries and the Incredible Quest to Solve Them. The book includes, among many others, a fascinating case of Carney complex in an individual who had suffered a series of tumors in his heart and glands, for whom eight years of genetic analyses had produced no firm answers.
These are just a few of the many great advancements among rare disease experts that are making new inroads into tough cases with HiFi sequencing. It is critical to remember that each of these explained cases represents a family that is now closer to the end of their diagnostic odyssey, potential treatment options, and renewed hope for healthier futures. We send our sincere gratitude to them and everyone working hard to accelerate the development of medical advancements in rare disease research.
If you’d like to participate in this wonderful community, take a look at these upcoming events in support of rare disease research awareness, funding and education:
- Rare Disease Day strives to raise awareness amongst the public and decision-makers about rare diseases and their impact on patients’ lives – to show your support, you can use your social channels to amplify and tag #RareDiseaseDay
- Collaborate with and support Children’s Mercy’s Genomic Answers for Kids program by nominating a patient for participation and/or donating to support their vision
- The HudsonAlpha team is hosting the Double Helix Dash in April, a virtual 5K to support childhood genetic disorders research – anyone can participate!
- PacBio is hosting a 3-day virtual event in April focused on the genetics of rare disease – register to attend and hear firsthand from scientists and clinicians on their recent discoveries
To learn more about how PacBio HiFi sequencing is helping advance our understanding of rare disease, visit our rare disease resource page.
What does the ideal genome assembly look like? High-quality, free of errors, with no gaps, and all haplotypes resolved.
It’s a big ask, especially with challenging genomes like plants that are rich in repetitive content with high levels of heterozygosity and complex polyploidy. Moreover, such assemblies often require a combination of technologies, such as sequencing plus optical mapping.
But a team of scientists at the King Abdullah University of Science and Technology (KAUST) Core Labs (@kaust_corelabs) proved it is possible by using one technology, PacBio HiFi Sequencing, in just seven days.
Their recent preprint introduced LeafGo, a streamlined workflow able to produce a high-quality draft plant genome from plant tissue without using additional scaffolding technologies.
The rapid, one-pass approach was tested on two different Eucalyptus species, E. rudis and E. camaldulensis.
There are more than 800 eucalypt species, but only three genomes have been published: E. grandis, E. pauciflora and E. camaldulensis. The high-quality draft E. camaldulensis genome produced by LeafGo is an improvement upon those highly fragmented genomes, the KAUST team wrote.
Their assembly of E. rudis, a close relative of E. camaldulensis that inhabits a different ecological niche, is the first for that species.
“The two genomes sequenced here will improve our genomic knowledge of eucalypts, which at the moment is relatively sparse, and will assist with conservation issues and commercial uses,” they wrote.
The team tested both continuous long read (CLR) and HiFi circular consensus sequencing (CCS) data, and were especially impressed with the results from HiFi reads — “the higher base-level accuracy given by HiFi improves the assembly considerably, thus removing the need for polishing with short-read sequencing.”
“HiFi assemblies demanded less computational requirements, had higher BUSCO scores, showed several fold improvement of contig N50/N90 and L50/L90, and generated more complete genome assemblies,” the authors wrote.
“In fact, our HiFi sequencing data, assembled with hifiasm, produced near-chromosome level haploid draft genomes,” they added.
“One of the main advantages for our chosen genome assembly workflow, using hifiasm with HiFi reads, are the savings in time and compute requirements, all with minimal manual intervention.”
The estimated total time from raw reads to HiFi data to the assembly of a high-quality contiguous draft for a haploid genome of 0.6 to 1.0 Gb is approximately one day, they wrote. Assembling the HiFi data using hifiasm took 80 minutes for E. rudis (23x coverage) and 120 minutes for E. camaldulensis (27x coverage).
“When combined with time estimates of HMW DNA extraction (one day), HiFi library preparation and sequencing (five days) and assembly; a high-quality draft genome can be prepared from plant samples in seven days, depending on available compute resources,” the authors stated.
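The seven-day figure and the coverage values quoted above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the stage durations come from the quoted estimates, while pairing 23x with a 0.6 Gb genome and 27x with a 1.0 Gb genome is a hypothetical choice made only to bracket the data yield (the post does not state each species' genome size), and `required_yield_gb` is an illustrative helper name.

```python
# Stage durations in days, taken from the estimates quoted in the text
timeline_days = {
    "HMW DNA extraction": 1,
    "HiFi library preparation and sequencing": 5,
    "raw reads to assembled draft": 1,  # "approximately one day"
}
total_days = sum(timeline_days.values())


def required_yield_gb(genome_size_gb, coverage):
    """HiFi bases (in Gb) needed to reach a target coverage of a haploid genome."""
    return genome_size_gb * coverage


print(f"Total turnaround: {total_days} days")
# Hypothetical pairings spanning the 0.6 to 1.0 Gb range and the 23x to 27x coverages
for size_gb, coverage in [(0.6, 23), (1.0, 27)]:
    yield_gb = required_yield_gb(size_gb, coverage)
    print(f"{size_gb} Gb genome at {coverage}x needs about {yield_gb:.1f} Gb of HiFi reads")
```

Coverage here is simply total sequenced bases divided by haploid genome size, so the same helper can be rearranged to estimate coverage from a known yield.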
The team also created a modified Qiagen Genomic protocol in order to tackle the challenge of extracting high molecular weight DNA from the Eucalyptus species, which is difficult due to their high phenolic and polysaccharide content.
“Our extraction protocol generated high pure and copious amounts of HMW DNA within a day and using minimal resources and effort,” they wrote.
The authors say they hope LeafGo will be a valuable tool for global initiatives to sequence and assemble genomes for many thousands of eukaryotic life forms that do not yet have published standardized workflows.
Genome assembly statistics for two Eucalyptus species
This blog post has been updated; it was originally published in September 2016.
In recent interactions with the scientific community, we’ve seen a growing number of questions around scaffolding genome assemblies. We thought it might be useful to review the concepts behind contigs and scaffolds, as well as the circumstances in which one might want to scaffold a high-quality PacBio genome assembly.
Contigs vs. Scaffolds
Contigs are continuous stretches of sequence containing only A, C, G, or T bases without gaps. SMRT Sequencing has all of the necessary performance characteristics – long reads, lack of sequence-context bias, and high accuracy – to generate contiguous genome assemblies with megabase-sized contigs. Ultra-long contigs provide complete and uninterrupted sequence information across full genes, and more recently even allow separation of the different chromosomes for diploid and polyploid organisms.
The unprecedented quality of PacBio highly accurate long reads – known as HiFi reads – has been described as “the most effective standalone technology for de novo assembly” in a study focused on sequencing the CHM13 human cell line, which yielded an assembly contig N50 of 29.5 Mb and a Phred quality score of Q45. HiFi reads have also enabled generating reference-quality de novo assemblies of many plant and animal species, population-specific human assemblies and the first fully complete sequence of a human autosome – chromosome 8, including the centromeres. Even large and complex plant genomes like the California Redwood, a 27 Gb hexaploid, can be readily assembled with high contiguity using HiFi reads.
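Contig N50, the contiguity metric quoted above, is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal Python sketch, using made-up contig lengths rather than data from any assembly mentioned here:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the contig
    at which the cumulative sum first reaches at least half of the total length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (arbitrary units); the total is 290, so half is 145
toy_lengths = [40, 80, 20, 70, 50, 30]
print(n50(toy_lengths))  # 70, because 80 + 70 = 150 is the first running total >= 145
```

A higher N50 means more of the assembly sits in long, uninterrupted contigs, which is why the metric is commonly reported alongside completeness and base accuracy.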
Learn how HiFi reads help scientists unlock new discoveries.
Scaffolds are created by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome. Contigs in a scaffold are separated by gaps, which are designated by a variable number of ‘N’ letters. Scaffolding is often used for short-read assemblies to make sense of the fragmented genome assemblies containing short contigs. However, scaffolds have three principal deficiencies:
- Scaffolds miss critical information. Gaps represent missing genomic information and, in many cases, these gaps can coincide with important genomic loci. Many promoters and first exons are GC-rich in sequence, often resulting in missing or low-quality sequence reads from short-read or Sanger sequencing. Thus, genes are incompletely resolved, and their regulation cannot be understood. Another reason for gaps in scaffolded assemblies is large, repetitive elements, which short-read sequencing methods struggle to bridge. Thus, duplicated genes, genes vs. pseudogenes, short tandem repeats, variable number tandem repeats, microsatellites, and many other structural genomic features are often unresolved in scaffolded short-read assemblies. As summarized in a Nature Reviews Genetics article, long-read sequencing technologies, and specifically HiFi reads, help overcome these types of complex regions to give a complete picture of genetic variation, including in regions previously thought to be intractable like telomeres and centromeres.
- The length of a scaffold gap often has no relation to the true gap size. In several reference genomes, gaps are arbitrarily set to certain fixed lengths. For example, most gaps in the zebra finch reference are set to 100 Ns, while in the version 3 maize reference they are set to 1,000 Ns. This means that in most cases, the true length of sequence represented by the gap differs from the set gap size, and is sometimes off by thousands of bases. This uncertainty in gap sizes makes it impossible to understand the true spatial relationships of functional elements in genomes, and the fixed gap sizes can underestimate the actual extent of missing information. More recently, those older reference assemblies have benefited from PacBio long-read sequencing – see the latest: zebra finch and maize.
- Gap-flanking scaffold sequence can be low-quality, and is sometimes completely wrong. The sequences surrounding gaps often fall into areas where short-read technologies have deficiencies due to GC-bias or read-length limitations. This can result in sequence that is of lower quality and, in some cases, completely erroneous. For example, because of complex repeat structures in the human IGH locus, the right edge of a 50,000 N gap in the short-read assembly contains 1,836 bases of flanking sequence that has no support in the hg19 human genome reference or the PacBio assembly. In some ways, having incorrect flanking sequence in scaffolds is worse than having ‘N’ gaps, since that erroneous sequence is treated as real and carried into downstream analyses.
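To make the relationship between scaffolds, contigs, and ‘N’ gaps concrete, here is a minimal Python sketch that splits a scaffold back into its contigs and reports the length of each gap. The function names (`split_scaffold`, `most_common_gap`) and the toy sequence are our own illustrative assumptions, not part of any PacBio tool. A gap length that repeats many times across an assembly (for example, 100 Ns) suggests a placeholder rather than a measured distance, though this simple check cannot prove it.

```python
import re
from collections import Counter


def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into contigs and gap lengths.

    A gap is any run of at least `min_gap` consecutive 'N' or 'n' characters.
    Returns (contigs, gaps), both in order of appearance.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs, gaps = [], []
    pos = 0
    for match in gap_pattern.finditer(scaffold):
        if match.start() > pos:  # sequence between the previous gap and this one
            contigs.append(scaffold[pos:match.start()])
        gaps.append(match.end() - match.start())
        pos = match.end()
    if pos < len(scaffold):  # trailing contig after the last gap
        contigs.append(scaffold[pos:])
    return contigs, gaps


def most_common_gap(gaps):
    """Return (gap_length, count) for the most frequent gap length, or None."""
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0]


# Toy scaffold (hypothetical): four contigs joined by gaps of 100, 100 and 7 Ns.
toy = "ACGT" + "N" * 100 + "GGCC" + "N" * 100 + "TTAA" + "N" * 7 + "AC"
contigs, gaps = split_scaffold(toy)
print(len(contigs), gaps)      # 4 [100, 100, 7]
print(most_common_gap(gaps))   # (100, 2)
```

Note that the sketch only describes what the scaffold file says. It cannot recover the true gap size or detect erroneous flanking sequence; that requires reads that actually span the region.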
Illustration of the difference between contigs and scaffolds in genome assemblies
The information missing from gapped scaffold assemblies complicates, and may preclude, downstream analysis in functional and comparative genomics. Scaffolded short-read assemblies get nowhere near the quality of PacBio genome assemblies in terms of contiguity and completeness, and they often require labor-intensive follow-up work to close gaps, adding time and cost to projects.
Scaffolding PacBio assemblies for chromosome-scale genome representations
For even longer-range genomic connectivity, e.g. to bridge the largest segmental duplications and repeat regions, researchers can go a step further by adding scaffolding information to a PacBio assembly, often resulting in telomere-to-telomere, chromosome-scale genome representations. Several methods have been demonstrated to work very well for this purpose, including optical mapping and crosslinking approaches. Check out examples of barn swallow, insects, and human genome sequencing to see how chromosome-level scaffolding enables more comprehensive insights.
There are numerous large international initiatives using PacBio long-read sequencing to produce high-quality, phased, chromosome-level genome assemblies of many organisms:
- Vertebrate Genomes Project
- Sanger 25 Genomes Project
- Darwin Tree of Life Project
- NHGRI Human Pangenome Reference Initiative