This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
For scientists who utilize DNA sequencing in their research but are not experts in the underlying technology, it can be difficult to determine the accuracy of sequencing results — and even harder to compare accuracy across sequencing platforms. Furthermore, accuracy differs not only between technologies but also across genomic regions as some stretches of the genome are inherently more difficult to read.
It is critically important to understand accuracy in DNA sequencing to distinguish important biological information from sequencing errors.
What are the Types of Sequencing Accuracy?
There are two key types of accuracy in DNA sequencing technologies: read accuracy and consensus accuracy. Read accuracy describes how often the individual measurements (reads) from a DNA sequencing technology are correct; it is the complement of the technology’s inherent per-read error rate. Typical read accuracy ranges from ~90% for traditional long reads to >99% for short reads and HiFi reads.
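Accuracy percentages are often quoted on the Phred quality scale, where Q = -10 × log10(error rate). The short sketch below converts between the two; the function names are illustrative and not taken from any PacBio tool.

```python
import math

def accuracy_to_phred(accuracy):
    """Convert an accuracy between 0 and 1 into a Phred-scaled quality value."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0:
        raise ValueError("accuracy must be below 1.0 to give a finite Phred score")
    return -10.0 * math.log10(error_rate)

def phred_to_accuracy(q):
    """Convert a Phred quality value back into an accuracy between 0 and 1."""
    return 1.0 - 10 ** (-q / 10.0)

for acc in (0.90, 0.99, 0.999):
    print(f"{acc:.1%} accurate -> Q{accuracy_to_phred(acc):.0f}")
```

On this scale, 90% accuracy corresponds to Q10, 99% to Q20, and 99.9% to Q30, so each additional 10 points means ten times fewer errors.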
Consensus accuracy, on the other hand, is determined by combining information from multiple reads in a data set, which averages out random errors in individual reads. Deeper coverage, meaning more reads from which to build a consensus, generally increases the accuracy of results. However, there are still limitations to calling consensus from multiple reads. Consensus calculation is a complicated and computationally expensive process, and it cannot overcome systematic errors. If a sequencing platform consistently makes the same mistake, that mistake will not be erased by generating more sequencing coverage.
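The difference between random and systematic errors can be illustrated with a toy simulation. This is a minimal sketch, not how production consensus callers work: it assumes substitution errors only, perfectly aligned reads, a simple per-position majority vote, and an extreme systematic error that every read reproduces. All names and parameters are hypothetical.

```python
import random
from collections import Counter

def simulate_reads(truth, n_reads, random_error, systematic_sites, rng):
    """Return noisy copies of `truth`.

    random_error: per-base chance of a random substitution.
    systematic_sites: dict of position -> wrong base reported by every read.
    """
    reads = []
    for _ in range(n_reads):
        read = []
        for i, base in enumerate(truth):
            if i in systematic_sites:
                read.append(systematic_sites[i])       # same mistake every time
            elif rng.random() < random_error:
                read.append(rng.choice([b for b in "ACGT" if b != base]))
            else:
                read.append(base)
        reads.append("".join(read))
    return reads

def majority_consensus(reads):
    """Pick the most common base in each column of the aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(42)
truth = "".join(rng.choice("ACGT") for _ in range(200))
wrong_base = {"A": "C", "C": "G", "G": "T", "T": "A"}
systematic = {pos: wrong_base[truth[pos]] for pos in (50, 120)}

reads = simulate_reads(truth, n_reads=30, random_error=0.10,
                       systematic_sites=systematic, rng=rng)
consensus = majority_consensus(reads)

print("errors in first read:", mismatches(reads[0], truth))
print("errors in consensus: ", mismatches(consensus, truth))
```

In this toy setup the random errors disappear from the consensus, while the two systematic positions remain wrong no matter how many reads are added.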
To sidestep this problem, it is common to “polish” long reads that have systematic errors with high accuracy short reads. However, because of their read length, short reads cannot always map to the long reads unambiguously, limiting their ability to improve accuracy. In general, consensus is improved – and vastly simplified – by starting with highly accurate reads with no systematic biases.
How Does Accuracy Impact the Utility of Sequencing Data?
It is commonly known that certain genomic regions are more difficult for sequencers to get through than others. Centromeres and telomeres are notoriously tough because of the highly repetitive sequence they contain. Regions that are AT-rich or GC-rich are similarly difficult because they respond poorly to the amplification protocols required by some platforms. Palindromic sequences or hairpin structures are difficult to denature, making such regions challenging for sequencing tools that include a denaturation step.
Many scientists avoid these problems by opting for a single-molecule sequencing method that does not require amplification or denaturation, such as PacBio’s SMRT Sequencing technology. Because SMRT Sequencing can process even difficult regions, performing uniformly regardless of sequence context, it generates accurate results even in regions that would flummox other platforms. Selecting a platform without systematic bias, like the Sequel II System, is important to producing the most accurate sequence data.
The accuracy of a genome assembly goes beyond the accuracy of each individual base. Even perfect reads can contribute to poor accuracy if they are not ordered and oriented correctly in the assembly. This question of where to place the read is called mappability.
Reads containing only a piece of a large structural element, or consisting of highly repetitive sequences, can be very difficult to align, mapping ambiguously to many different locations in a reference. This is where short reads really struggle; because of their size, there is a greater chance that they will not contain enough unique sequence data to anchor them properly in a genome. Since HiFi reads stretch across many kilobases of DNA, they almost always contain unique flanking sequences that can be used to map them accurately in an assembly.
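The mappability argument can be sketched with a toy reference that contains the same repeat twice. The sequences below are invented for illustration, and exact string matching stands in for a real aligner.

```python
def find_all(reference, query):
    """Return every start position where `query` matches `reference` exactly."""
    hits = []
    start = reference.find(query)
    while start != -1:
        hits.append(start)
        start = reference.find(query, start + 1)
    return hits

# Toy reference: two copies of one repeat, each flanked by unique sequence.
repeat = "ACGTACGTACGT" * 3
reference = "TTGACCA" + repeat + "GGCATTC" + repeat + "CAGTTGA"

short_read = repeat[:20]                   # lies entirely inside the repeat
long_read = "GACCA" + repeat + "GGCAT"     # spans the repeat plus both flanks

print("short read placements:", len(find_all(reference, short_read)))
print("long read placements: ", len(find_all(reference, long_read)))
```

The short read fits in many places, whereas the long read carries unique flanking sequence on both sides and has a single placement.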
When exploring diploid or polyploid genomes, phasing means separating the different copies of each chromosome (e.g. maternal and paternal for diploid), known as haplotypes. With sufficient accuracy, the identity of nucleotides at each position in the genome can be compared with a reference sequence to identify SNVs, with a heterozygous locus indicating a difference in sequence between a homologous chromosome pair. This is where the inherently low accuracy of traditional error-prone long reads becomes a limitation: with a high error rate, it is impossible to decide whether a disagreement between a reference and a data set is a true variant or a sequencing error.
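A rough back-of-the-envelope calculation shows why per-read accuracy matters here. The read length (15 kb) and the heterozygosity rate (one heterozygous SNV per 1,000 bases) are illustrative assumptions, not figures from this post.

```python
def expected_mismatches(read_length, accuracy, het_rate):
    """Expected sequencing errors and true heterozygous sites in one read."""
    errors = read_length * (1.0 - accuracy)
    het_sites = read_length * het_rate
    return errors, het_sites

READ_LENGTH = 15_000   # assumed read length in bases
HET_RATE = 0.001       # assumed: one heterozygous SNV per 1,000 bases

for accuracy in (0.90, 0.999):
    errors, het_sites = expected_mismatches(READ_LENGTH, accuracy, HET_RATE)
    real_share = het_sites / (errors + het_sites)
    print(f"accuracy {accuracy:.1%}: ~{errors:.0f} errors vs "
          f"~{het_sites:.0f} real differences per read "
          f"({real_share:.0%} of mismatches are real)")
```

Under these assumptions, only about 1% of the mismatches in a 90%-accurate read reflect real heterozygous sites, compared with about half in a 99.9%-accurate read.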
Another approach to obtain phase information is to also sequence the parents of the individual whose genome you need phased. However, in many wild species where the parents aren’t available, a highly accurate long-read sequencing approach, like HiFi sequencing, would be simpler. There are also computational methods (learn about Nighthawk) or the use of population haplotype frequency information to infer phasing.
Overall, phased genomes or variant calls are higher quality than haplotype-collapsed versions because they provide allelic information, which can be important for studying human diseases, crop improvement, evolution, and more. HiFi reads, with accuracy high enough to detect SNVs and read lengths long enough to link these SNVs over many kilobases, generate larger phased haplotype blocks.
As scientists analyze more and more genomic data, the role of sequence accuracy will likely only become more important. HiFi reads offer the benefits of high accuracy equivalent to short-read sequencing data, but with the length necessary for complex genome assemblies and phasing of variants across large swaths of the genome.
The strides scientists have made in rare disease research lately are truly impressive. For an overview of recent progress, we encourage you to check out a new article in The Pathologist from our own Luke Hickey (@Luke_Hickey), Senior Director of Strategic Marketing. It describes how scientists have used long-read sequencing to find the genetic explanations for elusive rare diseases.
“Never before have our laboratory techniques been so successful at identifying rare diseases and elucidating their underlying biological causes,” Hickey writes. “The knowledge we obtain today opens the door to new treatments, giving hope to people who suffer from these rare disorders.”
The educational article offers a look at how various types of sequencing have been important for solving rare diseases, including an overview of the types of variants that can be accessed by short-read platforms (typically limited to single nucleotide variants and small indels) or by long-read SMRT Sequencing (all structural variants, even very long elements such as repeat expansions).
Despite recent progress, at least half of known rare disease cases have not been solved with short-read sequencing or older techniques. That’s why so many scientists are now embracing SMRT Sequencing as a new application in the field, with early successes showing the tremendous promise of this approach. “In just the past few years, researchers have used SMRT whole genome sequencing to solve previously intractable rare diseases — and other significant efforts are now underway,” Hickey writes.
The article recaps some of those successes, such as Euan Ashley’s (@eaunashley) Carney complex work and Naomichi Matsumoto’s work on a pathogenic deletion in cases of myoclonic epilepsy. It also provides a look at some large-scale efforts applying PacBio sequencing to rare disease studies, including SOLVE-RD and CSER.
“Large-scale programs like these should contribute a significant amount of new knowledge about the genetic mechanisms underlying rare disease, filling in many of the gaps in our understanding today,” Hickey concludes. As SMRT Sequencing “helps to explain more rare diseases and increase overall diagnostic yield, it should have a profound effect on our ability to diagnose, understand, and ultimately improve treatment for rare disease cases.”
To hear the latest in rare disease research, register to attend our upcoming webinar on May 27: Increasing Solve Rates for Rare and Mendelian Diseases with Long-read Sequencing.
It’s been more than a year since we introduced HiFi sequencing to generate highly accurate long reads. In that time, we’ve seen many PacBio users make HiFi sequencing their go-to setting because it’s simple, reliable, and cost-effective. For scientists who have yet to generate their own HiFi data, we thought it might be helpful to publish a few data sets for exploration and analysis.
In a new preprint, we have released HiFi data sets for five samples: mouse, frog, maize, strawberry, and a mock metagenome community. We like to think there’s a data set for everyone here, whatever your research area of interest! Working with any of these HiFi read collections should offer a great introduction to this sequencing mode and show you why we often hear how easy it is to analyze HiFi data compared to traditional long reads.
Consistent with previous reports, the HiFi data generated for these five organisms yielded excellent accuracy, with average read qualities ranging from 99.84% to 99.97%. With that kind of accuracy, we look forward to seeing what interesting biology our collaborators find within this data.
| Organism | SRA | HiFi Data Yield (Gb) | Average Read Length (kb) | Average Read Quality |
The five HiFi data sets generated on the Sequel II System
In addition to letting scientists get a fresh look at HiFi data, we hope this release will encourage development of new applications and software for the benefit of the entire sequencing community. New and improved tools for assembling polyploid genomes or calling variants in non-model organisms are just a couple of areas we hope to see grow.
For those of you who want to use existing software to explore these datasets, here are some tools that we find useful for working with HiFi reads:
- Assembly: FALCON, hifiasm, HiCanu (Check out this in-depth overview of HiFi assemblers from @Magdoll)
- Variant detection: DeepVariant for SNVs and small indels (<50 bp) and pbsv for SVs (≥50 bp)
- Metagenomics: Canu for assembly, FragGeneScan for gene prediction, and MEGAN for taxonomic and functional profiling
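The size thresholds in the variant detection item above can be expressed as a small helper. This is only an illustration of the 50 bp cut-off and the tool pairing stated in the list. The function and dictionary names are hypothetical, it works on plain REF/ALT strings, and it does not handle symbolic alleles such as those found in some VCF files.

```python
SV_MIN_LENGTH = 50  # bp; threshold quoted in the list above

# Tool pairing as stated in the list above.
CALLER_FOR_CLASS = {
    "SNV": "DeepVariant",
    "small indel": "DeepVariant",
    "SV": "pbsv",
}

def classify_variant(ref, alt):
    """Classify a variant by comparing the lengths of its REF and ALT alleles."""
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "other substitution"
    size = abs(len(ref) - len(alt))
    return "SV" if size >= SV_MIN_LENGTH else "small indel"

examples = [("A", "G"), ("A", "A" + "T" * 12), ("A", "A" + "T" * 300)]
for ref, alt in examples:
    kind = classify_variant(ref, alt)
    caller = CALLER_FOR_CLASS.get(kind, "not covered by the list")
    print(f"{len(ref)} bp -> {len(alt)} bp: {kind} ({caller})")
```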
For this data release, we’d like to thank all of the collaborators who helped to generate and present these results: Jane Landolin (@jlandolin), Nicholas Maurer, David Kudrna, Michael Hardigan, Cynthia Steiner, Steven Knapp (@knapp1955), Doreen Ware, and Beth Shapiro (@bonesandbugs).
And congratulations to the PacBio team members who led the charge on this effort: Ting Hon, Kristin Mars, Greg Young (@PacbioGreg), Yu-Chih Tsai, Joseph Karalius (@JoeyKaralius), Paul Peluso, and David Rank.
What is a Pangenome?
Unless you have an identical twin, no other person has a genome that is identical to yours. The same is true for other animal, plant, and microbial species that reproduce sexually: the genomes of individuals are unique. Less well known, but equally true, is that individual members of a species do not always share even the exact same genes. Nevertheless, scientists mostly use a single reference genome to represent an entire species: one human genome, one maize genome, one Staphylococcus aureus genome.
The Coining of the “Pangenome”
Around 2005, geneticists started to explore the concept of the pangenome, originally defined as the entire set of genes possessed by all members of a particular species and then extended to refer to a collection of all the DNA sequences that occur in a species.
It started with bacteria, as many things do. Genomic processes such as recombination, the movement of mobile genetic elements, and horizontal gene transfer were clearly contributing to individual diversity across the bacterial domain. Some scientists discovered dozens, if not hundreds, of unknown genes when they sequenced new strains.
In 2007, MIT microbiologist Sallie Chisholm (@ChisholmLab_MIT) set out to determine the extent of genetic variation in the marine cyanobacterium Prochlorococcus. Each strain contains approximately 2,000 genes, and Chisholm estimated that a pangenome for Prochlorococcus would be around 6,000 genes, based on an initial set of 12 genome sequences. Eight years later, with 45 strains sequenced, she revised that estimate up to at least 80,000 genes (around four times the number of genes in the human genome), with the core genome for the species comprising only about 1,000 genes, or less than 2 percent of the total gene pool.
“That’s a lot of information shaping that collective,” Chisholm told The Scientist. “[The pangenome view] changes the way you think about what an organism is.”
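In terms of gene sets, the pangenome is the union of the genes found in all strains, and the core genome is their intersection. The sketch below uses made-up strain and gene names to show the idea, then checks the "less than 2 percent" figure from the estimates quoted above.

```python
# Hypothetical gene content for three strains (names are placeholders).
strains = {
    "strain_A": {"gene1", "gene2", "gene3", "gene4"},
    "strain_B": {"gene1", "gene2", "gene5"},
    "strain_C": {"gene1", "gene2", "gene3", "gene6"},
}

pangenome = set().union(*strains.values())       # genes seen in any strain
core = set.intersection(*strains.values())       # genes shared by every strain
accessory = pangenome - core                     # genes missing from some strain

print("pangenome size:", len(pangenome))
print("core genome:   ", sorted(core))
print("accessory:     ", sorted(accessory))

# Core fraction using the Prochlorococcus estimates quoted in the text.
core_fraction = 1_000 / 80_000
print(f"core genes as a share of the pangenome: {core_fraction:.2%}")
```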
Why is it Important to Capture the Full Range of Genetic Diversity?
Those looking to create vaccines need to understand the genomic variation and versatility of disease-causing microbes, especially if they are hoping to develop universal vaccines that could provide protection against more than one strain in a species.
Those studying adaptation to climate change would benefit from a comparison of genes absent or in abundance within species found in different geographic locations and/or environmental conditions. In crop plants, differences in variable genes could have implications for disease resistance, metabolite production, and stress responses.
And with differences in gene number increasingly being associated with disorders including autism, Parkinson’s disease, and Alzheimer’s disease, there are strong medical justifications for taking a more variation-centric view of the human species. Variants cannot be identified within regions completely missing from the reference sequence, many of which have been found to be more common than previously thought.
What is Being Done to Generate Pangenomes?
To answer this question, we sat down with a few scientists to talk about the era of the pangenome and what’s to come.
One particularly important crop that has haunted geneticists and breeders for years is maize. It is challenging to sequence because the vast majority of its 2.3 Gb genome, a staggering 85 percent, is made up of highly repetitive transposable elements. Maize is also incredibly diverse in its DNA makeup. As an example, a study comparing genome segments from two inbred lines revealed that half of the sequence and one-third of the gene content were not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit around 94 percent sequence similarity.
“The whole notion of a single reference genome for crop plants is an antiquated concept borne out of necessity from the technological limitations of the past. Now with the capability to rapidly generate high-quality references for even the largest crop genomes, we can readily access the full complement of sequence diversity and structural variation within a crop,” says Kevin Fengler, Comparative Genomics Lead at Corteva Agriscience.
So the field was delighted when a collective of 33 scientists released a 26-line maize pangenome reference collection earlier this year. The collection was created using PacBio sequencing, and includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines. These include the most extensively researched maize lines that represent a broad cross section of modern maize diversity, as well as an additional line containing an abnormal chromosome 10.
It turns out it’s not just maize biology that can be informed by pangenomes. “The high level of diversity in maize is well known, but we see a lot of diversity and structural variation underlying traits of interest in all the crop plants we work on. Creating the first reference genome for a crop genome is a great first step, but things get really interesting as you begin to add more genomes and a more comprehensive view emerges,” adds Fengler.
As for our own species, the current reference genome (GRCh38) – an update of the genome produced by the international Human Genome Project in 2000 and based mostly on DNA from one person – has been added to and annotated through the years, but is still an incomplete sequence and woefully inadequate as a representation of human diversity and genetic variation. Scientists estimate that up to 40 megabases of sequence, including protein-coding regions, are absent from the reference genome.
Several studies using PacBio long reads have reported an average of ~20,000 structural variants (SVs) per human genome, most of which fall within repetitive elements and segmental duplications. Furthermore, the reference genome does not represent the diploid structure of human genomes. Rather, it is an arbitrary linear combination of different haplotypes, or a mosaic of multiple individuals.
Several groups have undertaken efforts to ensure certain populations are better represented in genomic databases, from Sweden to Tibet to Japan. Check out an interactive map of human genomes generated with PacBio sequencing.
When asked about the value a pangenome could bring to human research, Fritz Sedlazeck (@sedlazeck), Assistant Professor at Baylor College of Medicine, said, “the pangenome has the potential to represent the diversity of the human population or any species. This eases the re-identification of complex alleles or even haplotypes.”
And it seems the National Human Genome Research Institute agrees, recently committing $30 million towards the creation of a new human pangenome based on high-quality sequencing of 350 individuals from across the human population, to capture all genomic variation observed in human populations.
“One human genome cannot represent all of humanity. The human pangenome reference will be a key step forward for biomedical research and personalized medicine. Not only will we have 350 genomes representing human diversity, they will be vastly higher quality than previous genome sequences,” said David Haussler, director of the University of California Santa Cruz Genomics Institute, which is leading the project.
How to Generate a Pangenome?
So, what are the most important things to keep in mind when creating a pangenome reference?
First, Fengler says that being able to be confident in your results is really important. “Ideally, all of the references in the pangenome collection will be built with a similar recipe to enable direct comparisons without artifacts from different technologies.” This points to the need for a reliable technology that can be used to generate equivalent quality genomes for many samples with little variability.
Second, the data must be high quality. When asked about the importance of long reads to pangenome efforts, Sedlazeck said, “they will be important to distinguish between different alleles/paths in the graph and to characterize novel mutations. Thus, being able to cope with graphs that encode a much higher number of variations to better represent the population.” Along those lines, Fengler adds, “the approach for assembly needs to be robust and accurate such that mis-assembly and sequence errors are not interpreted as structural variation and sequence diversity.”
Lastly, cost and speed have to be taken into account. With the high accuracy of HiFi sequencing, only 10- to 15-fold coverage per haplotype is needed for a high-quality genome assembly, and the analysis time can be cut in half.
“Now researchers no longer need to wait for actionable sequence data,” says Fengler. “For maize, we can generate a high-quality reference genome the same day that the sequencing finishes.”
What’s Next in Pangenomes?
As pangenome collections grow, scientists have to tackle questions around how to represent a pangenome. “Which variations should be included into a pangenome? Is it all of them? Then you lose specificity in regions. Is it only the common variations? Then you have a problem with disease-causing variations and other complex regions like HLA,” asks Sedlazeck, highlighting the continued work that needs to be done.
In addition, tackling things like annotation, visualization, and relationship management are on Fengler’s mind. “A variety of new pangenome analysis and visualization tools are needed to fully realize the value of having a pangenome collection for each crop.”
And then we have to move into functional and translational analysis. Scientists need to be able to take their newfound understanding of variation at the genome level and see how it impacts phenotypes, and whether the variation can be introduced artificially to influence agronomic traits, for instance.
One thing is for sure: the pangenome era is upon us. Whether you need a pangenome to understand important traits or you build tools to interpret those traits, there will be plenty to work on in the coming years!
Will the next big cancer breakthrough be in immunotherapy? Therapeutic modification of the tumor microenvironment or microbiome? Or early detection and screening?
Whatever the result, long-read sequencing technology can play a pivotal part in the discovery process, according to Meredith Ashby, PacBio’s director of Market Strategy for Microbial Genomics, Cancer and Immunology.
In a recent article for Lab Compare, Ashby highlighted some of the ways Single Molecule, Real-Time (SMRT) Sequencing has given researchers a deeper understanding of tumors at the genomic and transcriptomic level.
By spanning very large structural variants in single reads, SMRT Sequencing can provide clarification in cases where variants may be acting in concert to affect treatment response. Ashby describes recent work where scientists at Memorial Sloan Kettering Cancer Center and other institutes used SMRT Sequencing to explore why certain patients are “super responders” to alpelisib, a targeted PI3Kα inhibitor. The ability to phase all variants along PIK3CA transcripts revealed that patients with distal mutations in cis showed remarkably improved response to therapy, as compared to those who had only a single mutation, or whose mutations were present in trans.
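The cis/trans distinction comes down to whether two mutations are observed on the same molecule. A minimal sketch of that logic follows, assuming each full-length read has already been reduced to the set of mutations it carries. The mutation labels, the `min_reads` threshold, and the decision rule are all hypothetical simplifications, not the method used in the study described above.

```python
def phase_pair(read_mutations, mut_a, mut_b, min_reads=3):
    """Decide whether two mutations sit on the same molecule (cis) or not (trans).

    read_mutations: one set of mutation labels per full-length read.
    """
    both = sum(1 for m in read_mutations if mut_a in m and mut_b in m)
    only_a = sum(1 for m in read_mutations if mut_a in m and mut_b not in m)
    only_b = sum(1 for m in read_mutations if mut_b in m and mut_a not in m)

    if both >= min_reads and both > only_a + only_b:
        return "cis"
    if only_a >= min_reads and only_b >= min_reads and both < min(only_a, only_b):
        return "trans"
    return "undetermined"

# Toy data: placeholder labels rather than real variants.
cis_sample = [{"mutA", "mutB"}] * 5 + [set()] * 5
trans_sample = [{"mutA"}] * 5 + [{"mutB"}] * 5

print(phase_pair(cis_sample, "mutA", "mutB"))
print(phase_pair(trans_sample, "mutA", "mutB"))
```

Short reads that cover only one of the two sites would all land in the `only_a` or `only_b` counts, which is why reads spanning both positions are needed to make this call directly.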
Long reads can also help determine cancer risk status involving ‘hard to sequence’ genes which have highly homologous, inactive pseudogenes. For example, long reads can distinguish SNVs, indels and larger rearrangements in PMS2 from those in the inactive pseudogene, PMS2CL.
Finally, generating full-length transcripts via the Iso-Seq method can disambiguate isoforms to provide insights into cancer biology or serve as better biomarkers. Long reads can reveal cryptic exons, retained introns, and other splicing changes that are often cancer-specific and therefore may be missing from the gene models typically used to aid short-read transcript assembly, Ashby noted. Ashby cited an example where targeted SMRT Sequencing of androgen receptor (AR) isoforms revealed that the structure of AR-V9 was previously mischaracterized, and that the corrected isoform information could be used to improve the prediction of drug resistance in prostate cancer.
“While a decade-plus of short-read data has produced truly exciting information for the cancer research community, there is much more to be learned simply by looking through a different lens,” she said.
The COVID-19 pandemic has brought a sudden urgency to virus research and led many of us to dig more deeply into all the tools available for characterizing viral genomes, from RT-PCR to DNA sequencing. For all their outsized impact on human health, viruses have remarkably small and simple genomes, some just a few thousand bases in length, and most lacking any repetitive structures. With such tidy genomes, you may wonder, why would scientists want to sequence them with a long-read technology like PacBio HiFi reads?
While it is true that most viral genomes do not require long reads for assembly, viruses exist as populations within infected hosts, and long reads are a powerful tool for fully characterizing these populations. Depending on the mutation rate of the virus, the population can consist of one dominant variant with only a very small proportion of rare variants, or it can be composed of a highly diverse set of closely related variants, called a quasispecies.
Highly accurate, single-molecule sequencing allows researchers to fully characterize all the variants within a viral population, as opposed to just the dominant variant. Oftentimes this more detailed view of a complex population of viruses reveals important aspects of biology, including how a viral infection evolves over time or in response to therapeutics.
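When each read covers a whole viral gene or genome, characterizing the population can be as simple as counting distinct sequences. The sketch below uses invented sequences and a hypothetical `min_count` filter to set aside sequences seen only once, which in practice may be errors rather than real variants. Real workflows involve considerably more careful error handling.

```python
from collections import Counter

def variant_frequencies(reads, min_count=2):
    """Return the frequency of each distinct full-length sequence.

    Sequences seen fewer than `min_count` times are left out of the result,
    but still count toward the total number of reads.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.most_common() if n >= min_count}

# Toy population: one dominant variant, two minor variants, one singleton.
dominant = "ATGGCTAGCTAA"
minor_1 = "ATGGCTAGTTAA"
minor_2 = "ATGACTAGCTAA"
singleton = "ATGGCTCGCTAA"
reads = [dominant] * 89 + [minor_1] * 8 + [minor_2] * 2 + [singleton]

for seq, freq in variant_frequencies(reads).items():
    print(f"{seq}  {freq:.0%}")
```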
Here are some examples of how scientists have used SMRT Sequencing to explore viral genomes, and the highlights of their work.
Influenza: While the influenza virus continues to evade efforts to produce a universal vaccine, HiFi sequencing has given scientists a clearer picture of the dynamics of flu virus evolution. In one study, researchers sequenced multiple samples from a single patient with a two-year recurrent infection and revealed in great detail how the virus adapted in response to flu treatments. In another population-scale study tracking a flu pandemic in Hong Kong, PacBio sequencing revealed that transmission was enabled in part by minor strain variants.
In addition, scientists have used long reads to analyze large deletions in the influenza genome that characterize viruses incapable of replicating. Most recently, a study combining PacBio long reads and single-cell sequencing gave a comprehensively detailed view of how influenza mutations evolved over the course of a typical infection, revealing both point mutations and indels.
Hepatitis C Virus: Among the most significant challenges in HCV treatment is drug resistance. To better understand how resistance arises, scientists have deployed SMRT Sequencing to study HCV evolution in individuals who failed to respond to antiviral therapy. By using HiFi sequencing to obtain single reads encompassing entire clones, they were able to detail how multidrug-resistant variants arose from low-abundance, drug-resistant clones present at baseline.
In another example, researchers generated long-read data to produce full-length HCV envelope sequences, which allowed them to track the transmission path for a sexually transmitted cluster of HCV infections. They also reported how viral genetic diversity changed over the course of an infection, appearing low during the acute stage but increasing over time.
HIV: An ongoing area of research in HIV is how the virus evolves and persists in patients on long-term antiretroviral therapy (ART). With SMRT Sequencing, scientists have generated full-length viral genome sequences from proviruses to study what proportion of the latent reservoir is replication competent, and what types of mutations are favored in this reservoir under ART. A separate effort involved analyzing proviral sequences in the brain and other tissues to understand HIV-associated dementia. HiFi reads allowed the researchers to create a detailed phylogenetic tree of all the variants within an individual, and revealed that variants in the brain were distinct in important ways and absent from other parts of the body.
SARS-CoV-2: Recently, researchers at Mt. Sinai published a study using genetic drift in the SARS-CoV-2 virus to determine when and how the virus arrived in New York City. They found that the virus had arrived from Europe and the West Coast multiple times, though not from China, and it arrived significantly earlier than recognized. Other researchers have developed multiple long-amplicon protocols for targeted sequencing of the novel coronavirus, which may enable unique insights into the biology of SARS-CoV-2 as the pandemic, and our understanding of it, evolves.
To learn more about how to sequence viral genomes, explore our COVID-19 sequencing tools and resources or review our sample prep and analysis workflows for resolving viral populations.
Learn how you can use HiFi reads to enable your viral research, from understanding viral genomes to the host immune response.
California redwoods: Not only are they giants in height and age (up to 379 feet high, 29 feet round, and thousands of years old), but the famous towering trees are also derived from a massive 27 Gb genome.
Seeking a sequencing challenge for the Sequel II System, we picked the California redwood, or Sequoia sempervirens as it’s known to scientists. There also happened to be several fine specimens at nearby Stanford University.
A small crew of PacBio scientists, Emily Hatas (@EmilyHatas), Greg Young (@PacbioGreg), and Michelle Vierra (@the_mvierra), headed to campus equipped with ice, scissors, and a kitchen scale to acquire samples. DNA was isolated (using the Circulomics Plant Nuclei kit), a HiFi library was created, and sequencing got underway. In just seven days, the team achieved 22-fold coverage of the genome (606 Gb of HiFi data). Six days later, Greg Concepcion (@phototrophic) generated a partially haplotype-resolved genome assembly almost twice the expected genome size, with a contig N50 of 1.92 Mb.
“The results were amazing,” Vierra said. “We are very pleased to see the improvements that this genome assembly represents over other recent conifer genomes.”
The massive genome was put together in just 17 days (4 days of sample prep, 7 days of sequencing, and 6 days for assembly) and was detailed in a Medium post by Vierra.
But the team wasn’t done yet. As a general recommendation, 10- to 15-fold coverage per haplotype in HiFi reads is the ideal range to yield a genome that measures up favorably in the 3 C’s of genome quality.
For genomes of the California redwood’s size, it may not be economical or feasible to obtain that much coverage in a limited timeframe, which is why the team opted to see what a reasonable ~20-fold coverage could generate. However, with a little more time on their hands, and enough HiFi library to go around, the team embarked on more sequencing to bring the total to 875 Gb of HiFi data representing 33-fold coverage of the genome.
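Fold coverage is simply the sequencing yield divided by the genome size. The sketch below checks the figures quoted in this post against the 27 Gb genome size mentioned earlier. The helper names are illustrative, and the results are approximate because they depend on the genome size estimate used.

```python
def fold_coverage(yield_gb, genome_size_gb):
    """Average depth of coverage for a given yield and genome size (both in Gb)."""
    return yield_gb / genome_size_gb

def yield_needed(target_coverage, genome_size_gb):
    """Sequencing yield (in Gb) needed to reach a target fold coverage."""
    return target_coverage * genome_size_gb

GENOME_SIZE_GB = 27  # California redwood genome size quoted in this post

for yield_gb in (606, 875):
    depth = fold_coverage(yield_gb, GENOME_SIZE_GB)
    print(f"{yield_gb} Gb over {GENOME_SIZE_GB} Gb -> {depth:.1f}-fold")

print(f"Yield for 20-fold coverage: {yield_needed(20, GENOME_SIZE_GB)} Gb")
```

With a 27 Gb estimate, 606 Gb works out to about 22-fold and 875 Gb to a little over 32-fold, in line with the rounded figures in the text.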
Not surprisingly, more of a good thing (HiFi reads) made an even better genome assembly. The contiguity improved significantly, with a contig N50 of 3.8 Mb, and completeness increased, with a BUSCO completeness score of almost 61%.
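For readers unfamiliar with the metric, contig N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the calculation, using made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")

toy_contigs = [80, 70, 50, 40, 30, 20]   # hypothetical lengths, e.g. in kb
print("N50:", n50(toy_contigs))
```

Here the two longest contigs (80 and 70) already cover more than half of the 290 total, so the N50 is 70.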
Overall, what would have been considered a herculean effort not that many years ago was accomplished in only a few weeks by a handful of personnel in their spare time. It’s our hope that with the increasing adoption of PacBio HiFi reads we will continue to see massive improvements in the assembly of all genomes, including large and increasingly complex polyploid plants.
The additional data for the redwood genome has been made publicly available along with the updated genome assembly and can be found here. We hope it will be a useful tool for conifer researchers everywhere!
Comparison of California redwood genome assembly results. Notes: Hybrid assembly of redwood. Transcript set of Abies alba from Neale et al. Varying number of transcripts aligned to each genome (4,958 mapped to 22-fold HiFi reads, 4,970 mapped to 33-fold HiFi reads, 4,760 mapped to ONT). Assembly with 33-fold HiFi reads was done with 80 cores and an updated version of hifiasm (0.3.0).
Herculean efforts are being made by scientists around the world to respond quickly to the COVID-19 crisis in a race to understand the virus causing the pandemic and develop diagnostics, vaccines, and therapeutics. But many research questions remain. How can long-read SMRT Sequencing technology help fill the gaps?
PacBio microbiology expert Meredith Ashby highlighted several opportunities to support coronavirus research in a recent webinar as part of a day-long virtual conference hosted by LabRoots.
Sequencing the viral genome
Understanding the basic biology of the virus is essential, and the more detailed our investigation, the better.
Highly accurate, long-read sequencing of viral genomes has for years been used by virus researchers to access information that is challenging to resolve with short-read data. Mutation phasing and the detection of rare variants are both greatly simplified when entire viral genes or genomes can be sequenced in a single molecule, Ashby said.
This phasing information allows researchers to understand if there are mutational hotspots or regions of stability, both of which are important to understanding how to design a vaccine or therapeutic. Better phasing also allows researchers to understand how and how fast the virus evolves, either within one person during the course of an infection, or over time during the course of an outbreak or epidemic. In the context of the current coronavirus pandemic, Ashby said, having high-quality sequencing information can help track whether the virus evolves in such a way that test quality is compromised, as well as monitor whether drug resistance arises once therapeutics are developed.
To illustrate these points, Ashby discussed several examples of researchers using PacBio technology to gain unique insights into the biology of other viruses, including HIV and influenza. In one example from HIV research, scientists were able to determine that two distinct strains of the env gene had arisen in a single patient, and that one strain was compartmentalized in brain tissues. In another example from influenza research, combining PacBio and single-cell sequencing gave a comprehensively detailed view of how viral mutations evolved over the course of an infection, revealing both point mutations and indels.
Why do some people get so sick while others do not?
The answer to this question likely lies within the complex workings of our immune systems, which are still not completely understood. Ashby discussed how many of the regions of the genome that are involved in the immune response are highly polymorphic, and in the past have been quite challenging to sequence and to phase. These include the immunoglobulin heavy-chain (IGH) locus and the human leukocyte antigen (HLA) genes. In addition, long reads also offer several advantages for sequencing B cell receptor (BCR) repertoires.
BCR sequencing can help us identify broadly neutralizing antibodies in recovered patients, which could then be used as interim treatments ahead of the development of a vaccine. Ashby described how using a long-read approach allows sequencing of the entire BCR transcript, including those regions outside the CDR3 domain where mutations can arise during somatic hypermutation.
In addition, full-length sequencing allows for a simplified PCR reaction, which can avoid known issues with primer bias in the commonly used highly multiplexed short-amplicon PCR strategy. Finally, Ashby discussed how one reason to sequence the immune repertoire, inference of the IGH locus, can instead be accomplished by direct sequencing with PacBio long reads.
The ability to delve deeply into germline sequences and examine allelic differences from person to person could shed light on how the genetic background of patients may influence disease susceptibility, progression and outcomes.
Watch the webinar for examples of relevant research and more detailed application uses:
Interested in applying SMRT Sequencing in your COVID-19 research? Protocols, assay development recommendations, primer sources, workflows and checklists are available on the continually updated COVID-19 Sequencing Tools and Resources page.
Need to find a sequencing center? Connect with us and we’ll put you in contact with the closest service provider that is operational during the pandemic.
Our team is proud to announce that PacBio has been working closely with customers to help in the fight against the COVID-19 pandemic. Scientists in commercial, academic, and government research teams are using highly accurate SMRT Sequencing data to resolve variants of the SARS-CoV-2 virus that exist within one individual or across a population of patients, which is critical to developing and maintaining effective diagnostics, vaccines, and therapeutics.
Many of these efforts are powered by our HiFi reads, which are both long and highly accurate. Such reads are well-suited for applications like viral sequencing, which requires the ability to distinguish variants that may differ by only a handful of single nucleotide variants (SNVs) within a viral gene or across an entire viral genome.
Here’s a look at how some research teams are deploying PacBio sequencing for their coronavirus investigations:
- LabCorp, which is actively supporting the response to COVID-19 in the United States and globally, will work closely with PacBio to sequence a large number of SARS-CoV-2 viruses from de-identified positive samples. LabCorp’s scientific teams will use this information to shed light on virus evolution, mutations found in different geographic regions, and implications for disease severity and outcomes, helping to support more informed patient treatment decisions.
“As we strive to rapidly learn as much as possible about the biology of this novel coronavirus to help deal with the current pandemic and also to look ahead to future outbreaks, SMRT Sequencing will give us an accurate, high-resolution view of the pathogen.”
Marcia Eisenberg, PhD, Chief Scientific Officer of LabCorp Diagnostics
- Scientists at the Vaccine Research Center at the National Institute of Allergy and Infectious Diseases are planning to use the Sequel II System to study virus population diversity and minor variants in samples collected from infected individuals. The information could ultimately be used to support the design of effective vaccines and antibody-based therapies.
- At the University of California, San Diego, scientists are using SMRT Sequencing data to analyze SARS-CoV-2 samples. They will utilize targeted sequencing data to study the viral genome as well as shotgun metagenomics to characterize the microbiome of nasal tissues responding to a COVID-19 infection.
“We anticipate that the insights we will gain from HiFi sequencing on the Sequel II System will contribute significantly to our knowledge about SARS-CoV-2 and how it operates in people.”
Rob Knight, PhD, Director of the Center for Microbiome Innovation at UC San Diego
- At the Research Center Borstel, a member of the German Leibniz Association, scientists who focus on lung diseases will be sequencing SARS-CoV-2 samples and other lung pathogens collected from routine diagnostic samples to foster genomic diagnostic applications and study their spread and evolution.
- Researchers at the Vanderbilt Vaccine Center have used PacBio sequencing technology to study the human B-cell response to the SARS-CoV-2 virus, with the goal of identifying therapeutics or protective antibodies from patient samples.
“We are proud to support the rapidly expanding group of our customers who are engaged in this essential work and believe that the unique nature of SMRT Sequencing will allow them to delve into virus biology and host response research in a way that directly supports the development of much needed diagnostic tests, vaccines and medicines for managing COVID-19,” said Jonas Korlach, PhD, Chief Scientific Officer, PacBio.
Visit our COVID-19 Sequencing Resource Center to review the latest protocols, primer sets, and relevant publications.
We’re pleased to release a short video describing PacBio Sequencing and our latest platform, the Sequel II System. If you’ve ever wondered how Single Molecule, Real-Time (SMRT) Sequencing works, what the Sequel II System is, and what applications are available, this video is a great place to start.
We are excited to share the capabilities of our Sequel II System as it makes SMRT Sequencing affordable for scientists in any lab and provides comprehensive views of genomes, transcriptomes, or epigenomes. The Sequel II System also produces highly accurate long reads, known as HiFi reads, to deliver the highest quality sequencing data.
The three-minute introductory video outlines the key advantages of SMRT Sequencing:
- Long reads allow you to readily assemble complete genomes and sequence full-length transcripts
- High accuracy provides over 99.99% accurate sequencing results
- Uniform coverage enables sequencing through regions inaccessible to other technologies
- Single-molecule resolution lets you capture sequence data with over 99% single-molecule accuracy
- Epigenetics that can be explored through direct detection of base modifications during sequencing
Watch the video for more information about the science behind SMRT Sequencing. It covers the difference between circular consensus sequencing (CCS), which generates HiFi reads, and continuous long read (CLR) sequencing mode, and explains how that choice makes a difference for applications such as whole genome sequencing for de novo assembly, variant detection, and RNA sequencing.
If this introductory video piqued your interest, you can learn more about the Sequel II System, the advantages of SMRT Sequencing, or explore our resource library to learn how scientists worldwide are using SMRT Sequencing to advance their science.
In a new preprint, scientists from the National Human Genome Research Institute, the University of Washington, and other institutions describe HiCanu, a modified version of the Canu assembler designed specifically for PacBio HiFi reads. The team put the new assembler through its paces, reporting that it significantly outperformed traditional assembly methods — even getting through centromeres, segmental duplications, and other notoriously difficult regions.
As lead authors Sergey Nurk (@sergeynurk) and Brian P. Walenz, corresponding authors Sergey Koren (@sergekoren) and Adam Phillippy (@aphillippy), and collaborators report, “HiFi is a major leap forward in terms of long-read read accuracy.” They add, “As the accuracy of other long-read technologies have not exceeded 95%, the median accuracy of current HiFi reads can exceed 99.9% (>Q30), making them a promising data type for separating highly similar repeat instances and alleles.”
HiCanu applies homopolymer compression, overlap-based error correction, and tandem repeat masking to eliminate the few remaining errors in HiFi reads, resulting in 97% of reads matching perfectly to a curated reference sequence. This near-perfect accuracy helps to distinguish high-identity genomic repeats, as differences in HiFi reads can be trusted to be biological and not sequencing errors.
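To make the idea of homopolymer compression concrete: every run of identical bases is collapsed to a single base, so two reads that differ only in the length of a homopolymer run become identical after compression. The Python sketch below illustrates the concept only; it is not HiCanu's implementation, and the function name and example reads are made up.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


# Two hypothetical reads that differ only in homopolymer run lengths.
read_a = "GATTTTACAAAG"
read_b = "GATTTACAAG"

print(homopolymer_compress(read_a))
print(homopolymer_compress(read_b))
print(homopolymer_compress(read_a) == homopolymer_compress(read_b))
```

Both reads compress to the same string, so a disagreement in homopolymer length no longer registers as a difference between them, while genuine substitutions elsewhere in the read would still be visible.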
The new assembler generated draft assemblies of Drosophila and several human genomes. The HiCanu assemblies were all highly contiguous and extremely accurate. “On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity,” the scientists write. The reported difference in accuracy is especially large: the HiCanu assembly has 831× fewer errors than the assembly of ultra-long Oxford Nanopore reads.
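The accuracy percentages and quality values quoted here (Q30, QV50) are two views of the same quantity, related by the standard Phred formula Q = -10 · log10(error rate). The small Python sketch below converts between the two; the function names are illustrative.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert an accuracy fraction (e.g. 0.999) to a Phred-scaled quality."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred-scaled quality (e.g. 30) to an accuracy fraction."""
    return 1 - 10 ** (-q / 10)


for accuracy in (0.90, 0.99, 0.999, 0.99999):
    print(f"{accuracy:.5%} accurate -> Q{accuracy_to_phred(accuracy):.0f}")
```

Each step of 10 on the Q scale is a ten-fold reduction in error rate, so 99.9% accuracy corresponds to Q30 (one error per 1,000 bases) and 99.999% to Q50 (one error per 100,000 bases).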
The team zoomed in on certain regions known to be challenging — including centromeres, segmental duplications, and the MHC locus. For CHM13, the scientists report, “This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions.”
HiCanu also deftly handles haplotype phasing, with the authors stating that “HiCanu consistently recovers both haplotypes for the six canonical MHC typing genes in the human genome.”
The authors report several other advantages of HiCanu. First, assemblies generated by HiCanu do not require polishing. In fact, the authors “discourage polishing HiCanu HiFi assemblies, because… polishing pipelines may map reads back to the wrong repeat copies and actually introduce errors.” Second, HiCanu is computationally efficient: “The number of CPU hours required for assembly of a human genome is under 4,000, which could be completed on any modern cloud platform in less than a day for a few hundred dollars,” the team reports. “This is 30-fold less than recent Oxford Nanopore assemblies that required more than 100,000 CPU [hours].”
“We have demonstrated that HiCanu is capable of generating the most accurate and complete human genome assemblies to date,” the scientists write, pointing out that HiCanu could also be applied to non-human genomes, including metagenomic samples. “These results represent a significant advance towards the complete assembly of human genomes.”
Welcome to the Sequencing 101 blog series – where we will provide introductions to sequencing technology, genomics, and much more!
If you’re not immersed in the field of DNA sequencing, it can be challenging to keep up with the rapid evolution among all the platforms and technologies on the market. Let’s start with a quick overview of how these different technologies came about — and how each is used today.
First Generation Sequencing – Starting the Era of Genomics
DNA sequencing as we know it originated in the late 1970s, when Frederick Sanger at the MRC Centre in Cambridge developed a gel-based method that combined a DNA polymerase with a mixture of standard and chain-terminating nucleotides, known as ddNTPs. Mixing dNTPs with ddNTPs causes random early termination of strand synthesis during the sequencing reaction. Four reactions are run, each with the chain-terminating version of only one base (A, T, G or C). When visualized with gel electrophoresis, one reaction per lane, the fragments are sorted by length, allowing the DNA sequence to be read off base by base. This technique was revolutionary at the time, enabling sequencing of 500-1000 bp fragments. However, since the original method used radioactive ddNTPs and X-rays, it was less than ideal for widespread use.
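The logic of reading a sequence off a four-lane gel can be sketched in a few lines of Python. This is an idealized toy model, assuming every possible terminated fragment is present and perfectly resolved by length; the function names and the example sequence are made up for illustration.

```python
def chain_termination_lanes(synthesized):
    """Return the fragment lengths seen in each of the four reactions.

    In the reaction containing the chain-terminating version of a given
    base, synthesis can stop at every position where that base is
    incorporated, producing one fragment length per occurrence.
    Only the bases A, C, G and T are supported in this toy model.
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering bands from shortest to longest."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _length, base in sorted(bands))


lanes = chain_termination_lanes("GATTACA")
for base, lengths in lanes.items():
    print(base, lengths)
print(read_gel(lanes))
```

Each lane on its own only tells you where one base occurs; it is the combined ordering of bands by fragment length across all four lanes that recovers the full sequence.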
By the 1980s, Sanger’s original method had been automated by scientists at Caltech and commercialized by Applied Biosystems. Radioactive ddNTPs were replaced with dye-labelled nucleotides and large slab gels were replaced with acrylic-fiber capillaries. Scientists could now simply feed prepared DNA into a machine and view the results of fluorescence-based reactions on an electropherogram. This technology, which was continuously improved over the years, served as the bedrock of the Human Genome Project. Today, automated Sanger sequencing is still in use, primarily in clinical labs where it is acceptable to have low throughput, higher per-sample costs, and sequencing reads 500-1,000 bp in length.
But even after the Human Genome Project, the cost of automated Sanger sequencing — also known as capillary electrophoresis — remained too high to enable the kind of large-scale sequencing projects envisioned by scientists. By the mid-2000s, remarkable efforts had been made to bring down the costs of sequencing. Driven largely by grants from the National Human Genome Research Institute (NHGRI), labs around the world tested out new methods for higher-throughput sequencing, using concepts as diverse as electronics, physics, and magnetics.
Second Generation Sequencing – Short Reads Become Fast and Efficient
One key player in the advent of next-generation sequencing (NGS) was a UK-based company called Solexa, which was later acquired by Illumina. The key innovation of the Illumina platform was ‘bridge amplification’, which allows the formation of dense clusters of amplified fragments across a silicon chip. Amplification of the original single molecule into a large cluster of many copies is what makes it possible to detect a fluorescent signal as dNTPs are added one at a time during sequencing by synthesis. Over time, the number of clusters that could be read simultaneously grew tremendously, and Illumina instruments became the first commercially available massively parallel sequencing technology. Other tools developed around the same time, such as the Ion Torrent platform, became part of the NGS landscape as well. NGS platforms are the dominant type of sequencing technology used today. Their extreme capacity allows for sequencing at very low cost. They are limited, however, in read length; NGS platforms typically produce reads of ~50-500 bp in length. This makes them an excellent fit for resequencing projects, SNP calling, and targeted sequencing of very short amplicons.
Third Generation Sequencing – The Rise of Long Reads
However, short reads are not suitable for all sequencing projects. Another approach that was supported by the so-called $1,000 genome grants from NHGRI was Single Molecule, Real-Time (SMRT) Sequencing from PacBio. This technique uses miniaturized wells, known as zero-mode waveguides, in which a single polymerase incorporates labeled nucleotides and light emission is measured in real time. A different single-molecule approach to long-read sequencing, using pore-forming proteins and electrical detection, was adopted by Oxford Nanopore Technologies (ONT).
Watch this short video to learn how SMRT Sequencing works.
SMRT Sequencing has a number of advantages. Most notable, perhaps, is its ability to produce long reads that are tens of thousands of bases long. These long reads make it possible to span large structural variants and challenging repetitive regions that confound short-read sequencers because their short snippets cannot be differentiated from each other during assembly. Another advantage is low GC bias, which allows PacBio Systems to sequence through extreme-GC and AT regions that cannot be amplified during cluster generation on short-read platforms. A third advantage is the ability to detect DNA methylation while sequencing, since no amplification is done on the instrument.
As scientists began to work with SMRT Sequencing — sometimes known as third-generation sequencing — they realized that it had particular value for applications including de novo genome sequencing, phasing, detection of structural variants, epigenetic characterization, and sequencing of the transcriptome without the need for assembly. Technology improvements over time increased the throughput and accuracy of SMRT Sequencing platforms, bringing their costs in line with NGS platforms for many types of projects. Now, SMRT Sequencing has industry-leading accuracy thanks to its HiFi sequencing, and it is being used around the world to produce reference-grade genomes for microbes, plants, animals, and people.
It was a pleasure to attend the annual Advances in Genome Biology & Technology meeting in sunny Marco Island, Fla., last month. The conference has a long history of supporting sequencing innovation, and during the 20th anniversary celebration this year, the tradition continued. Video and synopses from several presentations featuring SMRT Sequencing are below.
Adam Ameur (@_adameur) from Uppsala University spoke about the use of long-read PacBio sequencing to detect off-target edits from CRISPR/Cas9. In a method known as SMRT-OTS, Ameur’s team used a clever adaptation of the standard PacBio library preparation to enrich for molecules bound by a guide RNA, which were then sequenced to generate HiFi reads. The team also used HiFi reads generated on the Sequel II System to create a de novo assembly of the human cell line used in the experiments. They found 55 off-target sites for three guide RNAs, including inexact matches to the guide RNA. Ameur’s group has already generated preliminary data from editing living cells, an exciting next step for this work. For more detail, check out their recent bioRxiv preprint.
Watch Ameur’s full AGBT 2020 presentation: Studying CRISPR Guide RNA Specificity by Amplification-Free Long-Read Sequencing
A talk about human reference genomes came from Tina Graves-Lindsay at Washington University in St. Louis and the Genome Reference Consortium. “The human reference is a work in progress,” she told AGBT attendees, offering an update on her team’s many contributions to that progress. They have been using SMRT Sequencing, most recently to produce diploid assemblies, and submitting the resulting high-quality assemblies to GenBank. They have moved to PacBio HiFi reads for human genome assemblies, she said, because accurate long reads eliminate the expensive error correction step in analysis and produce reference-grade assemblies with half the sequence coverage needed before. In one recent project using HiFi reads, Graves-Lindsay and her team generated a highly contiguous diploid assembly with 87% represented in haplotigs. She also reported on a new pangenome reference project, which aims to include sequence data from 350 individuals and generate telomere-to-telomere assemblies.
Watch Graves-Lindsay’s full AGBT 2020 presentation: Generating High Quality Human Reference Assemblies
Laura Mincarelli (@MincLaura) from the Earlham Institute gave a presentation that included the use of Iso-Seq data to uncover alternative splicing events in individual stem cells and progenitor cells. She noted that PacBio long-read data is advantageous for this approach because it helps measure cell traits by allowing users to view entire transcripts, not the snippets produced by other technologies. She reported that SMRT Sequencing led to another benefit: the detection of more than 2,100 novel exons in some 950 genes. This work is helping her understand the effects of aging in cells. See the preprint about this research, entitled ‘Combined single-cell gene and isoform expression analysis in haematopoietic stem and progenitor cells.’
Finally, Brenda Oppert from the U.S. Department of Agriculture shared results from the generation of reference-grade insect genome assemblies as part of a larger project to understand potential insect-based food sources for humans. With severe food shortages looking possible in just a decade, she said, “We’ve got to start thinking outside of the box now.” Insects could be a promising alternative protein source, so Oppert has been sequencing their genomes with PacBio technology. “The long reads are absolutely essential for insects,” she said. In cases like the mealworm, for instance, 60 percent of the genome is satellites consisting of units of 142 nucleotides with less than 2 percent sequence divergence. Oppert reported that on the Sequel II System, a single SMRT Cell provides sufficient coverage to produce a high-quality assembly for most insects.
Watch Oppert’s full AGBT 2020 presentation: Feed the World: Developing Genomic Resources for Insects as Food
The PacBio team also had the opportunity to present several posters at AGBT:
- Unbiased Characterization of Metagenome Composition and Function Using HiFi Sequencing on the PacBio Sequel II System – Meredith Ashby, et al.
- Amplification-Free Protocol for Targeted Enrichment of Repeat Expansion Genomic Regions & SMRT Sequencing – Yu-Chih Tsai, et al.
- New Advances in SMRT Sequencing Facilitate Multiplexing for de novo & Structural Variant Studies – Primo Baybayan, et al.
- Copy-Number Variant Detection with PacBio Long Reads – Aaron Wenger, et al.
- A Complete Solution for Accurate Full-Length Transcript Sequencing Using SMRT Sequencing – Jason Underwood, et al.
Since the first PacBio instrument was released in 2011, methylation detection has been one of the advantages of SMRT Sequencing. The kinetics of nucleotide incorporation change as the DNA polymerase moves across a methylated position on the DNA template strand, producing distinctive perturbation patterns (Figure 1) that can be recognized by methylation-calling software.
With the advent of a simple method for detecting methylation in prokaryotes, researchers have demonstrated that in addition to functioning as a defense against phages, bacterial restriction-modification (R-M) systems can also drive important traits like antibiotic resistance, immune evasion, virulence and persistence in hosts.
Recent internal validation work has confirmed that detection of m6A and m4C in prokaryotic DNA, along with the R-M system target motifs in which they reside, continues to perform robustly on the Sequel II System. Detection of m5C continues to require significantly higher coverage and is therefore not supported through the SMRT Analysis ‘Base Modification Analysis’ workflow.
Our initial validation was done on E. coli K, sequenced as part of a 48-plex sequencing run on the Sequel II System (Figure 2). All three known m6A motifs were successfully detected. In addition, the high coverage weakly detected the known target of the Dcm m5C methylase, CCWGG. However, since m5C calling is not supported, it was erroneously tagged as m6A.
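Motifs such as CCWGG are written with IUPAC ambiguity codes, where W stands for A or T. The Python sketch below shows one way to locate occurrences of such a motif in a sequence by translating it to a regular expression. It is a generic illustration, not the motif finder used in SMRT Analysis, and the function names and example sequence are made up.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[code] for code in motif.upper())
    # The zero-width lookahead lets finditer report overlapping matches.
    return re.compile(f"(?=({body}))")


def find_motif(sequence, motif):
    """Return 0-based start positions of every motif occurrence."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence.upper())]


# Hypothetical sequence containing CCAGG, CCTGG and a non-matching CCGGG.
example = "TTCCAGGAACCTGGCCGGG"
print(find_motif(example, "CCWGG"))
```

In the example, CCAGG and CCTGG both match the motif while CCGGG does not, because G is not covered by W.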
An important takeaway is that to obtain the cleanest motif-finding result, the ‘Minimum Qmod Score’, available as an advanced parameter in the ‘Base Modification Analysis’ application in SMRT Analysis, had to be increased manually. As shown by the red arrow in Figure 2, this value should be set such that it excludes most baseline noise while fully including the cloud of methylation signal. In this example, the ideal setting is Qmod = 200. While the optimal value of Qmod changes with sequencing coverage, we have found a value of 100 produces a good result in most cases when sequencing 48 microbes per SMRT Cell 8M.
To better assess performance across the full range of methylation patterns seen in microbes, we then analyzed data from 4 more challenging microbes. These more difficult examples confirm that the Sequel II System can detect both m6A and m4C at the same level of performance seen with our previous sequencing systems. The known R-M systems in Neisseria meningitidis FAM18 (Table 1), Treponema denticola A (Table 2), and Methanocorpusculum labreanum Z (Table 3) were largely recovered at high confidence. The few exceptions are likely due to competition between multiple methyltransferases that target overlapping motifs.
The most difficult test case was H. pylori J99, which carries 24 distinct R-M systems, targeting m6A, m4C, and m5C. We called 21 of the 24 motifs exactly. In one instance our motif caller was confounded by overlapping motifs, but the correct answer could be easily discerned by visual examination. The remaining two missed motifs involve m5C, which continues to be unsupported.
We hope these results will give all our customers who study prokaryotic methylation the confidence to move forward with planning bacterial whole genome sequencing experiments on the Sequel II System, taking full advantage of the higher multiplexing capacity and reduced per sample cost.
Meerkats, yaks, geese, and lots of flies — oh my! A full menagerie of new and updated animal genomes has been released by the Ensembl project.
Thirteen of the new assemblies have been produced by the Vertebrate Genome Project (VGP):
- Canada lynx
- Greater horseshoe bat
- Golden eagle
- Jewelled blenny (pictured, right)
- Pinecone soldierfish
- Live sharksucker
- Orbiculate cardinalfish
- Gilthead seabream
- River trout (also part of the Sanger 25 Genomes Project)
- Zebra finch (updated assembly)
- Asian bonytongue (updated assembly)
- Fugu (updated assembly)
Part of Ensembl’s mission is to provide gene annotation for the genome assemblies produced by this long-term global collaboration.
The release also included six more mammalian genome assemblies:
- Siberian musk deer
- Chacoan peccary
- Sperm whale
- Meerkat (pictured, right)
- Arabian camel
- Domestic yak
Nine more bird genome assemblies:
- Gouldian finch
- Yellow-billed parrot
- Burrowing owl
- African ostrich (pictured, right)
- Swan goose
- Indian peafowl
- Eurasian sparrowhawk
- Golden pheasant
- Ring-necked pheasant
Ten more fish genome assemblies:
- Golden-line barbel
- Blind barbel (pictured, right)
- Horned golden-line barbel
- German mirror carp
- Hebao red carp
- Huanghe carp
- Atlantic salmon (also part of the AquaFAANG project)
- Blue tilapia
- Round goby
- Nile tilapia (updated assembly)
And four new reptiles:
- Komodo dragon (pictured, right)
- Common wall lizard
- Eastern brown snake
- Three-toed box turtle
There are also 35 new metazoa genome assemblies, including:
- 18 new Anopheles mosquito species
- an update from the L3 to L5 assembly for Aedes aegypti
- the vector of Zika virus Aedes albopictus
- six Tsetse fly species
- two Sand fly species
- the freshwater snail vector of schistosomiasis (Biomphalaria glabrata)
- the common bedbug (Cimex lectularius)
- the Lyme disease tick (Ixodes scapularis)
- common house fly (Musca domestica)
- stable fly (Stomoxys calcitrans)
And there are some sweet additions to plant genomes:
- Sweet cherry (Prunus avium)
- Clementine (Citrus clementina)
- Morning glory (Ipomoea triloba)
- Wild sugarcane (Saccharum spontaneum)
Started in 1999 to annotate the human genome and make all data publicly and freely available via the web, the Ensembl project is based at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), located on the Wellcome Genome Campus near Cambridge, UK, and now involves hundreds of scientists from around the world.
PacBio highly accurate long reads, known as HiFi reads, offer all the benefits of long-read sequencing with accuracy comparable to short-read sequencing. To celebrate this new paradigm in sequencing technology, we hosted the 2019 HiFi for All SMRT Grant this past fall. This SMRT Grant was open to scientists worldwide and offered three winning projects each up to six SMRT Cells 8M and sequencing on the Sequel II System by our Certified Service Providers and co-sponsors.
In response to our call for projects across the range of SMRT Sequencing applications, we received many truly compelling proposals, which made selecting the winners quite a challenge. Today, we are thrilled to announce the three winners of this SMRT Grant and share a glimpse into how they will use HiFi sequencing to tackle a diverse set of scientific questions.
Holding on by a Claw: Elucidating the Genomics of the African Leopard
Winner: Ellie Armstrong (@_ellie_cat), Stanford University
Synopsis: This project will generate a high-quality genome assembly for the African leopard, a big cat facing endangered status due to habitat loss, hunting, and illegal wildlife trade. Very little genetic information has been produced for leopards. A high-quality assembly will be important for conservation genomics and for investigating genetic and structural variation across leopard subspecies.
“We are thrilled to be working with PacBio to produce a high-quality assembly of the African leopard. Leopards are extremely elusive, making them a prime species for the development of genomic monitoring tools. This genome will allow us to investigate the distribution of genomic diversity of leopards, their evolutionary history, and gain insight into how they adapt to such a wide variety of landscapes.” – Ellie Armstrong
Sequencing for this project will be provided by Georgia Genomics and Bioinformatics Core.
Establishing the Largest Longitudinal HIV Sequence Database Ever Assembled
Winner: Daniel Sheward (@DannySheward), University of Cape Town
Synopsis: In a collaboration between the University of Cape Town (PI: Carolyn Williamson), the National Institute of Communicable Diseases of South Africa (PI: Penny Moore) and the Karolinska Institutet (PI: Ben Murrell), scientists will use PacBio sequencing for more than 1,000 samples collected from 150 women in the South African CAPRISA Acute Infection cohort to perform a longitudinal study of HIV infection. A highly multiplexed approach will allow for HiFi sequencing of the virus in all samples to generate the largest sequence database of longitudinally collected HIV samples. The information gleaned from this database is expected to contribute to the research community’s understanding of viral evolution, latency, immunology and vaccine development.
“We are extremely excited about this project. With HiFi sequencing on the Sequel II System, a project of this scope is finally feasible.” – Daniel Sheward
Sequencing for this project will be provided by the Earlham Institute.
Asian Reference Genome
Winner: Jianjun Liu, Genome Institute of Singapore
Synopsis: With this SMRT Grant, scientists will sequence the genomes of three individuals — one each of Chinese, Indian, and Malay descent. This is part of a larger effort at the Genome Institute of Singapore to generate Asian population-specific reference genomes for improved variant calling for people in these populations. The assemblies produced through the SMRT Grant will be used to analyze structural variation, evaluate different genome assemblers and adapt the Institute’s methods for HiFi data. Ultimately, population-specific data will play an important role in the implementation of precision medicine for people of all ancestries.
“We are excited about this project and very thankful for support by PacBio and DNA Link. Genomic analysis of Asian populations has fallen behind the efforts in western populations. We hope that our effort can help to improve it by providing tools and resources that can empower the studies of Asian populations.” – Jianjun Liu
Congratulations to all our HiFi for All SMRT Grant winners! And thank you to our co-sponsors for teaming up with PacBio to make these SMRT Grants possible. Explore the 2020 SMRT Grant Programs to apply to have your project funded.
With high-throughput long-read sequencing, it is now affordable and routine to produce a de novo genome assembly for microbes, plants and animals. The quality of a reference genome impacts biological interpretation and downstream utility, so it is important that researchers strive to achieve quality similar to “finished” assemblies like the human reference, GRCh38.
Until sequence data and the resulting assemblies can regularly achieve reference quality, assemblies should be evaluated in three key dimensions: Contiguity, Completeness, and Correctness. However, the most commonly used measures of genome quality only tackle two of the three C’s.
Contiguity is often measured as contig N50: sort the contigs from longest to shortest and add up their lengths until the running total reaches 50% of the total assembly length; the length of the last contig added is the N50. In this era of long-read genome assemblies, a contig N50 over 1 Mb is generally considered good.
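To make the N50 definition concrete, here is a minimal Python sketch. The function name and the example contig lengths are made up for illustration and do not come from any particular assembly or tool.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached 50% of the assembly
            return length


# Hypothetical contig lengths (in bp) used only as an example.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(example_lengths))  # 80 + 70 = 150 is half of 300, so this prints 70
```

In practice the lengths would be read from the assembly FASTA or its index rather than typed in by hand.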
Completeness is often measured using BUSCO (Benchmarking Universal Single-Copy Orthologs) scores, which look for the presence or absence of highly conserved genes in an assembly. The aim is to have the highest percentage of genes identified in your assembly, with a BUSCO complete score above 95% considered good.
Correctness, the third and final C, is more challenging to measure. Correctness can be defined as the accuracy of each base pair in the assembly and is most often measured as concordance of an assembly to a gold standard reference. Of course, when sequencing a novel species there may not be a reference against which to measure. Furthermore, concordance is only a good measure for accuracy when the gold-standard itself is very high quality and when there is little biological divergence between the reference sample and assembly sample (Figure 1).
So, how does one properly measure the accuracy of a generated genome assembly? Well, we explored several methods you might find useful and broke them down by what type of orthogonal data is needed for each.
Data needed: Transcript Annotations
One measure of correctness is the number of frameshifting indels in coding genes. Frameshifts often disrupt the production of the protein encoded by the gene and are rare. Thus, most observed frameshifts are actually assembly errors.
This approach is similar to BUSCO but rather than utilizing a small conserved set of genes, a larger set of genes are analyzed. This requires a set of transcripts from the same (or very closely related) sample, which are commonly generated as part of a genome annotation project. PacBio RNA Sequencing, using the Iso-Seq method, is a good strategy for genome annotation.
The primary advantage of this approach is that you may be able to use an annotation or RNA sequencing data that already exists. The primary disadvantages are that it assesses only a small percentage of the genome (often less than 1%, and typically some of the most conserved regions) and that it may underestimate accuracy, since not all frameshifts are errors.
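As a rough illustration of the frameshift idea, the sketch below counts insertions and deletions whose length is not a multiple of three in alignments of coding sequences to an assembly. It assumes the alignments are available as standard CIGAR strings covering coding sequence only; the gene names and CIGAR strings are invented for the example, and the sketch ignores the case where two nearby indels restore the reading frame.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def frameshift_indels(cigar):
    """Count insertions/deletions whose length is not a multiple of three.

    'N' operations (introns in spliced alignments) are ignored because
    they do not change the reading frame of the transcript.
    """
    count = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "ID" and int(length) % 3 != 0:
            count += 1
    return count


# Hypothetical coding-sequence alignments, for illustration only.
alignments = {
    "geneA": "300M",                # no indels
    "geneB": "120M1D179M",          # 1 bp deletion: frameshift
    "geneC": "90M3I210M",           # 3 bp insertion: stays in frame
    "geneD": "50M2I100M1200N150M",  # 2 bp insertion plus an intron
}

genes_with_frameshift = [g for g, c in alignments.items() if frameshift_indels(c) > 0]
print(len(genes_with_frameshift), "of", len(alignments), "genes carry a frameshifting indel")
```

The fraction of genes with at least one frameshifting indel is then a simple proxy for assembly correctness in coding regions.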
DELVE INTO HIGH-CONFIDENCE REGIONS
Data needed: Reference Genome & Short Reads
Sometimes a reference genome is available for the same species but for a different individual than the one being assembled. In such cases, it is useful to define “high-confidence regions” where the reference is a good match to the sample and then assess the assembly only within those high-confidence regions. The Genome in a Bottle Consortium has applied such an approach for human samples.
To build high-confidence regions as Kingan, et al. did for human, rice, and Drosophila, short-read sequencing data is mapped to the reference and used to exclude low-confidence regions including those with abnormal coverage or in close proximity to variants. Within the resulting high-confidence regions, concordance is a good measure of assembly accuracy.
The advantages of this approach are that it provides a good measure of assembly accuracy and explicitly identifies errors as discordances between an assembly and the high-confidence regions. Discordances can then be examined to determine how to improve the assembly.
The disadvantages are that it requires an independent reference genome from which to start, as well as additional short-read data. Also, the accuracy estimate can be somewhat optimistic by excluding “difficult” regions from evaluation or somewhat pessimistic if true biological variants are not removed from the benchmark.
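The high-confidence-region idea can be sketched in a few lines of Python. This is an illustrative simplification, not the procedure used by Kingan, et al. or Genome in a Bottle: the depth thresholds, flank size, and all input values below are assumptions chosen for the example.

```python
def high_confidence_mask(depths, variant_positions, min_depth, max_depth, flank):
    """Mark each reference position as high confidence (True) or not (False).

    A position is excluded if its short-read depth is abnormal or if it
    lies within `flank` bases of a called variant.
    """
    mask = [min_depth <= d <= max_depth for d in depths]
    for pos in variant_positions:
        for i in range(max(0, pos - flank), min(len(depths), pos + flank + 1)):
            mask[i] = False
    return mask


def concordance(mask, mismatch_positions):
    """Fraction of high-confidence bases where assembly and reference agree."""
    confident = sum(mask)
    if confident == 0:
        raise ValueError("no high-confidence bases to evaluate")
    errors = sum(1 for p in mismatch_positions if mask[p])
    return 1 - errors / confident


# Toy inputs: per-base short-read depth, one variant, two assembly mismatches.
depths = [30, 31, 29, 90, 30, 30, 2, 30, 30, 30]
mask = high_confidence_mask(depths, variant_positions=[8], min_depth=10, max_depth=60, flank=1)
print(concordance(mask, mismatch_positions=[1, 3]))  # position 3 is excluded by its depth
```

Only the mismatch at position 1 counts against the assembly here, which shows how excluding difficult regions can make the estimate optimistic.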
EXPLORE BAC COMPARISONS
Data needed: BAC Sequences
In cases where a reference is not available but another set of high-quality sequences, such as Bacterial Artificial Chromosomes (BACs), exist for the same sample, you can measure concordance between your assembled contigs and the BAC sequences. This method was used by Vollger, et al. when validating the accuracy of one of the first human assemblies generated using PacBio highly accurate long reads, known as HiFi reads.
COUNT ERRORS WITH SHORT READS
Data needed: Short Reads
It is possible to measure accuracy even for a species with no existing reference genome by comparing the k-mers in an assembly to k-mers from short reads from the same individual. One tool to do this is yak from Heng Li. Another is merqury, developed by Arang Rhie in Adam Phillippy’s group.
The advantages of this approach are that it does not require a reference genome and does not ignore difficult regions of the assembly. It also provides a way to measure completeness by flipping the comparison and looking for k-mers present in the short reads that are missing in the assembly. Merqury has the additional ability to track the coordinates of errors: it outputs files that can be loaded as IGV tracks so the user can visualize misassemblies or other errors. Merqury has many additional functions like outputting spectra-cn plots and, for users with a trio, assessing contig phasing accuracy with statistics and plots.
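The k-mer comparison can be illustrated with a toy Python sketch. It is not how yak or merqury are implemented. It holds all k-mers in memory, assumes uppercase A/C/G/T input, and converts the fraction of assembly k-mers missing from the reads into a Phred-style quality value using a simple k-mer survival model; treat that formula as an assumption of this sketch.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_list(seq, k):
    """All canonical k-mers of a sequence, in order, with repeats kept."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def qv_from_counts(missing, total, k):
    """Phred-scaled quality from the number of assembly k-mers absent in the reads."""
    if total <= 0:
        raise ValueError("total must be positive")
    if missing == 0:
        return math.inf  # no evidence of errors at this k
    p_base_correct = (1 - missing / total) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


def assembly_qv(assembly, reads, k):
    """Estimate assembly quality by checking its k-mers against read k-mers."""
    read_kmers = set()
    for read in reads:
        read_kmers.update(kmer_list(read, k))
    asm_kmers = kmer_list(assembly, k)
    missing = sum(1 for km in asm_kmers if km not in read_kmers)
    return qv_from_counts(missing, len(asm_kmers), k)


# Toy example: the second "assembly" differs from the read at one base.
print(assembly_qv("ACGTTGCA", ["ACGTTGCA"], k=3))
print(assembly_qv("ACGTAGCA", ["ACGTTGCA"], k=3))
```

Swapping the roles of reads and assembly in `assembly_qv` gives the completeness-style comparison mentioned above. Real tools use much larger k and filter out low-count read k-mers that are likely sequencing errors.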
Similar to the k-mer approach above, short-read data can be used to count errors by aligning short reads to the assembly and identifying single nucleotide differences. An error rate can then be calculated by dividing the total count of SNVs by the number of bases in an assembly covered by at least 3 short reads. The short-read data can be from a closely related individual, although estimates of correctness are most accurate when the same individual is used as Koren, et al. did with an F1 cross of two breeds of cattle.
Like the k-mer method, this approach requires no reference genome or transcript dataset and does not ignore difficult regions of the assembly. Unlike the k-mer approach, potential errors in the assembly can be identified and characterized in order to improve the assembly method.
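The alignment-based error rate described above is a simple ratio, sketched here in Python. The per-base depth list and SNV count are invented values; in a real analysis they would come from a read aligner and a variant caller.

```python
def snv_error_rate(snv_count, depth_per_base, min_depth=3):
    """Errors per base: SNV count divided by bases covered by >= min_depth reads."""
    covered = sum(1 for depth in depth_per_base if depth >= min_depth)
    if covered == 0:
        raise ValueError("no bases meet the minimum depth")
    return snv_count / covered


# Hypothetical short-read depth at each assembly position.
depth_per_base = [0, 2, 3, 5, 10, 1, 4, 3]
print(snv_error_rate(snv_count=1, depth_per_base=depth_per_base))  # 1 SNV over 5 covered bases
```

Restricting the denominator to well-covered bases avoids crediting the assembly for positions where the short reads could not have revealed an error.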
Looking to the Future
Exciting progress in long-read sequencing and genome assembly has made it standard to produce contiguous, complete genomes. In order to generate genomes that are not simply assembled, but are also effectively used for downstream biology, we must address the third dimension of quality: correctness. The techniques discussed above make it easy to measure correctness with a variety of different orthogonal data types. We expect these approaches will identify which sequencing workflows produce the most accurate genomes and will nudge the field towards an era of reference-grade de novo assemblies.
Telomeres and centromeres have long vexed genomic scientists. In the early days of genome sequencing, many researchers took it for granted that assembling these highly repetitive regions was essentially impossible.
That’s why a new preprint posted to bioRxiv is so exciting. Scientists from Weill Cornell Medicine and Colorado State University describe the use of PacBio long-read whole genome sequencing to analyze and assemble telomeres, characterizing the heterogeneity of these elements across three human genomes from the Genome in a Bottle collection (HG001, HG002, HG005).
“Haplotype Diversity and Sequence Heterogeneity of Human Telomeres” comes from lead authors Kirill Grigorev (@LankyCyril) and Jonathan Foox (@jfoox), senior author Chris Mason (@mason_lab), and collaborators. They took on this project to overcome existing challenges with assembling telomeres and to establish a better protocol that others could replicate.
“Given their length and repetitive nature, telomeric regions are not easily reconstructed from short read sequencing, making telomere sequence resolution a very costly and generally intractable problem,” the authors write. “We describe a framework for extracting telomeric reads from single-molecule sequencing experiments, describing their sequence variation and motifs, and for haplotype inference.”
Short reads, which are typically no more than a few hundred bases, can read DNA in telomeric regions, but during alignment they struggle to differentiate the highly repetitive regions and to represent them accurately without collapsing several repeats into one. Highly accurate long PacBio CCS reads, known as HiFi reads, produced by SMRT Sequencing can represent tens of thousands of base pairs in one long stretch. This greatly reduces the alignment challenge, facilitating the accurate assembly of even the most repetitive regions in the genome.
“We find that long telomeric stretches can be accurately captured with long-read sequencing,” the scientists report. In the preprint, they describe the ability to observe sequence heterogeneity, discover novel and known non-canonical motifs, and create motif composition maps. Their framework, known as edgeCase, was validated with PacBio sequencing data sets from the Genome in a Bottle consortium.
While the team’s results confirmed that TTAGGG, the canonical repeat associated with telomeric regions, is the dominant motif, there was “a surprising diversity of repeat variations” including known and novel variants. This previously untapped diversity was masked by “the necessary bias towards the canonical motif during the selection of short reads,” the scientists suggest. “Telomeric regions with higher content of non-canonical repeats are less likely to be identified through the use of short reads, and instead, long reads appear to be more suitable for this purpose,” they add.
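To give a feel for what a motif composition summary looks like, here is a deliberately simplified Python sketch. It is not the edgeCase framework: it tiles a read into six-base windows starting at the first canonical TTAGGG and counts each window, so it assumes every repeat unit is exactly six bases long. The example read is made up.

```python
from collections import Counter

CANONICAL = "TTAGGG"


def motif_counts(read, motif_len=6):
    """Count non-overlapping windows of motif_len, in frame with the first TTAGGG."""
    start = read.find(CANONICAL)
    if start == -1:
        return Counter()
    tiles = (
        read[i:i + motif_len]
        for i in range(start, len(read) - motif_len + 1, motif_len)
    )
    return Counter(tiles)


def canonical_fraction(read):
    """Fraction of tiled windows that match the canonical repeat."""
    counts = motif_counts(read)
    total = sum(counts.values())
    return counts[CANONICAL] / total if total else 0.0


# Made-up telomeric read with one non-canonical repeat unit.
read = "TTAGGG" * 3 + "TTGGGG" + "TTAGGG" * 2
print(motif_counts(read))
print(canonical_fraction(read))
```

A read with a low canonical fraction is the kind that a selection step biased toward TTAGGG would tend to miss, which is the point the authors make about short reads.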
The team concludes: “The identified variations in long range contexts enable clustering of SMRT reads into distinct haplotypes at ends of chromosomes, and thus provide a new means of diplotype mapping and reveal the existence and motif composition of such diplotypes on a multi-Kbp scale.”
The rarest day on the calendar is February 29th — which makes it the perfect time to celebrate Rare Disease Day. On this day, we join millions of people around the world making time to honor the patients, caregivers, healthcare professionals and scientists who deal with rare diseases every day.
And we didn’t have to look far to find someone affected.
Bioinformatician John Harting, of our Applications Development group at PacBio, came face-to-face with the rare disease that is the most common genetic disease in infants, spinal muscular atrophy (SMA), when his first child, Zoe, was three months old.
Everything seemed normal when she was born in October 2012, but John and his wife Eliza started to notice that Zoe didn’t have much strength and wasn’t very active.
“We were new parents so we didn’t know any better,” John says. “We asked the doctor a couple times, but we kept getting answers like: ‘Some children develop slower,’ or ‘Maybe she has a bit of hypotonia that will go away.’ Basically, they were reluctant to look much deeper.”
During a holiday visit to the in-laws, however, John noticed how little his daughter was moving compared to her cousin, who was a few weeks younger. The couple returned to their doctor and demanded testing. A few weeks later, they got a call from a neurologist, suggesting they come in to meet with a doctor and social worker.
“It was pretty scary to get that call and to drive to the hospital expecting bad news,” John says.
And the news was indeed bad. Zoe was diagnosed with Type 1 SMA, the most severe form of the fatal degenerative neuromuscular disease.
“They told us, basically, that within two years she was going to pass, and we could just hold her and love her and let her go. That was all we could do. There were no treatments at the time.”
Fortunately, that was not the case. The couple switched pediatricians, and the new doctor happened to have attended a conference where she heard a presentation by Stanford pediatric neurologist John Day about a new potential SMA treatment, and a clinical trial recruiting candidates.
Zoe was accepted into the trial and became the first child in the world to receive the experimental drug nusinersen, now approved by the FDA and sold as Spinraza.
Zoe is now a 7-year-old first grader with a strong personality and growing independence, who loves to chase her classmates around the playground in her mini motorized wheelchair.
“She’s slowly started to achieve milestones that SMA Type 1 patients never did before,” John says.
Contributing to the Cure
John says he is proud to be working at a company that is helping to make further advances in rare disease treatment possible.
While any given rare disease affects a relatively small number of people, these diseases collectively affect some 400 million people around the world. About 7,000 rare diseases are currently recognized.
In the case of SMA, the disease is triggered by a gene mutation that is actually quite common: it is estimated that 1 in 40 people carry it. When both parents have the mutation, their child can develop the disorder, which causes muscles to atrophy because they don’t receive the right signals from the spinal cord.
Genetic screening and early intervention are crucial, but screening is tricky. Standard tests that use PCR (polymerase chain reaction) and short-read technologies don’t always put mutations in the right context. For example, there may be “pseudogenes,” where big stretches of the genome are replicated and look almost identical, making it difficult to distinguish between true carrier genes and other kinds of unusual gene conversions.
“There can be false negatives. Someone could look like they are not a carrier, when in fact they have two copies of the gene, but they are both on one chromosome, instead of one on each chromosome,” John says.
PacBio sequencing is changing that, by allowing scientists to read parts of the genome that have been difficult to reach until now. John has also been collaborating with others to dive deeper into target genes and to develop assays for diagnostics and drug development.
“A lot of these rare diseases are genetic, and are in places that are difficult to sequence, where they overlap with more common diseases,” John says.
“There’s a lot of interesting stuff that goes on that’s almost invisible to some of the other technologies and we can help to learn more about them and contribute to some important medical discoveries.”
Helping to SOLVE Rare Diseases
Rare disease researchers are increasingly turning to PacBio long-read sequencing technology to study areas of the genome inaccessible by other means, or to unravel complex disease-causing variants, such as tandem repeats, structural variants, complex rearrangements, and transposable elements.
Long reads can be a straightforward way to detect repeat changes because an adequately long read can encompass an entire expanded repeat as well as flanking unique sequences, for example.
In a review in the Journal of Human Genetics, Satomi Mitsuhashi and Naomichi Matsumoto from the Yokohama City University in Japan note that “long-read sequencing is especially highly recommended when repeat diseases or complex chromosomal rearrangements are suspected.”
The SOLVE-RD research program, a European-based consortium of more than 20 institutions, is also using the PacBio Sequel II System to sequence more than 500 whole human genomes with the aim of pinpointing disease-causing variants.
“Even with exome sequencing, as many as 50% of rare disease cases remain unsolved. The SOLVE-RD team believes that long-read SMRT Sequencing will be essential for discovering the causal elements that have proven elusive with previous approaches, and we anticipate that this research will ultimately make it easier for doctors to diagnose other patients with these rare diseases in the future,” said SOLVE-RD team member Alexander Hoischen, Associate Professor for Genomic Technologies and Immuno-Genomics at Radboud University Medical Center.
Hoischen gave examples of recent discoveries in a number of conditions, from ALS and FTD, SCA10 and Parkinson’s disease, to myotonic dystrophy, Bardet-Biedl syndrome, and fragile X disorders, in a review paper in Frontiers in Genetics.
At PacBio, we are continually inspired by our users who focus on the study of rare diseases, and we are committed to supporting such research. Two recent SMRT Grants were awarded for projects devoted to advancing our understanding of spinocerebellar ataxia (Cleo van Diemen at the University Medical Center Groningen) and myotonic dystrophy (Stéphanie Tomé of the Centre de Recherche en Myologie at Sorbonne Université/INSERM in Paris).
We are also official supporters of Rare Disease Day, and we ask that you join us this Saturday by taking the opportunity to honor the entire rare disease community, including those who live with a disease and the many researchers striving to improve their situation. Participate on social media using #RareDiseaseDay, #ShowYourStripes, or #ShowYourRare. Or make it an IRL experience: find events to join in the U.S. or around the world to help show your support and raise awareness for those affected.
You can also support the Harting family’s efforts to purchase a van to transport Zoe more easily.
To hear the latest in rare disease research, register to attend our upcoming webinar on May 27: Increasing Solve Rates for Rare and Mendelian Diseases with Long-read Sequencing.
The genome of the rose is almost as complicated as its connotations when given as a gift on Valentine’s Day or other special occasions.
Although the rose genome is relatively small, at 400-750 Mb across seven chromosomes, rose cells can carry multiple sets of chromosomes beyond the basic set, and the number of sets varies widely between commercial varieties. Some are diploids, with two homologous copies of each chromosome (like humans, with one from the mother and one from the father), while others can have as many as five different sets (pentaploids). Most are tetraploids, with four sets of chromosomes.
To further complicate things, many roses are “segmental allotetraploids,” which means that part of the genome is behaving like an allotetraploid (with four chromosome sets from two distinct species, which occurs during hybridization) – and part of the genome is behaving like an autotetraploid (with four sets of homologous chromosomes).
Needless to say, parsing all of this out is challenging. But researchers from the Netherlands recently presented their solution, using HiFi reads generated by the Sequel II System.
In a workshop discussion at PAG XXVIII, Bart Nijland (@bart3601) of Genetwister Technologies (@genetwister), explained how his team set out to make a haplotype-aware assembly of Rosa x hybrida L. in order to capture its full range of genetic variation, rather than rely on more traditional assemblies which collapse the haplotypes into single sequences that could be missing critical information.
“For a highly heterozygous, highly complex, commercially important species like the rose, there is a huge benefit to making a haplotype-aware assembly,” Nijland said. “A lot of the existing technologies don’t perform very well in doing this. So we were very happy when PacBio released its HiFi protocol. Due to the high accuracy of the reads, we thought this could really help us in solving this challenge.”
The next challenge was isolating DNA from the leaf tissue of a tetraploid rose variety, which is notoriously difficult because of secondary metabolites. Once that was overcome and the sample was processed to create a HiFi SMRT library, speedy sequencing of four SMRT Cells 8M was performed on the Sequel II System at Radboud UMC. The result was more than two terabytes of raw polymerase data, with an average yield of more than 500 Gb per SMRT Cell.
“We did a k-mer analysis to investigate the heterozygosity of the sample. Due to the high accuracy of the reads, we could nicely see four distinct peaks, which you would expect in a heterozygous, tetraploid sample,” Nijland said. “And when mapping the HiFi reads, we could already distinguish four haplotypes. So we were very happy to see this.”
In order to get an even better picture of the variation between the diploid and tetraploid varieties, Nijland and colleagues, including Henri van de Geest (@geesthc) and Mark de Heer, performed a de novo assembly using FALCON and Canu.
“Our assembly is very much improved and we were able to separate many of the haplotypes,” Nijland said.
The next step is to improve the assemblies even further by using Bionano or Hi-C technologies, which Nijland hopes will help separate some of the alleles that are extremely similar because the rose is a segmental allotetraploid.
“We managed to assemble a heterozygous, polyploid genome, without the need for ultra high molecular weight DNA, which is required for a lot of other long-read sequencing,” Nijland said. “Also, the sequence coverage which is required in the assembly is lower, and because of the high accuracy, the computation of the assemblies is much less.”
“Most importantly, we’re getting a better representation and better overview of genomic content in the assembly. This provides a very valuable tool for molecular breeding efforts in rose.”
Catch up on other PAG presentations in a recent blog post and watch Nijland’s full PAG talk here: