This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
We’re excited to announce a research collaboration with Invitae focused on the investigation of clinically relevant molecular targets for use in the development of advanced diagnostic testing for epilepsy. To support this collaboration, Invitae is expanding its PacBio sequencing capacity to meet the growing demand for clinical applications dependent on highly accurate genomic information.
More than half of epilepsies can be traced to a genetic cause. When a child presents with seizures, genetic testing can help identify more than 100 underlying, often rare conditions. Early genetic testing may be the most cost-effective, direct, and accurate diagnostic tool for children, shortening lengthy diagnostic odysseys. Delays in diagnosis can be devastating for children, as some genetic epilepsies are neurodegenerative and early symptoms may be subtle and easy to misdiagnose.
The Behind the Seizure program is a prominent collaborative program established by BioMarin and Invitae that was developed to provide faster diagnosis for young children with epilepsy in many regions around the world. Participants in the Behind the Seizure program are diagnosed one to two years sooner than reported averages.
The first phase of our research collaboration is focused on a whole genome sequencing study of a large pediatric epilepsy patient cohort derived from the Behind the Seizure program. HiFi sequencing will be performed to generate comprehensive variant profiles used to investigate the genetic etiology of epilepsy. The research is intended to accelerate Invitae’s development of assays to help patients who have been unable to get a diagnosis with conventional short-read sequencing technologies and facilitate improved treatment options based on specific genetic targets.
In a statement announcing this news, Invitae Chief Medical Officer Robert Nussbaum said: “Through this research collaboration with PacBio, Invitae aims to develop innovative methods that will provide more accurate answers to individuals living with epilepsy and their healthcare providers.”
Our CEO Christian Henry added: “We are honored to partner with Invitae, a recognized leader in genetics, to co-develop methods that have the potential to support earlier genetic testing and intervention to aid treatment selection for millions of people living with epilepsy worldwide. Working with leading organizations such as Invitae is an important part of our strategy to accelerate the use of our highly accurate long-read sequencing platform in large-scale whole genome sequencing initiatives.”
Assembly and binning of metagenome data are the first steps in many metagenomics analysis pipelines, and with good reason. Metagenome-assembled genomes (MAGs) and circularized MAGs (CMAGs) allow recovery of complete genes and operons, thereby improving predictions of metabolic capacities. MAGs also provide information about gene synteny and enable better taxonomic profiling. However, as discussed in a recent review by Chen et al., draft MAGs with poor completeness or high contamination can lead to incorrect conclusions.
One way to improve assembly completeness and contiguity is to use long-read sequencing. However, not all long reads are the same. Did you know that once read lengths are longer than most of the repeats in a genome or metagenome, incremental gains in raw read accuracy improve assemblies faster than higher coverage or even large gains in read length?
One of the main hurdles in metagenome assembly is the presence of multiple closely related strains and species in the same sample, which leads to tangled assembly graphs. While long reads are helpful in resolving these, if the difference between two bacterial species (often defined as 3%) is less than the raw error rate of your sequencing data, overlap assembly remains problematic. This is because with noisy long reads, assembly is typically preceded by an error-correction step where the raw reads are mapped against each other to produce high accuracy consensus reads.
However, with metagenome data, this has the side-effect of collapsing and averaging reads that may actually be derived from different species. The ability to distinguish reads from closely related species or strains can be effectively erased during this first step, and the purity of the resulting contigs, the completeness of the MAGs, and the total size of the metagenome assembly can all be compromised. Read on for a detailed discussion and examples of how differences in read quality impact MAG assembly.
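To make the comparison between raw error rate and species divergence concrete, here is a rough back-of-the-envelope sketch in Python. It assumes errors are independent and evenly spread along each read, and it ignores the small chance that two reads carry the same error at the same position. The function name and this simplified model are illustrative assumptions, not part of any assembler.

```python
def same_genome_mismatch(read_accuracy):
    """Approximate fraction of aligned positions at which two raw reads from
    the SAME genome disagree, assuming independent, evenly spread errors."""
    return 1.0 - read_accuracy ** 2


SPECIES_DIVERGENCE = 0.03  # the ~3% species threshold mentioned above

for accuracy in (0.875, 0.975, 0.999):
    noise = same_genome_mismatch(accuracy)
    verdict = "exceeds" if noise > SPECIES_DIVERGENCE else "is below"
    print(f"accuracy {accuracy:.1%}: read-to-read noise {noise:.2%} {verdict} 3% divergence")
```

Under these assumptions, read-to-read noise only drops below the 3% species threshold once per-read accuracy is well above 98%, which is in line with the argument above.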
Higher read accuracy drives assembly quality
To understand how incremental changes in accuracy and differences in coverage affect metagenome assembly quality, we generated model metagenomics datasets with community member abundances that reflect a real fecal microbiome, drawing on references from Zou, et al. and the ‘Badread’ long read simulator (Wick, 2019). Noisy long reads were simulated from 160 microbial reference genomes with accuracy modes between 87.5% and 97.5%, and HiFi reads were modeled using a typical accuracy distribution (>99%) for 8 kb to 10 kb reads, an insert size commonly achievable for long read metagenome sequencing. The number of bases in each dataset was modeled after a conservative Sequel II System yield of HiFi data from a metagenomics run (~20 Gb) and reported ONT PromethION outputs (60 Gb) (Shafin, 2020). The resulting model datasets were assembled with Canu 2.0, using the recommended parameters for ONT and HiFi datatypes.
With Canu, it is possible to trace which reads were used to generate each contig in the assembly, and we used this capability to calculate the purity of each contig. Specifically, we determined what fraction of reads did not originate from the reference genome that contributed the majority of reads used to assemble that contig.
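The purity calculation described above can be sketched as follows. This is a minimal illustration, assuming the reads for one contig have already been reduced to a list of source-genome labels; that input format and the function name are hypothetical, and the sketch does not parse Canu's actual output files.

```python
from collections import Counter


def contig_impurity(read_sources):
    """Return (majority_genome, impurity) for one contig.

    read_sources: list of reference-genome labels, one per read used to
    build the contig (hypothetical input format).
    impurity: fraction of reads that did NOT come from the majority genome.
    """
    if not read_sources:
        raise ValueError("contig has no reads")
    counts = Counter(read_sources)
    majority_genome, majority_count = counts.most_common(1)[0]
    return majority_genome, 1 - majority_count / len(read_sources)


# Three reads from genome A and one from genome B: 25% of reads are foreign
print(contig_impurity(["A", "A", "A", "B"]))
```

A contig built entirely from one genome has an impurity of 0; higher values indicate that reads from other species were merged into the same contig.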
As shown in Figure 2, there are limited gains in contig purity even as accuracy changes from 85% to 97.5%. However, there is a sharp transition in contig purity when read accuracy surpasses 99%, exceeding the inter-species similarity commonly seen in a complex fecal community.
High-error reads compromise the assembly of low abundance species
Another challenge with using self-error correction ahead of metagenome assembly relates to the uneven proportion of different species in the data. Error correction typically requires ~30-fold coverage to be effective. However, in metagenomes, it is common for species to be present at a wide range of relative abundances. This means that even when there is enough coverage of highly abundant species for error correction, reads from lower abundance species may fail the initial error correction step and be omitted from the assembly. In the example of our model data set, even with three times more raw data, the 87.5% accuracy mode dataset assembles to less than half of the expected assembly size, with contigs that are significantly shorter than with more accurate reads. When the data accuracy surpasses the threshold of microbial interspecies differences, contiguity and assembly size leap dramatically despite lower sample coverage.
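The coverage argument above comes down to simple arithmetic, sketched here. The community below is made up for illustration (three members of a larger community, with abundance expressed as the fraction of sequenced bases), and the 30-fold cutoff is the approximate figure quoted above rather than a fixed property of any tool.

```python
def expected_coverage(total_bases, base_fraction, genome_size):
    """Expected fold-coverage of one species in a metagenome run."""
    return total_bases * base_fraction / genome_size


def species_below_threshold(total_bases, community, min_coverage=30.0):
    """community maps species name -> (fraction of sequenced bases, genome size in bp).
    Returns the species whose expected coverage is too low for self-correction."""
    return sorted(
        name
        for name, (fraction, size) in community.items()
        if expected_coverage(total_bases, fraction, size) < min_coverage
    )


# Hypothetical members of a community sequenced to 60 Gb of raw data
community = {
    "abundant_sp": (0.20, 5_000_000),   # about 2400-fold
    "moderate_sp": (0.01, 4_000_000),   # about 150-fold
    "rare_sp": (0.001, 5_000_000),      # about 12-fold
}
print(species_below_threshold(60e9, community))
```

Even with 60 Gb of data, the rare species in this example falls short of the roughly 30-fold coverage needed for error correction, so its reads would be at risk of being dropped before assembly.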
An example of how this limitation plays out in a real-world sample can be seen in a cow rumen assembly that used self-corrected PacBio CLR reads with ~89% median accuracy (Bickhart, 2019). While the PacBio CLR assembly had higher contiguity than the Illumina assembly despite a 3-fold lower depth of sequencing, the Illumina assembly had superior completeness.
Closer inspection of the PacBio CLR data revealed that “the correction step removed 10% of the total reads for being singleton observations (zero overlaps with any other read) and trimmed the ends of 26% of the reads for having fewer than 2 overlaps.” The authors further noted that “this may have also impacted the assembly of low abundance or highly complex genomes in the sample by removing rare observations of DNA sequence”.
In contrast, since HiFi reads do not need error correction, all the data, including observations from low abundance species, can be used in the assembly step. Accordingly, a more recent assembly of a sheep fecal sample that used HiFi data had significantly improved performance. In his SMRT Leiden talk, Derek Bickhart noted that while cow rumen and sheep fecal samples are different communities and therefore their assemblies are not an “apples to apples” comparison, the sheep fecal assembly, done with HiFi data, appears to have a significantly improved representation of low abundance species as gauged by the proportion of same-sample short read data that maps to the long read assembly.
One possible method for overcoming the long-read coverage bottleneck is to use short read data for error correction. However, this approach suffers from the same factors that limit short read metagenome assembly. Namely, short read data has GC bias and cannot be mapped uniquely to repetitive regions. Given that bacterial genomes can range from 13-75% GC, error correcting low accuracy long reads from all the species in a metagenome sample with short read data can be problematic.
The power of HiFi reads
With the unique combination of high accuracy and long read length, HiFi data shows promise for overcoming some of the longstanding challenges in metagenome assembly. Unlike noisy long reads, assembly of HiFi reads is unencumbered by an error correction step that can erase the variation needed to correctly assemble closely related species in complex communities and generate high quality MAGs and CMAGs. HiFi reads also show potential for improving the representation and contiguity of low abundance species in metagenome assemblies.
HiFi data has already been making waves in the world of large genome assembly, first at PAG XXVIII in January 2020 and more recently at the precisionFDA Truth Challenge V2, which evaluated methods for variant calling in human genomes. We are excited to see what HiFi data will do for metagenome assembly as more researchers become aware of its potential.
Chen L-X, et al. (2020) Accurate and complete genomes from metagenomes. Genome Research 30:1-19.
Bickhart, D., et al. (2019) Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biology 20:153.
Wick RR. (2019) Badread: simulation of error-prone long reads. Journal of Open Source Software. 4(36):1316.
Shafin, K., Pesout, T., Lorig-Roach, R. et al. (2020) Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol.
Zou, Y., Xue, W., Luo, G. et al. (2019) 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 37, 179–185.
Kids have lots of questions. But even the world’s top scientists don’t have all the answers — especially when it comes to rare genetic disorders afflicting children.
Our HiFi reads, highly accurate long reads, generated by our Sequel II and new Sequel IIe Systems, are helping researchers uncover disease-causing genetic variants that had previously gone undetected by other technology, contributing to increased solve rates for rare diseases.
We’re particularly excited to see this technology applied to translational research in children. We will be collaborating with Children’s Mercy Kansas City as part of its Genomic Answers for Kids (GA4K) program, which aims to collect genomic data and health information for 30,000 children and their families over the next seven years, ultimately creating a database of nearly 100,000 genomes.
“We are delighted to be collaborating with the innovative scientists at PacBio as we bring their long-read sequencing data to bear on some of our most difficult cases of rare pediatric disease to give patients and families the answers they deserve,” said Tomi Pastinen, director of the Center for Pediatric Genomic Medicine at Children’s Mercy.
It is estimated that as many as 25 million Americans — approximately 1 in 13 people — are affected by a rare condition. Whole-genome and whole-exome sequencing is often employed to try to diagnose these conditions, but often this involves short-read sequencing, and causes are found in only ~25% to 50% of cases — leaving the majority of cases unsolved.
Hoping to overcome these odds, Children’s Mercy has recently invested in Sequel II Systems, with plans to use our Single Molecule, Real-Time (SMRT) Sequencing technology to generate HiFi reads to detect what the short-read methods might have missed. Early results are encouraging, and have already demonstrated increases in pathogenic variant and disease-gene discovery beyond what was possible with short-read methods.
The researchers will also be working with the Microsoft Genomics team to build Microsoft Azure cloud-based analysis solutions and a data repository for this unique dataset.
“The diagnosis journey for a child with a rare disease and their families can be long and often inconclusive. We believe the advancement of precision medicine with specialized technologies will be key to gaining a better understanding and early diagnosis of these debilitating and deadly diseases,” said Gregory Moore, corporate vice president, Microsoft Health.
We look forward to making a meaningful impact by increasing solve rates through this important partnership.
More information about how Children’s Mercy scientists are using HiFi sequencing will be presented in PacBio’s ancillary workshop Monday, October 26 from 1:00-2:00 pm ET during the American Society of Human Genetics (ASHG) Annual Meeting. Emily Farrow, Director of Laboratory Operations at the Genomic Medicine Center at Children’s Mercy, will give a talk entitled “Applications of Third Generation Sequencing in Unsolved Disease.” Free virtual event registration is available here.
See additional examples of the use of SMRT Sequencing in rare disease research and learn more about structural variant detection:
- Webinar: Increasing Solve Rates for Rare and Mendelian Diseases with Long-Read Sequencing
- The Pathologist: Solving Rare Disease with SMRT Sequencing
- A Rare Opportunity to Help Tackle Daughter’s Rare Disease
- Review: Long-Read Sequencing Helps Uncover Genetic Basis for Rare Disease
- SOLVE-RD Team Adopts PacBio Sequel II System to Solve Rare Diseases
- In Alabama, Scientists Use HiFi Data to Solve Rare Neurodevelopmental Disorders
As the world faces an unprecedented pandemic caused by a novel coronavirus, the scientific spotlight has shone brightly on infectious disease research. And although interest in Public Health England’s (PHE) Culture Collections is often focused on its historical cultures, its relevance in our modern world has never seemed sharper.
The National Collection of Pathogenic Viruses (NCPV) has been helping scientists from around the world address the current history-making infectious disease event. It is also anticipating future outbreaks, and building collections of pathogenic viruses to aid research into potential threats to human health.
“The question of which virus will be next to make the jump from relative obscurity to frontpage news is important to ask, but difficult to answer,” wrote PHE Lead Virologist Barry Atkinson in an NCPV blog post. “One group has shown a propensity to cause large outbreaks after decades of apparent inactivity or low-level circulation – the arthropod-borne viruses (arboviruses).”
While viruses have dominated headlines in 2020, bacteria have also driven epidemics and outbreaks, whether through community spread, in hospitals, or in our food and water systems. In fact, the majority of the microorganisms in PHE’s culture collections are bacteria, dwarfing the number of viruses and fungi.
Currently celebrating its 100-year anniversary, the National Collection of Type Cultures (NCTC) remains highly relevant, recently releasing new antimicrobial resistance reference strains and resources.
The historical collection of more than 6,000 expertly preserved and authenticated bacterial cultures has been at the forefront of advances in the field, implementing the latest, greatest new technology in order to provide the best, most comprehensive resources for microbiology laboratories in a range of different sectors and in research institutes worldwide.
Most recently, when the NCTC decided to expand its collection to include genome sequencing information, PacBio Single Molecule, Real-Time (SMRT) Sequencing was the technology of choice. Starting in 2014 and in coordination with the Wellcome Sanger Institute, NCTC created reference quality genomes for 3,000 bacterial strains. Professor Julian Parkhill, who initiated the project while he was at the Wellcome Sanger Institute, stated, “If you’re trying to generate reference genomes that are going to be valuable to as many people as possible, with as much information in them as possible, then Pacific Biosciences has the edge in terms of generating more complete data.”
Released four years later, the collection includes several of the most important known drug-resistant bacteria, such as those that cause tuberculosis and gonorrhoea, and some varieties of historical significance, such as a dysentery-causing Shigella flexneri isolated in 1915 from a soldier in the trenches of World War I, and a sample from the nose of penicillin discoverer Alexander Fleming.
More than 60% of NCTC’s historic collection now has a closed, finished reference genome, assembled from PacBio sequencing.
“If NCTC is to continue to supply relevant authentic bacteria for use in scientific studies, then the quality of our own characterization and authentication data must be outstanding,” said Julie E. Russell (@Julieru13), Head of Culture Collections at Public Health England.
“Combining sequences, strain metadata and links to other resources in the public domain will ensure that this e-resource provides a unique comprehensive source of data to underpin microbial research, and improve the provision of diagnostics and public health interventions for medically important bacteria and viruses.”
Still soaring from the success of last year’s launch of the award-winning Sequel II System, we’re excited to announce the next evolution of the instrument: the Sequel IIe System.
This evolution includes increased computing power and advanced on-instrument data processing. This means the instrument can directly produce the widely coveted, highly accurate long reads, known as HiFi reads, that have made the original Sequel II System indispensable for many labs — and save users time and money in the process.
Just how much of an improvement does the new system represent? By completing all the primary data processing for HiFi reads on the instrument, the Sequel IIe System provides as much as a 90% reduction in file storage needs, and a 70% reduction in secondary analysis processing time.
Additionally, the release includes powerful new tools in SMRT Link v10.0 software to enable complete workflow integration on the AWS cloud, and a new Genome Assembly analysis application for generating reference-quality de novo assemblies from HiFi reads.
“HiFi reads allow the accurate and simultaneous detection of single nucleotide and structural variants, paving the way for advancements in human genetics and greatly expanding the utility of SMRT Sequencing,” said Fritz Sedlazeck (@sedlazeck), Assistant Professor, Human Genome Sequencing Center at Baylor College of Medicine.
“Generating HiFi reads directly on the Sequel IIe System now has the potential to further accelerate cost-effective access to this information-rich sequencing data.”
HiFi sequencing has provided important data for a number of high-profile global research projects, including the Telomere-to-Telomere Consortium, Darwin Tree of Life, the Human Pangenome Reference Consortium, and the Solve-RD Project, among others. The precisionFDA Truth Challenge V2 evaluated methods for variant calling in human genomes and highlighted how approaches that use HiFi reads delivered the highest precision and recall in all categories including genome-wide, specifically in difficult-to-map regions, and in the major histocompatibility complex.
Our CEO, Christian Henry, noted: “The new Sequel IIe System represents the next advancement in our technology, and makes HiFi sequencing accessible to any project where high accuracy, long read lengths, and affordability matter.”
See how HiFi sequencing combines the best aspects of short reads and long reads into a single easy-to-use technology.
Want to learn more? Attend our workshop featuring the Sequel IIe System and HiFi sequencing applications for human biomedical research on Monday, October 26, from 10-11 a.m. PDT or visit the product page.
Want to discuss the benefits of HiFi sequencing and the Sequel IIe System for your research? Connect with a PacBio Scientist.
Nearly gapless, reference-quality chromosome-level assemblies — in less than a day? Yes, it’s possible, thanks to the high accuracy and low computational needs of PacBio HiFi reads.
Kevin Fengler, computational genomics lead at Corteva Agriscience, welcomed watchers to the brave new world of the pangenome during the recent webinar, “Beyond a Single Reference Genome – The Advantages of Sequencing Multiple Individuals.”
We are now living in an era where you can generate a reference genome assembly that’s specific for each application or trait of interest, Fengler said.
“Often we’re interested in getting the sequence of a single disease resistance gene or the sequence of a particular QTL, and we’ll do a whole genome just for that,” Fengler said. “It may seem like overkill, but we have found that the best approach — the fastest, easiest, simplest, most cost effective way to do that — is just to generate whole genome reference quality assemblies.”
Fengler cited several benefits of HiFi reads that have made this possible. Foremost among them were lower computational demands, and high accuracy with a low error rate, even with relatively “short” reads.
“I used to be a long-read junkie, always trying to get 50 kb, 60 kb reads. But with reads that are only 15 kb in length, we’re now able to achieve these highly contiguous, highly accurate assemblies,” Fengler said. “The HiFi reads are so accurate that you don’t need to do any additional polishing with Illumina, or even additional polishing with PacBio, which used to be a step.”
In many cases, Fengler said he has been able to assemble through the centromere.
“This is another cool thing that’s developed with HiFi. With our previous CLR assemblies, we never would have assembled through the centromere for plants, but now we are able to get a single scaffold per genome.”
Fengler emphasized that pangenome assemblies need to be robust, “because misassembly is not SV, and sequencing error is not variation.”
He shared several examples of crops that have received the pangenome treatment, including cotton, which, he noted, “would not be considered, historically, an easy genome to assemble by any means.”
“But here we’re getting single contigs in most cases for most of the chromosomes,” he said. “This is what you really need. This is the goal, this is what we’re trying to achieve.”
He also discussed two of the tools he uses to analyze the sequence diversity between the genomes and make it actionable, TagDots and PANDA.
Watch Fengler’s full presentation:
Crossing a continental crow divide
Pangenome collections are not only valuable for comparing commercial crop breeds for certain traits; they can also help answer questions about the evolution and population dynamics of non-model species.
Matthias H. Weissensteiner (@MWeissensteiner), a postdoc at Pennsylvania State University, discussed his work studying structural variation among several songbird species in the genus Corvus, some of which were included in his recently published Nature Communications paper.
About 60 species of the genus display the typical all-black crow plumage pattern, but there are also a few black-and-grey and black-and-white forms. In Europe, there is a ‘crow divide,’ with all-black crows in the west, black-and-grey crows in the east, and a narrow hybrid zone in between.
“They look like two species, they behave like two species, but when we looked at the genetic differentiation based on single nucleotide changes, we found out that they are actually genetically more or less the same,” Weissensteiner said. “Only 83 nucleotides out of a genome of 1.3 billion base pairs are fixed, meaning that there are only about 80 differences which are diagnostic for these two crow populations.”
In order to uncover the secrets of their speciation, Weissensteiner and colleagues sequenced 33 crows and created a dataset comprising the full phylogenetic range of the genus.
For Weissensteiner, the value of long reads was clear: By enabling him to anchor his reads completely to the reference, he could more confidently capture the correct sequence and identify insertions, deletions, inversions and other variations.
“We combined different types of sequencing and mapping technologies and found that long-read sequencing in particular is able to reveal a stunning amount of genetic variation.”
Having assemblies from across the entire genus made filtering the data a bit easier, as well, Weissensteiner said. Because the researchers had large phylogenetic distances within their data, they were able to remove variants that seemed to be segregating across the clades.
“If you have a variant that is polymorphic within the crow clade and polymorphic within the jackdaw clade, it’s likely to be an error because over these large phylogenetic distances, there should not be any segregating variation,” he explained.
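The filtering logic Weissensteiner describes can be sketched in a few lines. This is only an illustration of the idea: the data structures, sample names, and function names below are hypothetical and do not reflect the actual pipeline used in the study.

```python
def is_polymorphic(alleles):
    """True if more than one distinct allele is observed."""
    return len(set(alleles)) > 1


def likely_artifact(genotypes, clades):
    """Flag a variant that is polymorphic within two or more clades.

    genotypes: dict mapping sample name -> observed allele at this variant
    clades: dict mapping clade name -> list of sample names
    (both are hypothetical input formats)
    """
    polymorphic_clades = [
        name
        for name, samples in clades.items()
        if is_polymorphic([genotypes[s] for s in samples])
    ]
    return len(polymorphic_clades) >= 2


clades = {"crow": ["c1", "c2", "c3"], "jackdaw": ["j1", "j2"]}

# Segregating in both clades: treated as a probable error
print(likely_artifact({"c1": "ins", "c2": "ref", "c3": "ins", "j1": "ref", "j2": "ins"}, clades))
# Segregating only within crows: kept as a candidate variant
print(likely_artifact({"c1": "ins", "c2": "ref", "c3": "ins", "j1": "ref", "j2": "ref"}, clades))
```

The first call flags the variant because both clades show more than one allele; the second keeps it because the jackdaw samples all agree.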
Once he had a reliable set of variants, Weissensteiner looked for causal mutations for plumage differentiation and identified the most promising candidate: a 2.25 kb LTR retrotransposon insertion located 20 kb upstream of the NDP gene.
Watch Weissensteiner’s full presentation:
Watch the entire webinar, including an introduction to HiFi reads by PacBio Sequencing Application Specialist Kristin Mars:
See additional examples of the use of SMRT Sequencing for the generation of pangenomes:
- Pangenome of Soybean Generated to Capture Genomic Diversity
- Project to Rapidly Sequence Maize Pangenome Delivers Publicly Available Resource
- Sequencing 101: Looking Beyond the Single Reference Genome to a Pangenome for Every Species
- Case Study: Pioneering a Pan-Genome Reference Collection
- Video: Dawn of the Crop Pangenome Era
Even in the field of genomics where new breakthroughs occur every few months, completion of the first-ever fully sequenced human autosome is a momentous achievement. Highly accurate, no gaps, no mis-joins — just chromosome 8 in all its glory. It’s a remarkable feat and we are honored that PacBio HiFi reads played a pivotal role in helping to achieve it.
This work is described in a preprint recently posted to bioRxiv from lead author Glennis Logsdon (@glennis_logsdon), senior author Evan Eichler, and their collaborators in the Telomere-to-Telomere (T2T) Consortium. It is part of the broader T2T initiative to sequence and assemble the first truly complete human genome and follows the earlier release of the fully sequenced X chromosome.
“Since the announcement of the sequencing of the human genome 20 years ago, human chromosomes have remained unfinished due to large regions of highly identical repeats located within centromeres, segmental duplication, and the acrocentric short arms of chromosomes,” the authors note. “The advent of long-read sequencing technologies and associated algorithms have now made it possible to systematically assemble these regions from native DNA for the first time.”
Chromosome 8 made an attractive target for the T2T’s first autosome due to its manageable centromere (previously estimated at 1.5 Mb to 2.2 Mb long). But the chromosome is also home to “one of the most structurally dynamic regions in the human genome—the β-defensin gene cluster located at 8p23.1—as well as a neocentromere located at 8q21.2, which have been largely unresolved for the last 20 years,” the scientists write. The β-defensin cluster plays a key role in innate immunity and structural variation in this region has long been implicated in human disease.
The new assembly, which addresses all five of the previously intractable gaps in the human reference genome, was built with a clever method using several data sets, including accurate long reads: “More than half of the PacBio HiFi data is contained in reads greater than 17.8 kbp, with a median accuracy exceeding 99.9%.” After a scaffolding step based on Oxford Nanopore reads, contigs assembled from PacBio HiFi reads were swapped in to provide the base-pair resolution. “We improved the base-pair accuracy of the sequence scaffolds by replacing the raw ONT sequence with several concordant PacBio HiFi contigs,” the team reports.
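The statement that more than half of the HiFi data sits in reads longer than 17.8 kbp describes what is commonly called a read N50. A minimal sketch of that calculation follows; the read lengths are toy values chosen for illustration, not data from the preprint.

```python
def read_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    # Walk from the longest read down until half of the bases are covered
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length


# Toy example: 32 bases in total, and the two 8-base reads already hold 16
print(read_n50([2, 2, 2, 3, 3, 4, 8, 8]))
```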
The complete chr8 sequence clocks in at 146 Mb and includes more than 3 Mb missing from GRCh38. As Logsdon et al. write, “The result is a whole-chromosome assembly with an estimated base-pair accuracy exceeding 99.99%.”
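Accuracy figures like these are often restated on the Phred scale, where the quality value is QV = -10 × log10(error rate). The short sketch below does that conversion; the function name is ours, and the inputs are simply the percentages quoted above.

```python
import math


def accuracy_to_qv(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred-scaled QV."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10 * math.log10(1 - accuracy)


# 99.9% (median HiFi read accuracy) and 99.99% (estimated assembly accuracy)
for accuracy in (0.999, 0.9999):
    print(f"{accuracy:.2%} accuracy is about QV {accuracy_to_qv(accuracy):.1f}")
```

On this scale, an assembly accuracy exceeding 99.99% corresponds to better than QV 40, or fewer than one error per 10,000 bases.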
The scientists also tackled that persnickety β-defensin gene cluster, “which we resolved into a single 7.06 Mbp locus—substantially larger than the 4.56 Mbp region in the current human reference genome,” they note. Nearly all of that sequence data — 99.9934% of it, to be precise — came from HiFi reads. The complete centromere, meanwhile, accounted for 2.08 Mb.
With this beautiful assembly in hand, the T2T team took it out for a spin. First, they validated it with a host of orthogonal tools, such as optical mapping. Next, they generated HiFi data for the chromosome 8 orthologs in chimpanzee, macaque, and orangutan to compare the sequence data and reconstruct the evolutionary history of the human autosome. “Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved specifically in the great ape ancestor, and the centromeric region evolved with a layered symmetry,” the team writes. “We estimate that the mutation rate of centromeric satellite DNA is accelerated at least 2.2-fold, and this acceleration extends beyond the higher-order α-satellite into the flanking sequence.”
Finally, the researchers performed an analysis of full-length transcripts produced with the Iso-Seq method. That process identified “61 protein-coding and 33 noncoding loci that map better to this finished chromosome 8 sequence than to GRCh38, including the discovery of novel genes mapping to copy number polymorphic regions,” they report. Twelve of these new genes were uncovered in that tricky β-defensin locus alone.
For so many of us in the genomics community, this paper represents far more than the sequence of a single human chromosome. It’s a statement about what science can accomplish now, and where that may lead us in the years to come. As the authors summarized: “Now that complex regions such as these can be sequenced and assembled, it will be important to extend these analyses to other centromeres, multiple individuals, and additional species to understand their full impact with respect to genetic variation and evolution.”
You can hear more details from Logsdon directly at a free online conference co-hosted by the T2T Consortium and Human Pangenome Reference Consortium (HPRC) on September 22/23. Speakers will offer new insights on chromosome 8 and report on further T2T progress towards a complete human genome assembly. At the same event, the HPRC will present its complementary effort to sequence hundreds of human genomes to high quality. Presenters include: Karen Miga (@khmiga), Eric Green (@NHGRI_Director), Adam Phillippy (@aphillippy), Sergey Koren (@sergekoren), Sergey Nurk (@sergeynurk), Valerie Schneider (@dnadiver), Tina Graves-Lindsay, Arang Rhie (@ArangRhie), Mitchell R. Vollger (@mrvollger), Erich Jarvis (@erichjarvis), Mark Chaisson (@mjpchaisson), Mike Schatz (@mike_schatz), Heng Li (@lh3lh3) and many more. We’ll be glued to our computers for it and we hope you’ll have a chance to join as well!
Analysis of 16S ribosomal RNA has been used for phylogenetics and identifying prokaryotes for decades. But just as scientists have had to refine the Linnaean taxonomy system based on genomic discoveries, improvements in sequencing technology are changing 16S analysis best practices.
Researchers at the Joint Genome Institute, for instance, conducted a detailed benchmarking study and found that traditional methods of 16S analysis — which look at just a piece of the gene — are less accurate than analysis based on full-length sequencing of the entire V1-V9 16S gene. The full-length 16S advantage was clear in their study of the metagenome of a meromictic lake, where partial 16S sequencing led to incorrect matches or failed to differentiate the phylogenies present at different lake depths. The scientists wrote, “A resurgence of [full-length] sequences used as ‘gold standards’ has the potential to yet again transform microbial community studies, increasing the accuracy of taxonomic assignments for known and novel branches in the tree of life on previously unobtainable scales.”
Similarly, in a recent webinar, George Weinstock (@geowei) at the Jackson Laboratory for Genomic Medicine noted that only full-length 16S sequencing can resolve all the bacterial clades commonly found in the human gut microbiome down to the species level. He said, “With V1-V9, almost all [sequences] could be accurately identified to the species level. With V4, more than half could not be identified at the species level, so you are sort of locked into the genus level or higher… The full-length sequences are definitely the gold standard, there is no question about it.”
Why does species resolution matter? Weinstock later explained using the example of 16S data from healthy stool donors. “Even though at the genus level there is a certain frequency of Bacteroides species for these samples … to do some statistical analysis based on their Bacteroides, you are going to miss a lot of important information, because the species are quite different.”
That’s where PacBio HiFi reads come in. By sequencing around and around the same molecule, HiFi sequencing produces long, highly accurate, single molecule consensus reads. At a microbiome meeting held last year at Cold Spring Harbor Laboratory, our own Meredith Ashby (@AshbyMere) presented a poster showing how HiFi reads provide both accurate and complete results for full-length 16S sequencing.
To make 16S HiFi sequencing easily accessible to PacBio users, we have developed a new and improved one-step PCR protocol and worked with several DNA sequencing service providers to add the application to their menu of services, for as little as $50 per sample. The new protocol reliably delivers more than 30,000 reads per barcode at 96-plex in a single Sequel II System run.
Here’s some information about a few of the service providers who worked with us to validate the new protocol:
DNA Services Lab, University of Illinois at Urbana-Champaign
Scientists Mark Band and Alvaro Hernandez are part of a team with extensive experience processing sample types from customers all over the world, from those that have minimal amounts of DNA to those that have strong PCR inhibitors, such as samples from corals, lakes and soils. They have 192 barcodes for generation of full-length 16S amplicons. Typical turnaround time from sample receipt to data delivery is just two weeks.
Maryland Genomics, University of Maryland School of Medicine
Part of the Institute for Genome Sciences, Maryland Genomics is led by scientists who were part of the earliest genomic efforts and who pioneered the field of metagenomics. They have expertise in working with challenging samples, particularly for microbiome studies or metagenomic applications, and operate a dedicated Microbiome Services Laboratory that provides complete sample-to-results services for full-length 16S profiling.
Biomarker Technologies, Beijing
According to CTO Liu Min, Biomarker is particularly experienced with soil and water samples for 16S sequencing, among many other types. In addition, Biomarker uses an in-house developed concatenation step to link together multiple 16S full-length amplicons before library creation, allowing them to increase throughput and reduce sequencing costs by multiplexing as many as 700 samples per SMRT Cell 8M. A current promotion offers customers 5,000 HiFi full-length 16S reads per sample for as little as $40. Learn more (Chinese language) here and here.
Finally, for customers who prefer an all-in-one 16S solution, Shoreline Biome offers V1-V9 and StrainID solutions that include DNA extraction, amplification, and analysis. The University of Delaware Sequencing and Genotyping Center uses Shoreline Biome technology for 16S sequencing, particularly for clinical projects such as studies of microbial communities in medical settings.
Lab director Bruce Kingham tells us, “Using PacBio long reads to resolve the 16S gene is relatively new, and has been disruptive to the field of ribotyping. The layers of genetic detail that we can elucidate from full-length 16S data could not be grasped until it was performed at the current scale, and its use continues to grow at a rapid pace.”
Mark Driscoll, the CSO of Shoreline Biome, adds, “Bruce’s group recognized early that the additional resolution offered by Shoreline Biome’s 2500 bp StrainID amplicon is a powerful multiplier for researchers seeking strain-level resolution beyond what is possible with the 16S gene alone. Near-perfect, contiguous HiFi reads of StrainID amplicons covering the 16S, 23S, and variable spacer between the genes enable longitudinal tracking of strains in complex fecal microbiomes in humans and model organisms such as the mouse. Researchers seeking single clone 16S sequences from their archived strain banks have been able to pack hundreds of strains in a single run.”
Ready to start planning your full-length 16S experiment? Connect with a PacBio scientist.
It’s one of the questions we hear most often from scientists working with small organisms: Is it possible to generate truly high-quality, long-read data from minuscule amounts of DNA? With our new kit for ultra-low DNA input projects, the answer is: Absolutely!
The new workflow dramatically reduces the requirements for DNA quantity. Now, scientists need only 5 ng of genomic DNA to kick off a SMRT Sequencing project; that’s less than 2% of the starting amount needed for our current low DNA input protocol. This opens up access to HiFi sequencing for researchers studying the tiny arthropod species that comprise much of the diversity of the tree of life. In addition, the new protocol enables comprehensive variant detection in input-limited human samples such as needle biopsies.
Ultra-low DNA input sample preparation relies on the SMRTbell gDNA Sample Amplification Kit (PN: 101-980-000), which uses PCR amplification to help users get enough material for sequencing. The kit contains enough reagents to process up to 18 samples and can be used for de novo genome assembly of arthropods with genomes no larger than 500 Mb or for human variant calling. Of course, if your sample quantity is not limited (> 5 μg of DNA is available), we encourage you to follow the standard HiFi protocol for best results.
To put it to the test, we used the ultra-low DNA input kit to sequence the sand fly (Phlebotomus papatasi), starting with just 5 ng of DNA from a single insect, which we sequenced on our Sequel II System to 55-fold coverage. We generated nearly 2 million HiFi reads with accuracy of at least Q20, producing nearly 25 Gb of HiFi reads from one SMRT Cell 8M. Mean read length was 12 kb, and mean read quality was 99.97%, or Q36. This tiny insect had a genome size of 363 Mb, and our assembly featured a contig N50 of more than a megabase.
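For readers who want to move between the Q-values and percentage accuracies quoted above, here is a minimal sketch of the standard Phred relationship, Q = -10 * log10(1 - accuracy). The function names are our own and purely illustrative; they are not part of any PacBio tool.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score Q to a per-base accuracy (0..1)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (strictly between 0 and 1) to a Phred score."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Q20, the minimum HiFi read quality mentioned above
    print(f"Q20 -> {phred_to_accuracy(20):.2%} accuracy")
    # The mean read accuracy quoted above, taken as exactly 99.97%
    print(f"99.97% -> Q{accuracy_to_phred(0.9997):.1f}")
```

Under this formula Q20 corresponds to 99% accuracy. An accuracy of exactly 99.97% works out to a little above Q35, so the Q36 figure quoted above presumably reflects a mean accuracy that was rounded for reporting.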
PacBio customers have been utilizing the new protocol as well. At the Max Planck Institute, scientists performed ultra-low DNA input sequencing for Phyllotreta armoraciae, the horseradish flea beetle.
In addition, in their new preprint, the Max Planck researchers and collaborators used the ultra-low DNA input kit to sequence two species of springtail (Collembola). In the preprint, the authors stated, “Our study shows that it is possible to obtain high quality genomes from small, field-preserved sub-millimeter metazoans, thus making their vast diversity accessible to the fields of genomics.” To hear more details about research using the ultra-low DNA input kit, watch the on-demand webinar.
To get started using the ultra-low DNA input protocol for your next project, review the protocol, connect with a scientist at PacBio to discuss your research, or visit our Product and Services Page to purchase your consumables.
With PacBio HiFi sequencing data now readily available for organisms of any size, many exciting results have been published featuring new de novo assembly methods optimized for highly accurate long reads. These methods have produced assemblies for a variety of organisms at quality levels never before thought possible — as measured by completeness, contiguity and correctness. We feel privileged to collaborate with the scientific community on the development of these tools.
From Small to Tall
When the USDA wanted to rapidly assemble the genome of the Asian giant hornet as part of its real-time invasive species response initiative, they turned to a tool developed by our research and development team. Improved Phased Assembly (IPA), developed at PacBio by Ivan Sović (@IvanSovic) and Zev Kronenberg (@zevkronenberg), is an assembler that delivers highly accurate, contiguous, and phased assemblies at very high speeds.
Another new assembler called hifiasm proved useful when a PacBio team wanted to sequence the genome of the tallest living organism on earth: a California Coastal Redwood. PacBio’s own Greg Concepcion (@phototrophic) and colleagues were able to assemble a 48.5 Gb redwood genome in just 6 days with 33-fold HiFi read coverage using the method, which was developed in Heng Li’s (@lh3lh3) lab at Harvard. As described in this preprint, co-authored by Concepcion, the method uses HiFi reads to produce haplotype-resolved de novo assemblies with phased assembly graphs.
Tackling the Tough Stuff
Not to be outdone, a new assembler from Sergey Koren (@sergekoren) and Adam Phillippy’s (@aphillippy) team at the National Human Genome Research Institute (NHGRI), HiCanu, demonstrates how it is now possible to sequence through even the most challenging regions of a human genome. PacBio’s own Rob Grothe is a co-author on the recently published work, which was also featured in a previous blog post.
To hear more about the NHGRI’s work on Human Pangenome and Telomere-to-Telomere assemblies, check out the upcoming workshop on September 21-23, 2020: “T2T / HPRC Towards a Complete Reference of Human Genome Diversity.”
Expanding What’s Possible
We are pleased by the possibilities of combining HiFi reads with these new rapid and high-quality genome assembly tools. Not only are they allowing us to tackle new organisms and reach new regions, they are enabling scientists to change the reference genome paradigm to create stand-alone de novo assemblies and pangenome collections with unparalleled speed and ease.
To learn more, join us on September 16 for our webinar: “Beyond a Single Reference Genome – The Advantages of Sequencing Multiple Individuals”. Or contact a PacBio Scientist to find out how HiFi reads can benefit your next research project.
Meet the ‘happy toad’ Atelopus laetissimus, a harlequin toad found on the slopes of the Sierra Nevada de Santa Marta mountains of Colombia. This toad is brightly colored, with an almost comical slow walk and lots of other unique attributes.
But the reason it is delighting scientists and conservationists is its ability to adapt and survive while its relatives are on the brink of extinction.
The 2020 Plant and Animal Sciences SMRT Grant Program co-sponsored by PacBio and the DNA Sequencing Center at Brigham Young University will enable a team of researchers from Colombia’s Universidad de los Andes and Universidad del Magdalena to delve deep into the A. laetissimus genome to discover the secrets of the toad’s evolutionary success.
The project garnered more than 6,000 votes as part of a video competition, thanks in large part to the science communication chops of its star. Biologist Carlos Guarnizo (@guarnitron) is the founder of Bogota’s Ciencia Cafe, which hosts in-person events as well as online networking and resources on Facebook and Twitter (@CienciaSumerce).
Guarnizo and the rest of his multidisciplinary team hope to gain insights into the basic biology of the colorful toad, as well as its evolution, speciation and adaptation.
Herpetologist and conservationist Vicky Flechas (@vickyflechas) is curious to find out what makes A. laetissimus resistant to the chytrid fungus that is killing so many other frogs and toads locally and globally. Does it have resistance genes, or perhaps anti-microbial skin peptides?
Herpetologist Beto Rueda (@herpetobeto), who lives among the toads in Santa Marta, wants to learn more about their vision, thermal biology, and some of their unique characteristics, such as their potential ability to sense their prey via vibrations, or the male’s ability to survive three months without food while clinging onto a female’s back in a long mating embrace. He is also eager to compare the A. laetissimus DNA and RNA sequences to other harlequin toad species from other altitudes within the unique habitat, which ranges from Caribbean beaches to peaks with permanent snow. This snow-capped mountain range, the Sierra Nevada de Santa Marta, has never been connected to other mountains such as the Andes, so it is host to a variety of unique, endemic species.
Evolutionary geneticist Andrew Crawford (@CrawfordAJ) is fascinated by another of the toad’s unique traits: it is nocturnal, while almost all other toads and frogs in the area are diurnal.
“The toad seems to have gone from nocturnal to diurnal and back to nocturnal,” Crawford said. “It could become a model system for nocturnality across frogs.”
Crawford, a council member of the Vertebrate Genomes Project, and chair of its Amphibians working group, is excited to be able to do the research using HiFi sequencing data.
“Frog genomes are a bit big, and a bit repetitive,” he said. “Some of the assemblies attempted in the past have been so fragmented. But the VGP and the Sanger Institute are getting to nice assemblies, and the foundation of that is PacBio HiFi data.”
Conservation of the species is the ultimate goal. The genus as a whole is endangered; of 97 species, 80% are critically endangered, and the status of many others is uncertain, as they have not been seen for 30 years or more.
Having a high-quality reference genome would also enable the team to do ‘museumomics’ — using stored samples to gain an understanding of species that have gone extinct or are hard to find.
“PacBio will be the salvation of frog genomes,” Crawford said.
We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the DNA Sequencing Center at Brigham Young University, for supporting the 2020 Plant and Animal Sciences SMRT Grant Program. Explore the 2020 SMRT Grant Programs to apply to have your project funded.
It has recently become apparent how important it is to sequence more than one individual to characterize the genomic variation within a species. This makes sense if you consider that sexually reproducing organisms are a mix of their parents and, therefore, not identical. This is just as true in crops as it is in humans. So, it’s not surprising that when a group of researchers from several institutions in China embarked on de novo genome assemblies of several accessions of wild and cultivated soybeans, they captured thousands of variants.
In what one reviewer described as “a landmark paper for genomics,” the Cell study describes 26 PacBio-based de novo assemblies selected from 2,898 globally collected soybean germplasms for their phylogenetic relationships and geographic distributions.
Soybeans have an astounding 60,000 accessions adapted to different ecoregions, and intergenomic comparisons have shown extensive genetic diversity between wild and cultivated soybeans, as well as among cultivated soybeans from different geographic areas.
Studies based on a previous soybean reference genome (Wm82, released in 2010) identified only single nucleotide polymorphisms (SNPs) and small indels, leaving the larger structural variants (SVs) largely overlooked. This limited the ability to capture the full landscape of genetic variation and to pinpoint causal variants in QTL cloning and genome-wide association studies, the authors stated.
Two other reference genomes were released in 2018 and 2019, one from “Zhonghuang 13” (ZH13), the most widely planted soybean cultivar in China, and another from a wild soybean. Comparisons between Wm82 and these two genomes further demonstrated that a considerable number of copy number variations (CNVs) and presence/absence variations (PAVs) exist in different accessions.
Through a comparative genome analysis of the 26 genomes, plus three previously reported genomes, Zhixi Tian’s team identified a total of 14,604,953 SNPs and 12,716,823 small insertions and deletions; 723,862 presence/absence variations; 27,531 copy number variations; 21,886 translocation events; and 3,120 inversion events.
“The high-quality genomes enabled the identification of numerous complex variations that cannot be detected by simply mapping the short reads to a single genome,” the authors wrote.
They said de novo construction of the pangenome using long-read technology was valuable in identifying larger structural variants: most of the PAVs were 1 kb to 2 kb long, translocations were concentrated between 10 kb and 30 kb, inversions mainly ranged from 100 kb to 200 kb, and CNV copy numbers ranged from 2 to more than 10, with 2 and 3 copies the most common.
“The large number of SVs from dozens of independently de novo assembled genomes enabled us to clarify clearer evolutionary processes that cannot be detected from one or a few genomes,” the authors wrote.
This will be key in the creation of new experimental populations for both functional studies and breeding, especially as genetic bottlenecks during soybean domestication and improvement have narrowed the genetic diversity among modern cultivars of this important crop, they added.
To overcome some of the limits of conventional linear references, which cannot show the different alleles present at each locus, the team represented the variation as a graph, creating the first reported graph-based plant genome.
“The graph-based genome offers a new platform to map short-read data to determine the genetic variations at the pan-genome level instead of a single genome and prevent erroneous variation calls around SVs,” they wrote.
Coupled with RNA sequencing data from individual accessions, the platform will make it possible to link SVs with gene expression, which could greatly promote gene discovery. The graph-based genome could also provide an opportunity to re-analyze and “rejuvenate” previous sequencing data to generate more comprehensive information than ever, they said.
This pangenome collection, along with the methods of generating a graph-based representation of the soybean pangenome, will serve the genomics community greatly as we embark on larger projects to characterize all variation within a species for better breeding programs and conservation efforts.
To learn more about pangenomes, register for our upcoming webinar on September 16: Beyond a Single Reference Genome – The Advantages of Sequencing Multiple Individuals.
See additional examples of the use of SMRT Sequencing for the generation of pangenomes:
- Project to Rapidly Sequence Maize Pangenome Delivers Publicly Available Resource
- Sequencing 101: Looking Beyond the Single Reference Genome to a Pangenome for Every Species
How do bacteria manipulate plant biology to cause blight and rot? Why are some pathogen strains more virulent than others? How can we engineer resistant staple food crops? These are pressing questions facing researchers looking to sustain and increase crop production against the backdrop of a changing environment.
For one major clade of pathogens, Xanthomonas spp., the answers lay locked within TAL effector genes (TALEs), but assembling these highly variable, repetitive regions was a long-standing obstacle. The key to finally unraveling the tangled assemblies was PacBio long-read sequencing.
Plant pathologist Adam J Bogdanove (@AdamBogdanove) and colleagues at Cornell University have been using SMRT Sequencing to elucidate the structure and function of TALEs and generate new insights into the mechanisms of Xanthomonas virulence.
“Repeats render TAL effector genes nearly impossible to assemble using next-generation short reads… long-read, single molecule real-time (SMRT) sequencing solves this problem,” Bogdanove wrote in one study.
Bogdanove was one of two researchers to break the TALE-DNA code in 2010 by searching for patterns in protein sequence alignments and the promoter sequences of genes upregulated by TALEs.
What are TALEs?
TALEs are proteins secreted by Xanthomonas bacteria when they infect various plant species (including pepper, rice, citrus, cotton, tomato, and soybeans), causing localized leaf spot and leaf streak, or systemic black rot and leaf blight disease.
How do they work?
TALEs bind promoter sequences in the host plant and activate the expression of plant genes that aid bacterial infection. They recognize plant DNA sequences through a central repeat domain consisting of a variable number of tandem ~34 amino acid repeats that vary at only 2 positions. In each repeat, the pair of amino acids at these two positions specifies one DNA base in the target promoter sequence. Bogdanove figured out that HD binds to C, NI to A, NH to G, etc.
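To make the one-repeat-per-base idea concrete, here is a minimal sketch that reads a string of repeat-variable diresidues (RVDs, the usual name for the two variable amino acids in each repeat) and predicts the DNA target. The lookup table deliberately contains only the three pairings named above, so it is incomplete, and the function names are hypothetical. This is not the pbx toolkit or any published predictor.

```python
# Partial RVD-to-base table: only the pairings named in the text above.
# The real code has more entries; anything missing here is reported as "N".
RVD_TO_BASE = {"HD": "C", "NI": "A", "NH": "G"}


def parse_rvds(text):
    """Split a dash- or whitespace-separated RVD string such as 'HD-NI-NH'."""
    return [token for token in text.replace("-", " ").split() if token]


def predict_target(rvds, unknown="N"):
    """Return the predicted DNA target, one base per repeat."""
    return "".join(RVD_TO_BASE.get(rvd.upper(), unknown) for rvd in rvds)


if __name__ == "__main__":
    # Hypothetical three-repeat effector, for illustration only
    print(predict_target(parse_rvds("HD-NI-NH")))
```

Reporting unknown RVDs as "N" rather than guessing keeps the sketch honest about what the table does not cover. Real target prediction also has to account for imperfect matches and promoter context, which this ignores.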
The enabling technology
How did Bogdanove come to rely on SMRT Sequencing for TALE research? In a 2015 paper in Microbial Genomics, he described comparing PacBio assemblies to existing Sanger-based reference genomes of X. oryzae pathovars. The exercise revealed errors and omissions in the Sanger sequences, and the team concluded that PacBio sequencing was the best tool for generating de novo, whole-genome assemblies for Xanthomonas that accurately capture TALE genes.
Accurate assembly of TALE genes can be ensured by pre-assembling reads that contain TALE gene sequences. Together with co-author Nicholas J. Booher and others at Cornell, Bogdanove created a workflow for TALE gene assembly and prediction of their target sequences, the “pbx toolkit,” available on GitHub. Since then, the Bogdanove lab, along with other researchers worldwide, has used SMRT Sequencing to elucidate the molecular mechanisms of TALE function and the dance of susceptibility and resistance between pathogen and host.
Bogdanove and colleagues have used PacBio sequencing on Xanthomonas strains that infect important food crops worldwide. He writes, “Identification of the complete sets of TALEs in different isolates can be an important first step toward development and targeted regional deployment of resistant soybean varieties, and comparison of whole genome structure across strains can yield insight into the overall genetic diversity of the pathogen.”
A 2019 Genome Biology and Evolution paper describes the complete assembly, including all TALE genes, of Xanthomonas axonopodis pv. glycines isolates collected from infected soybean plants. Surprisingly, the researchers found that the TALEs of the three strains they sequenced were highly similar, despite having been collected over a span of 30 years and on two different continents. Bogdanove concludes that if there is “little to no genetic variation at their targets across commonly grown soybean varieties, such that there is no selective pressure on the tal genes to adapt,” these genes may be good targets for the development of resistance.
In another paper published in Frontiers in Microbiology, Bogdanove and colleagues in Iran used SMRT Sequencing to better understand the role of TALEs in the virulence of bacterial leaf streak in wheat caused by Xanthomonas translucens pv. undulosa (Xtu).
They sequenced the genome of the highly virulent Iranian strain ICMP11055, generating a closed 4.5 Mb genome. They then compared it to the XT4699 strain from the United States, finding two major re-arrangements, nine genomic regions unique to ICMP11055, and one region unique to XT4699, as well as differences in TAL effector genes. Mutagenesis and complementation experiments indicated that at least a subset of the TALEs contribute to the virulence of these strains in wheat.
“Our results lay the foundation for identification of important host genes activated by Xtu TALEs as targets for the development of disease resistant varieties,” the authors wrote.
Tracking the SWEET tooth of TALEs
Finding the target genes of TALEs in plant hosts is critical to understanding the co-evolution of bacterial virulence and plant resistance. In a Nature Communications paper, Bogdanove and his team used PacBio sequencing to link Xanthomonas citri subsp. malvacearum (Xcm) TALEs to SWEET (‘sugars will eventually be exported transporter’) target genes in cotton. By correlating cotton transcriptome profiling with Xcm TALE DNA binding site prediction, they postulated a connection between TAL effector Avrb6 and the induction of sucrose transporter GhSWEET10.
In follow-up experiments, the authors found that “activation of GhSWEET10 by designer TAL effectors (dTALEs) restores virulence of Xcm avrb6 deletion strains, whereas silencing of GhSWEET10 compromises cotton susceptibility to infections.” They added, “These findings advance our understanding of the disease and resistance in cotton and may facilitate the development of cotton with improved resistance to BBC [bacterial blight of cotton].”
Another study of SWEET genes focused on host resistance to the rice pathogen Xanthomonas oryzae pv. oryzae (Xoo). SWEET activation by TALEs leads to sucrose export into the xylem vessels, facilitating Xoo proliferation in rice. Researchers have identified and cultivated 42 bacterial blight resistance genes in rice, called Xa genes, which can prevent binding and activation by the cognate TALEs via two distinct mechanisms, reducing susceptibility to Xoo. All but one of the resistance genes are SWEET alleles that lack binding sites for TALEs. The final resistance gene is a mutation in a transcription factor gene that prevents TALE binding to the plant transcriptional machinery.
However, bacterial strains that can defeat these resistance mechanisms have arisen in India and Thailand. So, together with collaborators in those countries, Bogdanove sequenced the genome of one such strain from each country. Examination of the encoded TALEs revealed how the Xoo strains escaped the protective mutation of a key transcription factor gene. The mechanism for escaping protective mutations in SWEET alleles remained unclear, but the data suggested an experimental path for resolving the mystery.
“The findings open a door to mechanistic understanding of the role SWEET genes play in susceptibility and illustrate the importance of complete genome sequence-based monitoring of Xoo populations in developing varieties with effective disease resistance,” the authors wrote.
SMRT Sequencing also helped correct a long-held belief regarding Brassicaceae-infecting Xanthomonas campestris (Xc). TALE-encoding genes were thought to be absent from Xc genomes based on four reference genomic sequences. But as reported in New Phytologist, Bogdanove and colleagues from the Université de Toulouse discovered TAL genes in 26 of 49 Xc strains isolated worldwide.
Using a combination of SMRT and TALE amplicon sequencing, they created a “TALome,” a near-complete description of the TALEs found in Xc. The new resource will “open novel perspectives for elucidating TALE-mediated susceptibility of Brassicaceae to black rot disease,” the authors wrote.
The complexity of bacterial genomes
While bacteria have a reputation for having small and tractable genomes, in truth there are many clades where the presence of numerous genes from highly repetitive gene families is common. Adding to the assembly complexity, these genes are often flanked by similarly repetitive mobile elements. PacBio sequencing offers a simple, affordable solution to closing even the most challenging bacterial genomes, enabling new insights into key biological processes.
In an exciting new preprint, scientists from the HudsonAlpha Institute for Biotechnology and the University of Alabama at Birmingham describe the use of PacBio highly accurate long-read sequencing to identify pathogenic variants responsible for previously undiagnosable, rare neurodevelopmental disorders.
Lead author Susan Hiatt (@suzieqhiatt), senior author Gregory Cooper, and collaborators conducted genomic analyses of several family trios in an attempt to find causal genetic variants that had been missed with earlier studies.
“Large fractions of [neurodevelopmental disorders] cannot be attributed to currently detectable genetic variation,” they report. “This is likely, at least in part, a result of the fact that many genetic variants are difficult or impossible to detect through typical short-read sequencing approaches.”
The project involved six family trios with children affected by neurodevelopmental disorders who had previously had their genomes sequenced with short-read technology. In all cases, “no causal genetic variant, or even potentially causal variant, was found,” the scientists write.
To test their idea that the disease-causing variants were missed by short-read sequencing platforms, they next turned to PacBio for long-read whole genome sequencing. Using SMRT Sequencing on a Sequel II System, they generated HiFi reads (highly accurate long reads) that were each >99% accurate.
HiFi results “were used to detect variation within each trio and generate de novo genome assemblies, with a variety of metrics indicating that the results are more comprehensive and accurate, especially for complex variation, than those seen in short-read datasets,” the authors note. “Detection of simple-repeat expansions and variants within low-mappability regions, for example, was far more accurate in [HiFi] data than that seen in [short reads], and many complex SVs were plainly visible in [HiFi] data but missed by [short reads].”
For all six trios, SMRT Sequencing was performed in HiFi mode on the Sequel II System, covering each proband genome to an average of 30x and each parent’s genome to an average of 16x. An analysis of structural variants — the type of variation most likely to be missed by short reads — found that each trio collectively had about 56,000 structural variants, with an average of nearly 60 candidate de novo variants per child.
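The basic logic behind flagging a candidate de novo variant in a trio (present in the child, absent from both parents) can be sketched in a few lines. This is only an illustration of the idea, not the authors’ pipeline: the `SV` record, the exact-match comparison, and the example coordinates are all assumptions made for the sketch, and real structural variant comparison needs tolerant matching of breakpoints and sizes.

```python
from typing import Iterable, List, NamedTuple


class SV(NamedTuple):
    """A simplified structural variant record (hypothetical schema)."""
    chrom: str
    pos: int
    svtype: str   # e.g. "INS", "DEL"
    length: int


def candidate_de_novo(child: Iterable[SV],
                      mother: Iterable[SV],
                      father: Iterable[SV]) -> List[SV]:
    """Return variants seen in the child but in neither parent.

    Exact matching is a simplifying assumption; it will over-call
    whenever the same event is reported with slightly different
    coordinates in different family members.
    """
    parental = set(mother) | set(father)
    return sorted(v for v in set(child) if v not in parental)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only
    child = [SV("chr1", 500, "DEL", 300), SV("chrX", 100, "INS", 7000)]
    mother = [SV("chr1", 500, "DEL", 300)]
    father = []
    print(candidate_de_novo(child, mother, father))
```

Candidates from a filter like this are only a starting point; as the study describes, each one still needs review and, where warranted, experimental follow-up.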
For two of the probands, the results got even better. In one case, the team identified a de novo heterozygous insertion of nearly 7,000 bases in an intron of the CDKL5 gene that they deemed likely pathogenic. Since CDKL5 has been associated with early infantile epileptic encephalopathy 2, a condition characterized by many symptoms connected to this proband’s case, “we prioritized this event as the most interesting candidate variant,” Hiatt et al. report. “To determine the effect of this insertion on CDKL5 transcripts, we performed RT-PCR from RNA isolated from each member of the trio.” Results supported their theory that the variant has a loss-of-function effect for the individual.
For the other proband, the team found a de novo structural variant that affected two genes, DGKB and MLLT3. “[HiFi] reads and contigs from the proband’s de novo assembly support the existence of at least three breakpoints, suggesting that a ~250 kb fragment harboring three coding exons of DGKB are removed from chromosome 7 and inserted into an intron of MLLT3 on chromosome 9,” the scientists report. A qPCR analysis demonstrated that MLLT3 expression in the proband was less than two-thirds of that in her parents or in unrelated individuals. The variant, while intriguing, was classified as a variant of uncertain significance.
“The breadth and quality of variant detection coupled to finding variants of clinical and research interest in two of six probands with unexplained [neurodevelopmental disorders] strongly support the value of long-read genome sequencing for understanding rare disease,” the scientists conclude. “Further, as [HiFi reads] can capture complex variation in addition to essentially all variation detectable by short-read sequencing, it is likely that it will become a powerful front-line tool for research and clinical testing within rare disease genetics.”
See additional examples of the use of SMRT Sequencing in rare disease research and learn more about structural variant detection:
In the recent precisionFDA Truth Challenge V2, which evaluated methods for variant calling in human genomes, approaches that use PacBio HiFi reads delivered the highest precision and recall in all categories: genome-wide, specifically in difficult-to-map regions, and in the major histocompatibility complex (Figure 1). The challenge had 64 total entries: 17 using PacBio HiFi reads, 24 using Illumina reads, 3 using Oxford Nanopore reads, and 20 using multiple technologies. Twenty-five of the 26 overall most accurate callsets used PacBio HiFi reads (12 PacBio-only, 13 multi-technology), including all of the top 12 (3 PacBio-only, 9 multi-technology).
A submission from Google DeepVariant using HiFi reads achieved the highest genome-wide accuracy of any single-technology callset, with better performance for single-nucleotide variants (SNVs) and indels and 5.8× fewer total errors than the popular combination of GATK with Illumina reads (Figure 2).
The challenge was launched to evaluate variant calling for difficult regions of the human genome. Until recently, the Genome in a Bottle (GIAB) benchmarks did not measure variant calling accuracy across the most difficult 12% of the human genome, which includes many medically relevant genes. To address this, GIAB released an expanded benchmark (v4) for one of its reference samples, HG002, that covers an additional 6.3% of the genome. GIAB then developed expanded benchmarks for two other samples, HG003 and HG004. Before those benchmarks were released, the new precisionFDA challenge was used to assess currently available variant-calling techniques.
The challenge provided short reads from the Illumina NovaSeq, PacBio HiFi reads from the Sequel II System, and long reads from the Oxford Nanopore PromethION for HG002, HG003, and HG004. Competitors were invited to submit calls for HG003 and HG004, which were then evaluated against the not-yet-released “truth” variant calls for those samples. Variant calling accuracy was measured for SNVs and indels in the full genome, in difficult-to-map regions, and in the major histocompatibility complex (MHC).
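For readers less familiar with how a callset is scored against a truth set, the following Python sketch shows the basic arithmetic behind precision and recall. It is a simplification under stated assumptions: variants are compared by exact match on a `(chrom, pos, ref, alt)` key, whereas real benchmarking tools also reconcile different representations of the same variant and restrict comparison to benchmark regions. The `benchmark` function and the toy variants are hypothetical.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Score a callset against a truth set using exact-match comparison."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data only, not from the challenge.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 900, "T", "C"),
}
call_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr3", 40, "A", "C"),   # false positive
}

result = benchmark(call_set, truth_set)
print(result)
```

Scoring the same callset separately on the full genome, difficult-to-map regions, and the MHC amounts to restricting both sets to the relevant intervals before calling a function like this.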
The best overall performance was achieved using HiFi reads, which are both accurate (99.8%) and long (15-20 kb). HiFi read accuracy translates into accurate variant calls, and read length improves mappability to difficult regions of the genome. Equally important were advances in variant calling software, including DeepVariant (see Google AI blog for latest release) and DNAscope, to better model the properties of HiFi reads and utilize the long-range information that HiFi reads provide.
DeepVariant with only HiFi reads achieved 99.9% precision and recall for SNVs and 99.4% precision and recall for indels. In comparison, DeepVariant with Illumina reads had 4.2× more SNV errors but 1.5× fewer indel errors. The best Oxford Nanopore callset had 3.8× more SNV errors and 58.2× more indel errors (Figure 2).
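Fold differences such as “4.2× more SNV errors” are ratios of error counts between two callsets. The short Python sketch below illustrates that arithmetic, assuming that errors are counted as false positives plus false negatives. The function names and all counts are hypothetical and chosen only to reproduce a 4.2× ratio; they are not the challenge's actual numbers.

```python
def total_errors(false_positives: int, false_negatives: int) -> int:
    """Count errors as false positives plus false negatives (an assumption)."""
    return false_positives + false_negatives


def fold_more_errors(other_errors: int, reference_errors: int) -> float:
    """Return how many times more errors `other` has than `reference`."""
    if reference_errors == 0:
        raise ValueError("reference callset has no errors; ratio is undefined")
    return other_errors / reference_errors


# Hypothetical SNV error counts for two callsets.
reference_snv_errors = total_errors(false_positives=1_000, false_negatives=1_000)
other_snv_errors = total_errors(false_positives=4_000, false_negatives=4_400)

fold = fold_more_errors(other_snv_errors, reference_snv_errors)
print(f"{fold:.1f}x more SNV errors than the reference callset")
```

A value below 1 from `fold_more_errors` corresponds to the “fewer errors” phrasing: a callset with 1.5× fewer errors than the reference would return roughly 0.67.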
The precisionFDA contest was an important opportunity to evaluate variant-calling methods, and it demonstrates how HiFi reads provide more comprehensive and accurate variant detection. We are excited to see how researchers apply this capability to search for new disease genes and to solve rare disease cases that have gone undiagnosed by other approaches.
Hear Aaron Wenger, a Principal Scientist at PacBio, present a summary of the results from the precisionFDA Truth Challenge V2: