This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
An ambitious project to sequence 5,000 microbial genomes was jointly initiated by a consortium of 10 institutions across China, including Nankai University, China CDC, Academy of Military Medical Science, Third Institute of Oceanography-Ministry of Natural Resources, South China Sea Institute of Oceanology-CAS, China National Center for Food Safety Risk Assessment, Shandong University, Tianjin University of Science & Technology, East China University of Science and Technology, and Tianjin Biochip Corporation (TBC).
TBC, a PacBio service provider in China, has led the sequencing phase of the project, which is expected to be completed by the end of 2019. We recently sat down with Sun Yamin, general manager of TBC, to learn more about the project.
What’s the difference between the Prokaryotes 5,000 Complete Genomes Project (P5KCGP) and other microbial sequencing projects?
Previous microbial genome projects were scattered and typically based on one researcher’s own interests and directions. As a result, many common microbial species’ genomes have been sequenced repeatedly, while less commonly studied microbial species have still not been sequenced at all.
The current microbial genome database has an obvious species imbalance. Many microbial genomes have only low-quality genomic scaffolds. Our goal is to create a genomic database that covers a much broader array of microbial diversity, including pathogenic microorganisms, food safety microbes, marine microbes, and terrestrial resource microbes.
We expect to add at least 500 new microbial genomes that are currently not found in the NCBI database by the completion of the project. Our goal is to submit a high-quality, closed genome with no gaps for each of the 5,000 microbial genomes included in our project. To achieve this goal, we chose the PacBio Sequel System as our sequencing platform, as SMRT Sequencing technology combines long read lengths, high accuracy, and no GC content bias.
At present, only the Sequel System can meet our project requirements, given the challenges presented by many bacterial genomes. Using the latest version 3.0 reagents, the average read length of 22 kb on the Sequel System is sufficient to span repeats that can be more than a dozen kilobases in length in some bacterial genomes. In addition, we have seen GC content up to 70% in microbial samples we’ve sequenced. Even so, assembly can be accomplished easily with PacBio data.
What is the significance of the P5KCGP project?
While microorganisms were the first genomes to be sequenced by scientists, the sum of all microbial sequencing data worldwide is less than the amount of data produced by a laboratory that performs human genome sequencing. Although the genomes of microorganisms are relatively small, the enormous species and functional diversity of microorganisms in nature means that microbial genomics has not been given sufficient attention. For pathogenic and foodborne microorganisms in particular, it is important to have reference-quality genomes.
What challenges has the P5KCGP project encountered?
1) Sample collection. On average, each partner needs to provide 400-500 microorganisms. Since our goal is to include bacterial species that are rare in nature, it can take a long time to isolate and grow samples.
2) Controlling costs. Generating closed microbial genomes requires more resources than simply coming up with a bunch of draft genomes. To manage sequencing costs, we have succeeded in multiplexing 16 microbial samples on each SMRT Cell 1M by optimizing the library preparation process.
3) Dealing with difficult-to-sequence microbes. The habitat of microorganisms in nature is diverse, and some live in extreme environments requiring quite high GC content in their genomes. Sequencing of such microbial genomes is more difficult.
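To put the multiplexing figure in perspective, here is a back-of-the-envelope sketch of how many SMRT Cells the project might need. It assumes, purely for illustration, that every genome closes on its first attempt and that each cell is filled with exactly 16 samples; the function name is hypothetical and real-world re-runs would raise the total.

```python
import math


def smrt_cells_needed(n_genomes: int, samples_per_cell: int) -> int:
    """Minimum number of SMRT Cells if every cell is fully multiplexed.

    Assumption (not from the interview): no failed libraries or re-runs.
    """
    if samples_per_cell <= 0:
        raise ValueError("samples_per_cell must be positive")
    return math.ceil(n_genomes / samples_per_cell)


# 5,000 genomes at 16 samples per SMRT Cell 1M
cells = smrt_cells_needed(5000, 16)
print(cells)
```

Without multiplexing, the same estimate would be one cell per genome, which is why the library preparation work matters so much for cost.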
What groundwork does this project lay for future research efforts?
We want to better understand how microbes that are widely distributed in nature have evolved and adapted to diverse environments with the much more complete survey of microbial genomes made available through this sequencing project. In addition, some rare microorganisms living in extreme environments often have potential industrial value. Two examples sequenced through this project are the extremely acidophilic methanotroph isolate V4, Methylacidiphilum infernorum, and Geobacillus thermodenitrificans.
Patients with myotonic dystrophy type 1 (DM1) want to know their size — the size of the expansion of repeats of the unstable CTG sequences that cause the progressive deterioration of neuromuscular functions that they might face.
Size matters to them, because it has been found to correlate with the severity and onset of symptoms, which can range from severe cardiac and respiratory abnormalities and intellectual impairment in children, to muscle weakness, hypersomnolence or cataracts in adults. The earlier the onset, the more severe the symptoms tend to be. The autosomal disorder, which is the most common form of inherited muscular dystrophy in adults, also tends to get progressively worse with each generation. But the manifestations vary widely between patients, and even within families, making it extremely difficult to predict how it will affect any individual.
Stéphanie Tomé would like to arm genetic counselors with more information to help patients navigate through their difficult diagnoses and prognoses, and to inform their decisions about their own lives and those of their offspring. Ultimately, she would also like to be able to provide them with new options to manage or even alter their diseases.
To do so, she needs to be able to read the repeats, which can be encoded in sections as large as 3,000 triplets. So she has turned to PacBio SMRT Sequencing, which is capable of capturing sequences of long stretches of DNA, including complete regions of repeats found in patients with DM1 and other expansion disorders, such as Huntington’s Disease and Fragile X.
Tomé, an investigator at the Centre de Recherche en Myologie at Sorbonne Université/INSERM in Paris, is the winner of the 2019 Targeted Sequencing SMRT Grant. Along with the 10 other scientists in her research group, led by Geneviève Gourdon, and collaborators from around the world, Tomé will sequence sections of mutated genes in DM1 patients to determine the exact size and pattern of CTG repeats.
“We have some idea of what may be going on at either end of these regions, but we don’t have any information about what is happening in the middle,” Tomé said. “Improving our knowledge of the entire repeat sequence will help us make clearer correlations between the genetic instability and the clinical manifestations of DM1.”
Information generated in the project could also help researchers advance their understanding of some of the mechanisms behind the degenerative disorder.
If the disorder is characterized by long lengths of trinucleotides gone haywire, then it would be advantageous to be able to shrink the repeat regions back down to an asymptomatic size. Researchers have found cases where the regions have naturally contracted, and others where there are interruptions in the repeat codes.
Tomé and colleagues are pursuing this avenue of research, hoping to be able to harness knowledge about contractions and/or interruptions to induce them as a way to prevent and/or treat DM1 and other disorders. Tomé said drug screens on mouse models have already identified some potential compounds that could induce contraction, but they need to be tested and modified for use in humans.
Putting it in Perspective
Tomé admits that the data gathered from this project will likely not lead to immediate solutions, but it could provide some immediate relief to patients hungry for more insight into their disorder. And she hopes that SMRT Sequencing could become an alternative method of molecular diagnostics to improve the prognosis and counseling offered to patients.
“Currently, the clinical labs tend to use Triplet Prime PCR. With this technique, we can say whether a patient is going to become sick or not sick, but it’s difficult to provide any sort of prognosis,” Tomé said. “To be able to give the patient more precise information, quickly, is very important, I think. Many patients are anxious, and don’t understand why there is so much variability between their son and daughter. They want to know.”
By collaborating with clinicians and a multidisciplinary group of 10 teams at Centre de Recherche en Myologie, Tomé embraces any opportunity to get different perspectives on the disorder, including the patient perspective.
“It’s very interesting to talk to the patients. By staying in the lab, you can lose sight of the bigger picture. By leaving the lab, you get new ideas, you learn more about what the problems are and what you might be able to do to improve the lives of patients,” Tomé said.
As the behavior of repeat regions appears similar between triplet diseases, Tomé said the project’s findings might also be applicable to 13 other expanded repeat disorders.
“This widens the potential impact of our study considerably,” she said.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor and Certified Service Provider, the McDonnell Genome Institute at Washington University in St. Louis, for supporting the 2019 Targeted Sequencing SMRT Grant Program.
The annual meeting of the European Society of Human Genetics — held last month in the sleek Swedish Exhibition & Congress Center in Gothenburg, Sweden — was a terrific assembly of thousands of scientists who are together pushing the boundaries of what’s possible in genome research. The PacBio team particularly enjoyed seeing so many impressive ESHG presentations with scientific results from SMRT Sequencing pipelines featuring applications such as de novo whole genome sequencing, structural variant detection, the Iso-Seq method, and targeted sequencing.
For example, in a plenary talk, the University of Southern California’s Mark Chaisson (@mjpchaisson) spoke about using long-read PacBio sequencing to analyze structural variation across human genomes. Representing the Human Genome Structural Variation Consortium, he talked about the growing number of available de novo sequenced human genomes, along with the need to characterize their complete universe of structural variants, many of which are missed in short-read assemblies.
Chaisson presented results from trio sequencing projects run by the consortium, showing that this approach allows for reliable and accurate phasing even of large structural variants, thanks to the use of long-read data. He noted that with the PacBio Sequel II System, it is now feasible to fully sequence a human genome in a single run. Chaisson concluded, “We are now in a realm where large scale human genome sequencing studies can be done using a long-read approach.”
Other great presentations came from Jozef Gecz at the University of Adelaide, who spoke about a repeat expansion associated with a heritable form of epilepsy, and Michael Talkowski of the Broad Institute, who presented on structural variant discovery and the use of sequencing systems for genomic medicine. There were also several posters from PacBio users with exciting results, such as a Swedish reference genome, clinical sequencing of the SMA gene, and amplification-free sequencing of a repeat expansion that causes corneal dystrophy.
Our own team presented posters at ESHG as well. Billy Rowell shared “Comprehensive Variant Detection in a Human Genome with Highly Accurate Long Reads” while Jenny Ekholm presented “Sequencing the Previously Unsequenceable Using Amplification-free Targeted Enrichment Powered by CRISPR/Cas9.”
We’d like to thank all of the scientists who checked out our posters or stopped by the PacBio booth to learn more about SMRT Sequencing applications and the new Sequel II System. We appreciate your time and interest!
A new preprint evaluates the utility of PacBio HiFi reads for assembly of a human genome. The study is a follow-up to a recent publication in Nature Biotechnology that introduced a technique to generate sequencing reads with both long read length and high accuracy.
“Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads” comes from lead authors Mitchell Vollger and Glennis Logsdon, senior author Evan Eichler, and collaborators at the University of Washington, PacBio, and other research institutes. For this project, they focused on sequencing a hydatidiform mole human cell line (CHM13), a useful model system because it is haploid, unlike typical diploid human cells. “We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets,” the scientists write.
The team generated 24-fold coverage of CHM13, the same sample used to produce a previous assembly with CLR data. They employed the Sequel II System, producing an average of 19.1 Gb of HiFi reads with each SMRT Cell 8M. The HiFi and CLR assemblies had similar contiguity: contig N50 of 29.5 Mb for HiFi and 29.3 Mb for CLR. The HiFi assembly was much more accurate, with an estimated Phred quality value of Q45, compared to Q40 for the CLR assembly. Further, the authors note that, due to divergence in BAC clones used to measure accuracy, the quality value for the HiFi assembly is “a lower bound of the true QV.”
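For readers less familiar with the two metrics quoted here, the sketch below shows how they are conventionally defined. A Phred quality value Q corresponds to an error probability of 10^(-Q/10), and contig N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size. The function names and the toy contig lengths are illustrative assumptions, not taken from the preprint.

```python
def phred_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Q40 vs Q45: how many times fewer errors does Q45 imply?
fold_fewer_errors = phred_to_error_rate(40) / phred_to_error_rate(45)
print(round(fold_fewer_errors, 2))

# Toy example with made-up contig lengths (arbitrary units)
print(n50([10, 8, 6, 4, 2]))
```

Read this way, the step from Q40 to Q45 means roughly one error per 10,000 bases versus roughly one per 30,000, a bit more than a threefold reduction.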
Next, the scientists performed an analysis of segmental duplications (SDs), which are notoriously challenging elements to assemble correctly. The HiFi assembly resolved more of these duplications than the CLR assembly. “HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of large tandem repeats, as validated with orthogonal analyses… This is the highest fraction of resolved SDs for any of the published assemblies analyzed thus far,” they report.
“We conclude that there are three essential strengths of the HiFi technology over CLR technology,” the authors write, citing reduced compute time to generate a de novo assembly, superior assembly accuracy, and improved ability to assemble the most difficult regions of the genome. “Our results suggest that HiFi may currently be the most effective stand-alone technology for de novo assembly of human genomes.”
Crucial assembly sites and mitosis mediators, centromeres are central to every cell, but missing from even the most complete genome assemblies.
In a PLOS Biology paper, Amanda Larracuente and colleagues at the University of Rochester and Barbara G. Mellone of the University of Connecticut described how they sequenced the repetitive regions of the fruit fly genome, including its centromeres, using SMRT Sequencing.
Embedded in blocks of highly repetitive satellite DNA, centromeres have eluded efforts at assembly.
Only recently have long-read single-molecule sequencing technologies made it possible to obtain assemblies of highly repetitive parts of multicellular genomes, such as the human Y chromosome centromere and maize centromere 10. The fruit fly study is the first time researchers have sequenced all the centromeres in any multicellular organism.
“Our study shows that combining long-read sequencing with ChIP-seq and chromatin fiber FISH is a powerful approach to discover centromeric DNA sequences and their organization,” the authors wrote. “Our overall strategy therefore provides a blueprint for determining the composition and organization of centromeric DNA in other species.”
Drosophila melanogaster proved the ideal model to investigate centromere genomic organization, as it has a relatively small genome (roughly 180 Mb), organized in just three autosomes (chromosomes 2, 3, and 4) and two sex chromosomes (X and Y). The estimated centromere sizes in Drosophila cultured cells range between 200 and 500 kb and map to regions within large blocks of tandem repeats.
It has been believed that satellites are likely the major structural elements of Drosophila, human and mouse centromeres. By tracking the histone H3 variant centromere protein A (CENP-A), the team was able to identify the fruit fly centromeres and found that they primarily occupy islands of complex DNA enriched in retroelements flanked by large blocks of simple satellites. They estimate that approximately 70% of the functional centromeric DNA of D. melanogaster is composed of complex DNA islands, which are rich in non-LTR retroelements and buried within large blocks of tandem repeats.
“They likely went undetected in previous studies of centromere organization because three of the five islands are either missing or incomplete in the published reference D. melanogaster genome … having an improved reference genome assembly is crucial for identifying centromeric DNA sequences,” the authors state.
The retroelements they found were not merely present near centromeres, but were components of the active centromere cores.
“Why retroelements are such ubiquitous components of centromeres and whether they play an active role in centromere function remain open questions,” the authors wrote.
Additional avenues worth exploring include identifying associated tandem repeats, as well as mapping the span of the CENP-A domain and its binding sites.
“Knowing the identity of D. melanogaster centromeric DNA will enable the functional interrogation of these elements in this powerhouse model organism,” the authors wrote.
To enable better understanding of biology, sequencing data must be accurate and complete. This is especially true when seeking out variants and determining their implications.
Luckily, technical and software improvements for SMRT Sequencing are making it easier to efficiently generate genome assemblies with unparalleled accuracy.
As presented in a webinar by PacBio Staff Scientist Sarah Kingan (@drsarahdoom) and GoogleAI Genomics Project Lead Andrew Carroll (@acarroll_ATG), HiFi reads enabled by circular consensus sequencing (CCS) on the new Sequel II System challenge the notion that sequencing technologies require a tradeoff between length and accuracy.
Kingan highlighted several benefits to using HiFi data for genome assembly:
- Higher accuracy of assemblies due to the high inherent base quality of HiFi reads
- Dramatic time-savings in generating a genome assembly
- Algorithmic improvements in the FALCON assembler that enhance the performance of HiFi assemblies
HiFi reads are extremely accurate because they utilize single-molecule consensus, rather than multiple-molecule consensus, which is required for traditional long-read assembly methods. The resulting HiFi assemblies have higher base accuracy than assemblies produced by continuous long reads.
HiFi reads are also more efficiently produced by CCS due to algorithmic enhancements that reduce compute time. With the upcoming software release, CCS analysis for a single SMRT Cell 8M run on the Sequel II System is expected to complete in 3.5 hours.
Because the HiFi reads are already error corrected, the genome assembly process is simplified and streamlined, requiring only 20% of the compute time for a human genome compared to a continuous long read assembly.
HiFi data needed HiFi-ready assembly tools
In order to make the most of these improvements, some assembly and analytical programs have also been modified.
While testing the system on several human and animal genomes, Kingan said the PacBio team achieved equivalent or higher contiguity in multiple species, such as the fruit fly and bluefin tuna. But in a complex plant genome such as rice, with its multitude of repeat-induced overlaps, the results weren’t as robust.
So Kingan and colleagues modified the FALCON-Unzip assembler to make the most of the higher accuracy HiFi reads. By ignoring indel differences, they were able to better assemble the plant genomes. These latest features will be added soon to the already-incorporated improvements of faster read tracking and polishing.
Deep learning digs deeper
When it comes to assessing the “unknown unknowns,” artificial intelligence and machine learning are better than even the most robust human-designed algorithms, said GoogleAI’s Carroll.
His team has developed DeepVariant, a germline variant caller distinguished by its best-in-class accuracy. The open source program is also extensible: it can be re-trained for new technologies without writing new software, and this is exactly what his team did in order to better handle HiFi data.
HiFi read errors are different from short-read errors, Carroll explained. Short-read data can lead to mapping complexity and coverage variability. HiFi reads are much more mappable and uniform, but can have noisier indel lengths in homopolymers, he said.
Carroll’s team fed the DeepVariant program millions of examples and labels from Genome in a Bottle to update weights in the model. Considering the range of uses and needs of PacBio users, they included data collected on both SMRT Cells 1M and 8M, featuring a variety of insert sizes and coverage levels.
Better training yields better results
The team saw somewhat improved SNP accuracy, which was already very high; substantially improved indel accuracy; and robust, more uniform coverage titrations. The developers were surprised to see that DeepVariant was also able to call some structural variants without specifically being trained to do so. And the improved DeepVariant v0.8 was able to confidently call regions that were previously deemed “difficult,” “non-confident,” or “non-callable.”
“AI-based programs actually benefit from more data and more difficult – and different – data,” Carroll said. “There are thousands more variants we can now call confidently.” Improved haplotype phasing and the ability to call variants in other HiFi data types are also on the horizon, Carroll said.
Watch the complete webinar and visit www.pacb.com/HiFi to learn more.
Variety is the spice of life, and one of the drivers of genetic variation is gene splicing.
After a gene is transcribed, there are alternatively spliced transcripts that add even more variety to that gene’s expression and its menu of phenotypes.
Some disorders appear to take advantage of this variety. Chief among them are myeloid disorders, where somatic mutations in splicing factors lead to cell proliferation in myelodysplastic syndromes (MDS) and blood cancers.
Christopher R. Cogle, a physician-scientist at the University of Florida, would like to understand why, in hopes that such knowledge could be used to develop new therapeutic strategies to target acute myeloid leukemia (AML).
With the help of the Icahn Institute for Data Science and Genomic Technology at Mount Sinai, the 2019 RNA Sequencing SMRT Grant recipient will be able to interrogate the differential isoforms within AML cell lines and test the effects of a novel splicing factor depletion agent his lab has created.
AML is the result of a multistep transforming process of hematopoietic stem and progenitor cells (HSPCs) which enables them to proceed through limitless numbers of cell cycles and to become resistant to cell death. Interference with DNA replication using a combination of chemotherapy drugs has been the mainstay in AML therapy for more than fifty years, but the relapse rate is still very high.
Cogle’s lab has found that some of these leukemia cells embed within blood vessels to protect themselves from these drugs, so he developed a vascular disrupting agent to break down this sanctuary. He treated around 40 patients with the therapy, with success but also some side effects.
Seeking a similar, but better tolerated alternative, he returned to the lab, growing AML cells on endothelial cells and then testing the leukemia killing activity of 31 million compounds. He identified several promising compounds that selectively killed AML cells within the vascular niche, while sparing endothelial cells and normal lymphocytes.
Extensive proteomic studies on one of the hit compounds showed that it binds and inhibits a splicing repressor. To understand the role of this splicing repressor in AML, Cogle’s team generated cell lines of human AML with and without knock-down of the splicing repressor and found that depletion of the splicing repressor leads to AML cell death and failure to engraft in mice. But downregulation of the splicing repressor in normal hematopoietic cells does not affect cell viability or proliferation.
Indispensable – or not
Why is the splicing repressor seemingly indispensable in AML, yet dispensable in normal hematopoietic cells? This is the question Cogle is hoping the Iso-Seq method will answer.
He plans to use the PacBio RNA sequencing technique to compare gene expression and transcript isoform expression in AML cells versus normal HSPCs, with and without knock-down of the splicing repressor.
“We’ve used conventional RNA sequencing to examine the splicing repressor depleted AML cell lines, and wished we had longer and more reads to detect the full variety of isoforms under the control of the splicing repressor,” Cogle said.
The initial RNA sequencing data was able to illuminate some biology — and several gaps that Cogle hopes to fill with more robust Iso-Seq data. He will be working with his UF colleague Ana Conesa, who has experience in functional RNA splicing as well as Big Data, including the development of a newly released computational tool, tappAS.
“In order to get to the resolution needed, you need to move beyond conventional RNA sequencing. Iso-Seq will be an important tool in dissecting these splicing mechanisms and matching them to cancer phenotypes.”
Cogle’s ultimate goal is to get these new therapies into the clinic, and full-length transcript sequencing and isoform analysis will help this endeavor in many ways. It will help explain the oncogenic mechanisms of blood cancer and how his pharmacological agents work — information that can be used in validation studies, toxicology studies, designing bioactivity assays for early phase clinical trials, and expansion campaigns to identify additional compounds.
It will also help answer some fundamental biological questions.
“This is where PacBio will have one of its greatest impacts in science,” Cogle said. “It allows people to look at alternative splicing to a depth and breadth that conventional RNA sequencing cannot.”
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor, the Icahn Institute for Data Science and Genomic Technology at Mount Sinai, for supporting the 2019 RNA Sequencing SMRT Grant Program.
Today we offer the final post in our blog miniseries about early access users’ experiences with the new Sequel II System. Shane McCarthy, a scientist at the University of Cambridge who was able to use the new sequencing system at the Wellcome Sanger Institute, gave a presentation on his experience generating data for tree-of-life sequencing projects.
McCarthy participates in several of these large-scale projects, such as the Vertebrate Genomes Project, the Sanger 25 Genomes Project, and the Darwin Tree of Life Project. For all of them, the goal is to produce high-quality, phased, chromosome-level assemblies with minimal gaps.
Through Sanger’s early access to the Sequel II System, McCarthy and his team were able to evaluate the new sequencing system’s performance on several animal genomes. These included fish (brown trout, sterlet, ploughfish, and milkfish), amphibians (Gaboon caecilian and common frog), and others; most had been sequenced previously so there were existing genomic resources to use for comparison.
The genomes were assembled with a mix of continuous long read (CLR) data and HiFi data, the latter of which is produced via circular consensus sequencing (CCS). For the CLR sequencing mode, the new SMRT Cell 8M yield was 80 Gb to 90 Gb. In CCS mode, the cells often produced more than 250 Gb of raw data. “We were quite happy” with the yields, McCarthy said, noting that the system performed consistently.
After giving an overview of his work, McCarthy dove into detailed looks at two of the fish samples to help webinar attendees understand the Sequel II System’s performance. For the sterlet, which has a genome made more challenging due to an unresolved whole genome duplication that left some residual tetraploidy, his team used two SMRT Cells of CLR data for the assembly. They compared the results for this fish to previous assemblies of its parents, using trio binning to assign haplotypes to their maternal or paternal origin. A BUSCO analysis found that more than 92% of genes were complete in each haplotype, a level that McCarthy considers very good at this stage of the assembly. He also presented data on milkfish, which similarly led to strong results (at least 95% of genes were complete) from BUSCO analysis.
McCarthy noted that data from these projects are being made available through the VGP. As for the Sequel II System, he concluded, “it’s a huge leap in scaling and affordability for these tree-of-life genome assembly projects.”
For more details, watch McCarthy’s full presentation.
A new publication released in PLOS One from scientists at the Mayo Clinic offers a great look at our CRISPR/Cas9-based, amplification-free targeted sequencing method and its utility for accurately sizing a clinically important repeat expansion.
“Amplification-free long-read sequencing of TCF4 expanded trinucleotide repeats in Fuchs Endothelial Corneal Dystrophy” comes from lead author Eric Wieben, senior author Michael Fautsch, and collaborators. This is the second group to use the amplification-free technique for this disease; the first performed their work on a PacBio RS II System, while this team used the newer Sequel System.
What makes the disease such an interesting target for this approach? While Fuchs endothelial corneal dystrophy (FECD), a late-onset degenerative eye disease, affects just 4% of Caucasians in the U.S., more than 75% of those cases can be traced to an expansion of a CAG repeat found in the TCF4 gene. That makes FECD “the most common disease that is attributable to the expansion of a trinucleotide repeat,” according to the paper. Intriguingly, Mayo Clinic investigators have found that a fraction of patients with the repeat expansion don’t develop the disease; in this project, they aimed to test their hypothesis that interruptions in the repeat sequence may explain the phenomenon.
But identifying interruptions required sequencing the entire length of the repeat expansion, something that could not be done with PCR amplification due to its likelihood of introducing confounding artifacts in sequencing data. The team turned to PacBio’s recently launched amplification-free protocol (we call it No-Amp), which uses the CRISPR/Cas9 system to target specific sequences of interest. “This method permits the enrichment and direct sequencing of targeted sequences without PCR amplification,” the scientists write, adding that SMRT Sequencing technology “also permits the generation and analysis of full-length sequences from even expanded repeats.”
To evaluate this method, scientists compared results to those from an STR assay and from Southern blots. The data were highly concordant: all of the amplification-free “size estimates for sub-pathological length repeats match the STR results within 1 repeat triplet,” they report. Also, “the sequencing was successful in identifying a previously described interruption within an unexpanded allele and provided sequence data on expanded alleles greater than 2000 bases in length,” the team notes.
While this study found no novel repeat interruptions that might explain why certain individuals do not develop FECD, it did generate some interesting results. First, two samples were found to have “novel variation in the AGG repeats that immediately precede the CAG repeats,” which could help scientists hone their hypothesis about these patients. Another intriguing discovery was the heterogeneity in repeat lengths of the expanded allele. “Given that these samples were not PCR amplified, this suggests somatic instability of the expanded repeat sequence and consequent mosaicism within the population of leukocytes used for the analysis of each specimen,” the scientists report. “In contrast, there is very little heterogeneity of subpathogenic alleles (<40 repeats).”
What makes one strain of cannabis have potent psychoactive properties, and another more suitable for medicinal purposes? Scientists are several steps closer to figuring it all out, thanks to PacBio long-read sequencing and transcriptome analysis of the Cannabis sativa plant.
In a recent webinar, Kevin McKernan (@Kevin_McKernan) of Medicinal Genomics (MGC) described how his company’s efforts to create a Cannabis Pan-Genome have already netted interesting results.
Using MGC’s assembly of the female Jamaican Lion cultivar as a baseline, the team isolated genomic DNA from a sibling male plant and multiple offspring and sequenced it with the Sequel II System to identify structural and other types of important genetic variations.
This “family” sequencing strategy yields a recombination map and is the basis for creating a pan-genome of cannabis. It has helped the team identify the genetic variations that cause a plant to produce the important cannabinoids of THC (tetrahydrocannabinol, which can cause intoxication), CBD (the active ingredient in medicinal cannabis), or a mixture of the two, referred to as chemotypes (I-IV) — key to breeding for cannabis yield, potency, and a host of other traits.
McKernan explained how PacBio sequencing is what made the project possible. He used a combination of short- and long-read sequencing technologies in the past, but found the short reads could not capture structural variation sufficiently, especially in a genome that is many times more complex than the human genome. He said that 25-35% of the genome was not mappable with short reads.
Relying on SNP chips is also not ideal, as they require primers that assume a certain reference sequence.
“When there are variants under these primers, of which there are lots in cannabis, the assays fail to produce clean signal,” McKernan said. “A high number of SNP chip assays will fail as they were designed before this was known.”
Using SMRT Sequencing, McKernan’s team found more than 116 Mb of structural variation in their trio “family” sequencing, accounting for 1/8th of the genome, including more than 1 million small (less than 50 base pairs) insertions and deletions. The human genome, by comparison, contains about 7 Mb in structural variation, or less than 1% of the genome.
“Genotyping needs to dance around all this variation,” McKernan said. “Now we have it all beautifully resolved with PacBio.”
They also conducted single-molecule mRNA isoform sequencing (Iso-Seq analysis) on five different parts of the plant. This allowed them to characterize between 45,000 and 85,000 genes expressed in cannabis, and to create maps of methylation, splicing sites, and recombination hotspots, in collaboration with Phase Genomics and New England Biolabs.
Because of this, they now have a better picture of the Y chromosome, which could offer insight into hermaphroditism within the species.
“Life always finds a way. When cannabis is stressed, it switches sex and self pollinates,” McKernan said. “We recently found an S1 gene that might be playing a role in the process.”
Among the surprises they encountered was the level of chloroplast heteroplasmy in cannabis. Chloroplast genomes are the most popular targets for improving yield, so they are important to understand, McKernan noted.
“We identified eight different haplotypes. We are still sorting through what this means,” he said.
They also learned that cannabinoid synthases had introns and identified previously unknown genes involved in pathogen defense. The discovery of other novel genes “might allow us to resurrect other synthases that have been bred out of existence,” McKernan said.
In addition to the insights MGC has gleaned, the data they accumulated and posted publicly has already been incorporated into research by other teams. Conor Jenkins and Ben Orsburn of Think20 Labs in Maryland, for example, recently constructed a draft map of the cannabis proteome.
Other applications enabled by the enhanced reference genome and transcriptome data include an mRNA LAMP assay to distinguish between hemp and cannabis. By law, hemp is defined as having less than 0.3% THC. By testing active CBDA levels, MGC will be able to test whether a strain is hemp, as CBDA expression and THCA expression are inversely correlated.
Watch the complete webinar to learn more:
Seeking to sequence and characterize entire transcriptomes in one go? Our new Iso-Seq protocol and reverse-transcriptase PCR kit make it easier, speedier and cheaper.
Run on the new Sequel II System, the completely revamped Iso-Seq Express workflow achieves whole transcriptome characterization from a single SMRT Cell 8M delivering up to 400 Gb, and at a third of the cost, or less. Yield has also increased on the Sequel System, with 3.0 sequencing chemistry typically delivering up to 30 Gb per SMRT Cell 1M for our RNA sequencing application.
The new protocol requires three times less RNA input (300 ng) and minimizes handling-induced cDNA damage. Preparation is also simplified, allowing you to go from total RNA to SMRTbell library in one day. And no need for size selection: the ProNex bead system (used in place of AMPure PB beads) allows you to tune the size preference of full-length transcripts.
Early access users have praised the new protocol for its ease, output and speed. “Never got this much data and long polymerase read length,” noted one customer. “cDNA length is also way better than any previous data.”
The new Iso-Seq Express also supports multiplexing up to 12 samples for additional savings. This will be particularly useful for researchers interested in capturing gene diversity expressed from different tissues, timepoints, or experimental conditions.
In addition, our improved Iso-Seq workflow in SMRT Link 7.0 allows you to analyze your data and identify full-length transcripts with a faster turnaround time.
Identify, Annotate, Characterize
For plant and animal researchers who need high-quality gene models, full-length transcript sequencing is a key tool for annotating reference genomes or providing a reference transcriptome in the absence of a genome.
For human genetics researchers who are trying to identify disease-causing variants in rare and Mendelian disorders, the improvements allow unambiguous characterization of complex alternative isoforms.
Transcriptome sequencing has also proven important for understanding the functional consequences of cancer genome mutations, including structural rearrangements. The genomes of the COLO 829 melanoma and matched normal peripheral blood cell lines were recently sequenced to 50-fold coverage and paired long-read transcriptome data was generated using the updated Iso-Seq Express method. The results were shared at the AACR 2019 Annual Meeting and the complete data set is now available for download.
In this blog miniseries, we’re recapping presentations from early access users of the Sequel II System. Today, we summarize Luke Tallon’s report from Maryland Genomics, a PacBio Certified Service Provider.
Like the other early access users, Maryland received 32 SMRT Cells for use in evaluating the new sequencing system. They tested them across a range of applications: continuous long-read (CLR) sequencing for humans, plants, insects, and bacteria; and HiFi mode, powered by circular consensus sequencing, for human, microbiome, and other samples.
Tallon reports that each SMRT Cell 8M averaged about 92 Gb of sequence data in long-read mode; for HiFi mode, that rose to an average of about 260 Gb with a unique molecular yield of more than 16 Gb and a mean quality score of Q32. Compared to the Sequel System, they saw “comparatively longer reads from long libraries,” he said. In comparisons of libraries run on both the Sequel System and the Sequel II System, his team saw a more than 10-fold increase in sequencing capacity.
Tallon also offered three vignettes to illustrate his lab’s experience with the new sequencing system. First, he reviewed results from sequencing a 16-plex microbial pool totaling about 80 Mb. The pool was run on a single SMRT Cell 8M, yielding 215 Gb of data which allowed the team to generate high-quality assemblies. Twelve of the 16 microbial chromosomes were assembled into a single contig, and all assemblies achieved at least 99.9% completeness. Based on a detailed analysis, Tallon concluded that with the Sequel II System, “we can certainly multiplex at a much deeper level than this.”
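As a rough check on that conclusion, the average depth implied by the figures above can be estimated by dividing total yield by the combined size of the pooled genomes. This is a back-of-the-envelope sketch only: the function name is illustrative, and it assumes reads are spread evenly across all 16 genomes, which a real pool will not achieve.

```python
def mean_coverage(total_yield_bases: float, combined_genome_bases: float) -> float:
    """Average depth = total bases sequenced / total bases across the pooled genomes."""
    if combined_genome_bases <= 0:
        raise ValueError("combined genome size must be positive")
    return total_yield_bases / combined_genome_bases


GB = 1e9  # bases per gigabase
MB = 1e6  # bases per megabase

# Figures quoted in the text: 215 Gb of data for a 16-plex pool totaling about 80 Mb.
depth = mean_coverage(215 * GB, 80 * MB)
print(f"Approximate mean depth across the pool: {depth:,.0f}x")
```

Even under this idealized assumption, the estimate lands in the thousands-fold range, far more than a microbial assembly typically needs, which is consistent with the comment about multiplexing more deeply.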
Next, he presented data from full-length 16S amplicon sequencing, which provides more precise taxonomic information than conventional short-amplicon 16S profiling. Using HiFi mode, Tallon’s team ran two sets of 96-plex 16S amplicon libraries consisting of mock communities and infant gut samples. The resulting data had quality scores as high as Q80, with no degradation seen at the 3’ end. After demultiplexing the data, the team had a mean of ~12,000 HiFi reads per sample; 87% of those reads could be assigned to a species-level identification, and in some cases even sub-species level assignment was possible. In addition, the Sequel II System data was able to more faithfully recapitulate the composition of the mock community sample compared to a short-amplicon approach, showing the advantage of full-length 16S sequencing with the Sequel II System.
Finally, Tallon showed results from metagenome shotgun sequencing of five vaginal microbiome samples with the goal of assembling complete genomes. Despite having limited starting material and significant host contamination, Tallon reports, “we were still able to assemble complete genomes,” including one for the unculturable bacterial strain known as BVAB1, for which no reference genome previously existed. BVAB1 is a Clostridiales species linked to poor outcomes when present in the female reproductive tract.
For more details, watch Tallon’s full presentation:
Maryland Genomics is currently offering a SMRT Grant to explore metagenomes in high resolution using their Sequel II System. Submit your 250-word proposal by August 2 for a chance to win! They will also be exhibiting at ASM Microbe 2019 in San Francisco, which kicks off today; visit them at booth 861.
To learn more about users’ experience with the Sequel II System, check out our summary of Kiran Garimella’s presentation on how it performed at the Broad Institute.
As we geared up for the launch of our new Sequel II System, we had the good fortune of working closely with several expert customers in an early access program. Recently, three of those customers reported on their experience with the new sequencing system in a webinar. In this blog series, we’ll be summarizing each speaker’s presentation, and the full recording is available to view.
First up was Kiran Garimella (@KiranGarimella), a senior computational scientist at the Broad Institute who focused on the use of HiFi reads, which are long (>10 kb) and accurate (>99%) sequences produced by the Sequel II System with circular consensus sequencing (CCS). Garimella and the team at the Broad Institute used the early access program to sequence trios from the Human Genome Structural Variation Consortium (HGSVC), clinical samples, and tumor/normal pairs.
Garimella reported average raw yields of 300 Gb per SMRT Cell 8M across 32 runs. Using a cloud-based pipeline he developed, the Broad Institute processed raw reads into HiFi reads and variant calls in 1-2 days. The HiFi reads, which averaged 10 subread passes, achieved quality scores from Q23 to Q25, which is comparable to the Q24 to Q25 of recent short-read data from Platinum Genomes. Garimella called the level of accuracy “remarkable” for long reads. “We’re very impressed by the PacBio Sequel II data,” Garimella added.
Garimella used the HiFi reads to look at structural variation and haplotype phasing, which have been difficult to resolve with short reads. He showed an example of a heterozygous structural variant in the well-characterized NA12878 genome that is clear in HiFi reads but difficult to detect with short reads. He also showed an example of variant calling in complex loci like the HLA genes. This is “why the Broad is so excited about long-read sequencing,” he added.
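For readers less familiar with quality scores, the standard Phred scale relates a score Q to an error probability p by p = 10^(-Q/10). The sketch below applies that general definition to the scores mentioned above; the helper names are illustrative and are not part of any PacBio or Broad Institute tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability back to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Scores quoted in the text for HiFi reads.
for q in (23, 24, 25):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

By the same definition, Q20 corresponds to 99% accuracy and Q30 to 99.9%, so each 10-point step is a tenfold drop in error rate.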
The NA12878 HiFi dataset, and others from the HGSVC, will be released publicly to help with establishing ground truth benchmarks for structural variation.
For more details, watch Garimella’s full presentation:
Single Molecule, Real-Time (SMRT) Sequencing continues to get smarter and more powerful, with the recent launch of the Sequel II System increasing capabilities and efficiencies of the long-read DNA and RNA PacBio sequencing technology even further. In a special issue devoted entirely to the technology in the MDPI open access journal Genes, guest editors Adam Ameur of Uppsala University and Matthew S. Hestand of the Cincinnati Children’s Hospital Medical Center present eight articles highlighting research conducted using SMRT Sequencing.
As this special issue demonstrates, the benefits of SMRT Sequencing to many different areas of research are becoming evident, not only in basic science, but also in more applied areas such as agricultural, environmental, and medical research. Examples from each of these areas are included in this issue.
Maximizing Minimum Sample Sizes
A new mosquito genome assembly generated via a collaboration between PacBio and the Wellcome Sanger Institute highlighted the capabilities of one of the latest SMRT Sequencing advancements: a new low DNA input protocol. The Sanger scientists used the new protocol to create a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA.
Protecting the Fungus Among Us
Scientists in China and Michigan used SMRT Sequencing to elucidate the medicinal properties of Gloeostereum incarnatum, a precious edible mushroom that is widely grown in Asia. They assembled a high-quality genome of the fungus — the first complete genome to be sequenced in the family Cyphellaceae — and identified gene clusters associated with terpenoid and polysaccharide biosynthesis.
Another team from Jilin Agricultural University and Shenyang Agricultural University in China also investigated edible mushrooms — and a mycoparasite that threatens them. They assembled a high-quality genome of Cladobotryum protrusum, which causes cobweb disease on cultivated mushrooms. They found that the C. protrusum genome, the first complete genome to be sequenced in the genus Cladobotryum, encodes a large and diverse set of genes involved in pathogen–host interactions, mycotoxins, and pigments, and harbors arrays of genes with the potential to produce bioactive secondary metabolites and stress response-related proteins that are significant for adaptation to hostile environments.
Improving Protein Production… via Insects
A new genome assembly of the cabbage looper moth, Trichoplusia ni, may have implications for large-scale genome engineering. As reported by scientists from the National Cancer Institute’s Frederick National Laboratory for Cancer Research, insect cell protein production has emerged as a viable alternative to bacterial and mammalian cells for the production of therapeutically relevant proteins, with several approved vaccines generated in baculovirus-infected insect cells. However, improved protein production using these lepidopteran hosts has been hindered by limited genomic data. By performing de novo genome assembly of the Trichoplusia ni-derived cell line Tni-FNL, the team hopes the reference will bolster future large-scale genome engineering work in recombinant protein production hosts.
Detecting Distinction in Bone Marrow Subpopulations
In a study led by Anne Deslattes Mays and Anton Wellstein from the Lombardi Comprehensive Cancer Center at Georgetown University, the transcriptomes of freshly harvested human bone marrow progenitor (lineage-negative) and differentiated (lineage-positive) cells were analyzed with SMRT full-length RNA sequencing. This Iso-Seq analysis revealed a ~5-fold higher number of transcript isoforms than previously detected and showed a distinct composition of individual transcript isoforms characteristic for bone marrow subpopulations. Check out an additional Q&A with Mays here.
Expanding Genetic Diversity in Human Dataset
Swedish scientists used SMRT Sequencing to expand the diversity of the human genome dataset. They performed de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual, and around 6 Mb of novel sequences (NS) shared with a Chinese personal genome. “Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data,” the authors wrote.
Solving Methylation Calling Challenges
Lastly, a team of bioinformaticians from Iowa, Ohio, and Tokyo presented a statistical solution for observing personal diploid methylomes and transcriptomes. As they report, observing CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs) is challenging due to the scarcity of PHVs in personal genomes. While SMRT Sequencing is a promising avenue for addressing this challenge, as it outputs long reads with CpG methylation information, phasing the CpG sites can still introduce errors. Their paper proposes a model that reduces the error rate to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity.
A preprint released this week from the Genome in a Bottle (GIAB) Consortium describes a benchmark set of structural variants (SVs), differences ≥50 bp, in the genome of a human male named HG002. The GIAB benchmark is the first to allow measuring precision (false positives) and recall (false negatives) of different approaches to detecting structural variants. The GIAB Consortium also developed a tool, Truvari, to support evaluation of variant call sets against the benchmark.
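To make the two metrics concrete, here is a minimal sketch of how precision and recall follow from counts of true positives, false positives, and false negatives once a call set has been compared against a benchmark. The function and the example counts are hypothetical; this is not Truvari's interface, and it leaves out the hard part, which is deciding when a called SV matches a benchmark SV.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from benchmark comparison counts.

    precision = TP / (TP + FP): fraction of calls that are in the benchmark.
    recall    = TP / (TP + FN): fraction of benchmark variants that were called.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts for illustration only.
precision, recall = precision_recall(tp=900, fp=100, fn=300)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A caller that reports fewer false positives raises precision, while one that misses fewer benchmark variants raises recall; a benchmark is what lets both be measured at once.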
Earlier GIAB benchmarks, first released in 2014 and last updated in 2017, have led to enormous improvements in the quality and consistency of calling single-nucleotide and small indel variants. However, due to prior limits of DNA-sequencing technologies, the benchmark has not included SVs, which are fewer in number than small variants but in total cover more base pairs.
To extend the benchmark to SVs, the GIAB Consortium sequenced HG002 and his parents with short-, linked-, and long-read technologies; analyzed the reads with 26 different software variant callers; and integrated the different methods into a final set of 12,745 high-confidence SVs across 2.69 Gb of well-characterized “Tier 1” regions in the 3 Gb human genome. The high-confidence SVs match the expected size distribution for a human genome, with the number of variants decreasing with variant size except at the size of ALU (300 bp) and LINE1 (6 kb) repeats. The high-confidence SVs also show nearly perfect Mendelian consistency, with the genotype in HG002 being consistent with inheritance from his parents.
PacBio long reads, which provide high precision and recall for structural variants, were particularly important to the benchmark. GIAB required support from PacBio long reads for all of the high-confidence variants. Further, GIAB reports “many SVs only detectable with long reads [especially in tandem repeats]” and concludes “[t]hese results confirm the importance of long read data for comprehensive SV detection.”
If you would like to use the benchmark to evaluate how well you detect SVs, GIAB provides DNA reference material and datasets, including 32-fold coverage of accurate long reads from the PacBio Sequel II System. We also offer a tutorial on how to use the GIAB datasets and the SV benchmark to evaluate precision and recall.
In the future, the GIAB Consortium plans to extend the SV benchmark to the other genomes in its portfolio, namely HG001/NA12878 and HG005. They also plan to incorporate new data, such as highly accurate long HiFi reads from PacBio, to improve the quality and scope of the benchmark.
Two years ago, Carola Greve and colleagues at the Zoological Research Museum Alexander Koenig in Bonn, Germany, were seeking to #SeqtheSlug as part of the 2017 Plant and Animal SMRT Grant competition, and the popular project was a close runner-up. Greve didn’t give up on her quest to sequence the ‘solar-powered’ sea slug. We caught up with her recently at the SMRT Leiden Scientific Symposium, where her update on the sea slug project earned her a Best Poster award.
Why the sea slug?
Although Mollusca represents the second largest animal phylum with around 85,000 extant species, only 23 mollusc genomes are publicly available in the NCBI genome database, and when we started our project, no reference genome had been generated for any sacoglossan (algae-ingesting) mollusc. Some of these sacoglossan species are particularly interesting because of their ability to sequester chloroplasts from their food algae. These ‘stolen’ plastids, also known as kleptoplasts, are then stored in a functional state in the digestive gland cells of the slugs and presumably allow them to endure weeks or even months of starvation.
How do the slugs keep the chloroplasts active which continuously produce starch inside the slugs? No one knows! But up to now, this spectacular phenomenon, termed functional kleptoplasty, is unique among animals. We wanted to scour the genome for genes associated with this unique ability and further provide a valuable genomic resource for future genome-wide comparative analyses to organisms with similar lifestyles, i.e. those stealing useful parts out of their prey and incorporating, instead of digesting them.
How has the project progressed?
We created a mitochondrial genome for Plakobranchus cf. ocellatus (van Hasselt 1824), a sea slug species found in the Philippines, using Illumina short reads. Another team (Huimin Cai, et al. 2019) beat us to a draft genome assembly earlier this year, of the species Elysia chlorotica, using a hybrid Illumina/PacBio approach. Their genome assembly comprised 9,989 scaffolds, with a total length of 557 Mb, and their annotation identified 176 Mb (32.6%) of repetitive sequences with 24,980 protein-coding genes.
We’d like to improve the genome using PacBio, and we are now working with Kornelia Neveling, a molecular geneticist at Genome Diagnostics Nijmegen, Radboud University Medical Center, to get some good reads from our P. ocellatus samples. Once we get the results, we want to compare the genomes of the two slug species, as well as examine the secondary metabolites that they produce, such as polypropionates, which might be interesting for pharmaceutical applications.
What have you learned so far?
We’ve learned that it’s really difficult to extract high molecular weight DNA from these slimy creatures, and there seems to be something in the process that is inhibiting the enzymes involved in sequencing. The challenges have helped me in my new role, however. I am now at the Senckenberg Research Institute and Natural History Museum in Frankfurt, where I am leading the Translational Biodiversity Genomics lab. One of our main purposes is to extend biodiversity research to non-model organisms, and make it more accessible for basic and applied research. As part of this, we are establishing DNA extraction protocols for species that are hard to work with, such as worms, insects, slugs, and historical museum samples.
Have an interesting project you’d like to try using SMRT Sequencing? For a look at our ongoing and future SMRT Grant programs, check out www.pacb.com/smrtgrant
Malaria is a complicated killer, and efforts to develop effective vaccines have been hindered by gaps in our understanding of both the parasite that causes the infection, Plasmodium falciparum, and its transmitter, the mosquito.
Like many virulent parasites, P. falciparum has evaded close genetic scrutiny due to its complex and changing composition. Its 23 Mb haploid genome is extremely AT rich (~80%) and contains stretches of highly repetitive sequences, especially in telomeric and subtelomeric regions. To make matters more complicated, it expands its genetic diversity during mitosis via homologous recombination, leading to the acquisition of new variants of virulence-associated surface adhesion molecules.
Attempts to decode the P. falciparum genome with short reads have resulted in extremely fragmented assemblies of more than 20,000 contigs each, with N50 contig sizes less than 2 kb in length, barely long enough to contain a single intact gene, let alone to allow resolution of highly homologous gene families that are often chained end-to-end across large regions. Early shotgun sequencing missed polymorphisms such as insertions and deletions, copy number variants, chromosomal rearrangements and structural variants in P. falciparum’s hypervariable and highly repetitive regions.
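For readers unfamiliar with the N50 statistic used here, it is the contig length at which contigs of that size or longer account for at least half of the total assembly. The sketch below computes it for toy data; the function name and the example lengths are made up for illustration and do not come from any of the assemblies discussed.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest and their lengths are summed
    until the running total reaches half of the assembly size; the length of
    the contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: running total always reaches the assembly size")


# Toy example: total length 30, so the threshold of 15 is crossed by the 8-unit contig.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

An assembly split into many short contigs has a low N50 even if its total length is right, which is why the statistic is a common shorthand for contiguity.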
In 2016, PacBio collaborated with scientists from Institut Pasteur and Cold Spring Harbor to create a complete telomere-to-telomere de novo assembly in which all 14 chromosomes were resolved into single contigs. Even extremely AT-rich regions were resolved with uniform coverage and subtelomeric regions of all chromosomes were successfully assembled in a single run for the first time.
As PacBio microbiology expert Meredith Ashby explained in a recent Labroots webinar, the assembly was “game changing.”
The most challenging parts of a genome are often the most important to decode, she explained. Regions of high homology, for example, facilitate recombination events, a key mechanism for rapid genome evolution. In other words, they are “where most of the exciting things are happening” in terms of immune evasion and drug resistance.
Not only has the new reference genome facilitated better analysis of these areas, as well as structural variants and large-scale changes in the genome, it has also enabled better SNP calling, Ashby said. This is important because some traits, including drug resistance, may be SNP driven.
By mapping short reads to single-molecule sequenced reference genomes, you can more confidently tell the difference between genes and pseudogenes, or between genes and new duplications, she said. And many clinical isolates are sequenced using short-read technology.
Since the new reference genome was published, an additional 15 Plasmodium isolates have been assembled to near completeness. There have already been several publications about new discoveries enabled by these new assemblies, from asexual replication to locally divergent selection.
Host with the most
To fully understand malaria, however, we must also understand its host. The genome of malaria vector Anopheles coluzzii was recently assembled from a single individual using our new low input protocol.
PacBio technology has also enabled the assembly of a much improved genome of Aedes aegypti, which transmits yellow fever, Zika, dengue and chikungunya.
Like that of Plasmodium, the A. aegypti genome is highly repetitive, and early sequencing attempts resulted in assemblies with 37,000 contigs. SMRT Sequencing reduced this to around 2,500 contigs and increased the contig N50 from 84 kb to 11.8 Mb, a far more contiguous assembly.
The AaegL5 assembly revealed an enormous number of new genes, including a far more comprehensive catalogue of odorant, gustatory and ionotropic receptors, which could provide important information for pest control strategies based on feeding and mating. The Rockefeller University researchers also identified hotspots that were under selective pressure for insecticide resistance.
Also of interest to infectious disease researchers: findings involving serine proteases, which mediate immune responses, and metalloproteases, which are linked to mosquito–Plasmodium interactions. Half of the 404 serine protease and metalloprotease gene models were improved in the AaegL5 assembly, and 49 novel proteases were discovered, Ashby said. Other vector competence hotspots were also identified, such as QTLs on chromosome 2 that were linked to systemic dengue virus dissemination in midgut-infected mosquitoes.
“Malaria, yellow fever, zika, dengue and chikungunya cause millions of deaths worldwide every year,” Ashby said. “Hopefully these new references will yield new insight into all kinds of things that are important to reduce the global burden of infectious diseases.”
To learn more about the application of SMRT Sequencing technology in infectious disease research, watch the full Labroots presentation, or visit the PacBio team at American Society for Microbiology (ASM) Microbe 2019 at booth 1160.
A recent review article published in Frontiers in Genetics offers a great look at the landscape of long-read sequencing. Authors Tuomo Mantere, Simone Kersten, and Alexander Hoischen from Radboud University Medical Center in the Netherlands focus on emerging applications in medical genetics for long-read technologies.
“With the recently demonstrated success in identifying previously intractable DNA sequences and closing gaps in the human genome assemblies, long-read sequencing (LRS) technologies hold the promise to overcome specific limitations of NGS-based investigations of human diseases,” the scientists write. “LRS has the potential to grow into a technology that is used not only to produce high-quality genome assemblies (i.e., the platinum human reference genome), but also to capture clinically relevant genomic elements which are problematic for conventional approaches.”
The team cites four use cases for which long-read sequencing is particularly well suited, diving into detail about each:
- Discovering disease-causing structural variants missed by short-read technologies
- Direct and accurate sequencing of repeat expansions or GC-rich regions
- Enhanced phasing of variants to determine haplotypes and parent of origin
- Distinguishing between genes and pseudogenes
The review highlights peer-reviewed research publications describing novel discovery across each application in a wide variety of disease areas including: X-Linked Parkinsonism, ALS and FTD, SCA10 and Parkinson’s disease, Myotonic dystrophy, Bardet-Biedl syndrome, and Fragile X disorders.
The authors also examine transcriptome analysis, indicating that long reads have been successfully employed to sequence full-length mRNA transcript isoforms as a complement to existing short-read RNA sequencing approaches. “Ultimately, this knowledge can be implemented to improve WES/WGS-based variant filtering, prioritization and prediction of their functional impact,” Mantere et al. report.
The authors conclude with a prediction that, in the future, “the broader use of [targeted long-read sequencing] could significantly increase the diagnostic yield of genetic testing and discover novel disease genes.”
When looking to understand the functional implications of genetic variability, scientists should seek out the Iso-Seq method, according to Cold Spring Harbor researchers.
In a recent paper published in Frontiers in Genetics, Doreen Ware, Bo Wang, and colleagues reviewed the state of transcript sequencing and analysis technologies, and concluded that single-molecule sequencing from PacBio provided several advantages over other methods.
A major challenge in molecular biology continues to be the complex mapping of the same genome to diverse phenotypes in different tissue types, developmental stages, and environmental conditions, the paper states.
“A better understanding of the transcripts and expression of gene regulation is not only non-trivial but lies at the heart of this challenge,” the authors write.
RNA sequencing can support both the discovery and quantification of transcripts using a single high-throughput sequencing assay. But methods that rely on short reads have several limitations in revealing gene regulation, the protein-coding potential of the genome and ultimately the phenotypic diversity.
Long-read SMRT Sequencing for RNA characterization has the advantage of rendering, in vitro and without ambiguity, a full-length transcript sequence without depending on the error-prone computational step of assembly. As a result, long reads allow more precise detection of alternative splicing events and, ultimately, of novel isoforms, making it easier to build gene models for species that are poorly studied or have an incomplete or missing reference genome, the authors state.
“With the development of single-molecule sequencing technology, ‘one read is one transcript’ is not a dream anymore, and scientists can get the intact sequence of each isoform by sequencing a single cDNA molecule,” the authors write.
The Iso-Seq approach offers particular advantages in the characterization of polyploid transcriptomes, which have a large number of repeats and homeologous genes, and in the profiling of allele-specific expression, Ware and Wang state.
They also detail experimental and informatic pipelines and highlight several downstream applications of the Iso-Seq method, including:
- alternative splicing
- alternative polyadenylation (APA)
- fusion transcripts
- long non-coding RNAs (lncRNAs)
- isoform phasing, and
- genome annotation
Regarding the last item, the team states that the Iso-Seq method can increase the accuracy of automated genome annotation by improving genome mapping of sequencing data, correctly identifying intron-exon boundaries, directly identifying alternatively spliced transcripts, identifying transcription start and end sites, and providing precise strand orientation for single-exon genes. Mapped against a reference genome, the full-length transcripts that are uncovered can be used to improve or add de novo structural and functional annotation to a genome, and to improve genome assembly and existing gene models, they state.
“Iso-Seq is known to retrieve longer isoforms as well as more number of isoforms… This has revolutionized our understanding of the biology of a number of organisms, including plants and animals, since transcript diversity usually represents functional diversity,” the authors write.
Iso-Seq analysis has also benefited evolutionary studies, as it allows scientists to compare the splicing variants between species and better understand the conservation of genes/isoforms, the divergence of splicing patterns, and the significance of their expression levels.
The next challenge? Deciding what to do with all the new isoforms identified by the Iso-Seq method.
The growing number of isoforms identified from different tissues and conditions within an organism will need to be ranked and prioritized for community research. Not all of them will have a meaningful impact on cellular biological processes, Ware and Wang note, so the results will have to be carefully validated and characterized.
“Experimental approaches such as CRISPR could help by targeting the role of each isoform, and see if there are redundant or complementary functions among these different splicing isoforms,” they conclude.
Hundreds of SMRT scientists came together recently in Leiden to learn about the latest updates to PacBio technology and to showcase their data analysis tools. Extremely useful information was shared, and future collaborations were sparked. For those who weren’t able to jet to the Netherlands to attend, we’ve rounded up the top tools and tips presented at the European SMRT Informatics Developers Meeting. For an in-depth report on the event, check out this blog post by PacBio Principal Scientist Elizabeth Tseng.
- SMRT Link – Of course, our own open-source SMRT analysis software suite tops the list. Updates to the system have resulted in many improvements, including 8x faster time-to-results for CCS generation and 20x faster mapping with minimap2 using our own wrapper pbmm2; important improvements to CCS to support PacBio’s HiFi data type; detection of more types of structural variants; increased automation; and PDF reports.
- Bioconda – Want to be the first to try out new and improved analysis tools? Many updates to PacBio algorithms, assembly packages, and other tools are available on Bioconda before their official release, including the latest Sequel II System changes.
- pbsv – Our structural variant (SV) calling and analysis tool has also been updated. What’s new? An increase in sensitivity for large insertions and deletions, and calling of duplications and copy number variation… meaning that pbsv now calls all major SV types 20 bp and longer.
- DAZZLER Suite – Need to find all significant local alignments between reads? Or to remove chimeras, adaptamers, and low-quality dropouts? Da’ Gene Myers (@TheGeneMyers) has an app for that. Or several, actually, including DALIGNER, DASCRUBBER and DAMASKER. Myers announced he has updated the suite to better support highly accurate, long HiFi reads.
- PRINCESS – Prolific toolmaker Fritz Sedlazeck (@sedlazeck), creator of the SV caller Sniffles, unveiled his work-in-progress, PRINCESS, a Snakemake pipeline to call and phase SNPs and SVs. Keep an eye on his GitHub site to snag it when it drops.
- TAMA – The all-in-one Transcriptome Annotation by Modular Algorithms tool by Iso-Seq expert Richard Kuo (@GenomeRik) can do many things, including mapping RNA reads to transcript annotations, merging annotations (for example, combining PacBio data with references like Ensembl), identifying coding regions, and associating them with known genes.
- SQANTI – This quality control pipeline by Ana Conesa (@anaconesa) can categorize Iso-Seq data against a reference annotation. It allows users to see which genes/transcripts are novel/known and offers detailed annotations on canonical/non-canonical junctions. A modified version of SQANTI is SQANTI2 by Elizabeth Tseng (@magdoll).
- TAPPAS – A Java-based application, also by Ana Conesa, that creates beautiful visualizations utilizing information at both the transcript and protein level. It can identify differential expression at both the isoform level and the gene level.
- pyPaSWAS – Program for DNA/RNA/protein sequence alignment, read mapping and trimming, by Sven Warris (@swarris).
- WhatsHap – Software from Tobias Marschall (@tobiasmarschal) for read-backed phasing of variants. Jana Ebler discussed an extension to WhatsHap to simultaneously call and phase variants in long reads.
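The SQANTI entry above describes categorizing Iso-Seq transcripts against a reference annotation. As a rough illustration of that idea only, a transcript can be reduced to its chain of splice junctions and compared with the chains in the reference. The sketch below is not SQANTI's code, categories, or output format; the function names, category labels, and coordinates are all hypothetical and chosen just for this example.

```python
from typing import List, Sequence, Tuple

# A splice junction as (donor position, acceptor position); coordinates are made up.
Junction = Tuple[int, int]


def contains_run(chain: Sequence[Junction], sub: Sequence[Junction]) -> bool:
    """Return True if `sub` occurs as a consecutive run inside `chain`."""
    n = len(sub)
    return any(tuple(chain[i:i + n]) == tuple(sub)
               for i in range(len(chain) - n + 1))


def classify_transcript(query: Sequence[Junction],
                        reference: List[Sequence[Junction]]) -> str:
    """Assign a simplified category to a transcript's junction chain.

    Hypothetical labels used in this sketch:
      mono_exon         - the transcript has no junctions
      full_match        - junction chain identical to a reference transcript
      partial_match     - consecutive subset of a reference junction chain
      novel_combination - all junctions known, but in a new combination
      novel_junction    - at least one junction absent from the reference
    """
    query = tuple(query)
    if not query:
        return "mono_exon"
    ref_chains = [tuple(chain) for chain in reference]
    if query in ref_chains:
        return "full_match"
    if any(contains_run(chain, query) for chain in ref_chains):
        return "partial_match"
    known = {junction for chain in ref_chains for junction in chain}
    if all(junction in known for junction in query):
        return "novel_combination"
    return "novel_junction"


# One reference transcript with three introns (made-up coordinates).
reference = [[(100, 200), (300, 400), (500, 600)]]
print(classify_transcript([(100, 200), (500, 600)], reference))
```

Here the query skips the middle exon pair, so it reuses known junctions in a combination the reference does not contain. A real pipeline would also have to handle strand, chromosome, and transcript start and end positions, which this sketch ignores.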
Variant callers are not all the same – in fact, there are times when their algorithms don’t agree. So, what do you do? Ryan E. Mills (@ryan_e_mills), an assistant professor at the University of Michigan, laid out the problem — and two of his solutions — in a presentation at the Labroots Genetics and Genomics conference:
- VaPoR – A structural variant validator that uses a dotplot of PacBio reads against the reference genome to visualize and automatically score candidates for patterns that suggest deletions, insertions, tandem duplications, or inversions.
- PALMER – The Pre-mAsking Long reads for Mobile Element InseRtion tool detects non-reference MEI events (LINE, Alu and SVA) and other insertions, by using the indexed reference-aligned BAM files from long-read technology as inputs. It uses the track from RepeatMasker to mask the portions of reads that aligned to these repeats, defines the significant characteristics of MEIs (TSD motifs, 5′ inverted sequence, 3′ transduction sequence, polyA-tail), and reports sequences for each insertion event.
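VaPoR is described above as working from a dotplot of reads against the reference. A toy version of the dotplot idea can be built from exact k-mer matches: each shared k-mer is a dot, and a shift between diagonals hints at an insertion or deletion. This is a minimal sketch under simplifying assumptions (exact matching, short made-up sequences, hypothetical helper names `kmer_dotplot` and `diagonal_offsets`); it does not reproduce VaPoR's actual scoring.

```python
from collections import defaultdict
from typing import List, Tuple


def kmer_dotplot(reference: str, read: str, k: int = 4) -> List[Tuple[int, int]]:
    """Return (reference_pos, read_pos) pairs where the two sequences share a k-mer."""
    # Index every k-mer position in the reference.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    # Look up each read k-mer and record one dot per shared position.
    dots = []
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], []):
            dots.append((i, j))
    return dots


def diagonal_offsets(dots: List[Tuple[int, int]]) -> List[int]:
    """Return the distinct diagonals (reference_pos - read_pos) that carry dots."""
    return sorted({i - j for i, j in dots})


# Made-up example: the read lacks the 6 bases "GGGTTT" present in the reference.
reference = "ACGTAC" + "GGGTTT" + "CATGCA"
read = "ACGTAC" + "CATGCA"
print(diagonal_offsets(kmer_dotplot(reference, read)))  # prints [0, 6]
```

The dots before the missing segment fall on diagonal 0, while those after it fall on diagonal 6, matching the length of the deleted sequence. Real reads contain sequencing errors and repeats, so a practical tool would need inexact matching and a way to score noisy diagonals.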