This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Maize researchers have been rejoicing over a New Year’s gift delivered by a group of 33 scientists: A 26-line “pangenome” reference collection.
The multi-institutional consortium of researchers used the Sequel System and BioNano Genomics optical mapping to create the assemblies and high-confidence annotations. They released the results on January 9 and presented them in several talks at the Plant and Animal Genome XXVIII Conference, less than two years after the ambitious project was funded by a $2.8 million National Science Foundation grant.
The collection includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines — the most extensively researched maize lines that represent a broad cross section of modern maize diversity — as well as an additional line containing abnormal chromosome 10.
Scientists can download the project’s raw whole genome sequencing data, RNA sequencing data, optical map data, gene annotations and gene models at MaizeGDB. The site also features browsing and data visualization tools.
Led by faculty investigator R. Kelly Dawe (@corncolors), Distinguished Research Professor at the University of Georgia, Matt Hufford (@mbhufford), associate professor at Iowa State University, and Doreen Ware, a computational biologist at USDA and Cold Spring Harbor Laboratory, the NAM Consortium also included scientists from Corteva Agriscience, who are conducting their own large-scale sequencing effort of the company’s maize lines as well.
“People have been using these particular lines for years, so everybody has been really excited to get these new references as a resource,” Hufford said. “The assemblies that have come out are better than anything else that’s out in maize.”
Maize has been extremely challenging to sequence because the vast majority of its 2.3 Gb genome — a staggering 85 percent — is made up of highly repetitive transposable elements. It is also amazingly diverse. A study comparing genome segments associated with kernel color from two inbred lines revealed that 12 percent of the gene content was not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit more than 98 percent sequence similarity.
The 26 varieties were prepped at the Arizona Genomics Institute, sequenced at the University of Georgia, Oregon State University, and Brigham Young University, and assembled by the NAM Consortium using PacBio long reads. Scaffolds were validated by BioNano optical mapping, and ordered and oriented using linkage and pan-genome marker data. RNA-seq data from multiple tissues were used to annotate each genome using a pipeline that included BRAKER, Mikado and PASA.
“We spent a lot of time on gene model annotation, validation and benchmarking against B73 (the first reference genome annotations for maize, created by Ware’s lab in 2009, and updated in 2017) and other maize genes that have been manually curated by the community,” Hufford said.
Now comes the fun part: Peering into all the data and seeing what secrets it will reveal.
“For the last few months, we have started to see the cool biology emerging,” Hufford said. “What we are seeing is a lot of structural variation linked to phenotypic traits we haven’t been able to explain before.”
In addition to answering questions about basic biology and agronomic variation, the data is shedding light on the evolution of the different maize lines.
“We’re learning about the tempo of gene loss following a genome doubling event several million years ago. It appears to be ongoing, and still in flux,” Hufford said.
Next steps for the consortium include additional functional annotations for the NAM gene models, such as transposable elements, SNPs and insertions, as well as methylome and ATAC-Seq data.
“These data will help the maize community assess the role of variation in the determination of agronomic traits,” Hufford said.
Hufford will also be using SMRT Sequencing on the Sequel II System for two other large assembly projects covering teosintes, the wild relatives of maize, and other grass species.
“I think it’s really going to help with some of these complex varieties,” he said.
Learn more about the methods and workflow for PacBio whole genome sequencing.
By Zev Kronenberg, Senior Engineer of Bioinformatics at PacBio
Since the introduction of HiFi reads, the community has embraced these long and highly accurate reads for human genome assembly and paralog resolution [1-5]. At PacBio, the assembly team (Figure 1) is working to build on the accuracy of HiFi data for direct phasing during assembly.
In diploid organisms, phasing an assembly means separating the maternally and paternally inherited copies of each chromosome, known as haplotypes. Each phased contig, or haplotig, is made up of reads from the same parental chromosome (Figure 2). Phased genomes give better quality than collapsed genomes; they provide allelic information, which can be important for studying human diseases, crop improvement, evolution, and more.
Figure 2. Phased de novo assembly. A collapsed haploid assembly meshes contigs from different haplotypes (unphased assembly), while a partially phased assembly may still switch between the two haplotypes in its primary contigs. A fully phased assembly would cleanly separate the two haplotigs.
FALCON-Unzip is a diploid-aware genome assembler that has been used to assemble and phase many PacBio genomes. It first creates a collapsed assembly, then uses heterozygous single nucleotide variants to partition the reads by haplotype and reassemble them into haplotigs. The assembly outputs are primary contigs with associated haplotigs (Figure 3).
Figure 3. FALCON-Unzip phasing and haplotig assembly steps. In the first stage primary contigs and associate contigs are produced, reads are aligned to the primary contigs, and phased. The phase is then re-introduced to the assembly graph, followed by re-assembly.
While FALCON-Unzip has consistently given our users excellent results, it was built for long reads with higher error rates and does not take advantage of the high accuracy of the HiFi reads. In 2019, FALCON-Unzip was adapted for HiFi data, producing high-quality results. However, the current implementation still requires iterative assembly, and does not use indels for phasing. Therefore, we have started working on a new graph cleaner called Nighthawk that simplifies the assembly graph by removing cross-haplotype alignment overlaps, which can significantly speed up and improve assembly. While still a work in progress, the preliminary results are promising.
Nighthawk: A smart, efficient assembly graph cleaner
Nighthawk uses that classical bioinformatics data structure, the De Bruijn graph, to identify genetic variants (substitutions, insertions, and deletions) and remove cross-haplotype overlaps in the assembly string graph.
Most long-read genome assemblers follow the overlap-layout-consensus (OLC) workflow. The overlap stage begins with a pairwise alignment of all reads (Figure 4A). For each read, a pile of alignments to all other reads is generated. The goal of Nighthawk is to detect and remove cross-haplotype overlaps, that is, alignments between reads that come from different haplotypes. It also needs to remove other false alignments that come from paralogs, repeats, etc.
Given a pile of reads, Nighthawk builds a read-colored k-mer De Bruijn graph, where each node represents a k-mer; node colors denote a unique set of reads (Figure 4B). For each read overlap, Nighthawk calculates a read similarity score (RSS). The RSS is based on the number of variants shared between the two reads. A positive RSS indicates that the reads are in phase with one another, while a negative RSS suggests that the read overlap is cross-haplotype and should be removed (Figure 4C). Nighthawk removes overlaps with a negative RSS. The remaining overlaps are then passed on for the layout and consensus stage of assembly (Figure 4D).
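The scoring idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Nighthawk's implementation: it assumes error-free reads that all span the same region, treats any k-mer missing from at least one read as a variant node, and uses a made-up scoring rule (+1 for a variant k-mer both reads carry, -1 for one that only one of them carries). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def colored_kmers(reads, k):
    """Map each k-mer (graph node) to its color: the set of reads containing it."""
    colors = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors


def read_similarity_scores(reads, k):
    """Score every read pair in the pile (hypothetical rule, see lead-in)."""
    colors = colored_kmers(reads, k)
    n_reads = len(reads)
    # Assumption: nodes not carried by every read lie on bubble branches.
    variant_nodes = [c for c in colors.values() if len(c) < n_reads]
    scores = {}
    for a, b in combinations(sorted(reads), 2):
        score = 0
        for color in variant_nodes:
            in_a, in_b = a in color, b in color
            if in_a and in_b:
                score += 1      # both reads carry this variant k-mer
            elif in_a or in_b:
                score -= 1      # only one read carries it
        scores[(a, b)] = score
    return scores


def keep_in_phase(scores):
    """Drop overlaps with a negative score, keep the rest."""
    return [pair for pair, score in scores.items() if score >= 0]


# Toy pile: two reads per haplotype, haplotypes differ by one SNV at index 10.
hap_a = "ACGTTGCAGGCTAATCCGTA"
hap_b = hap_a[:10] + "T" + hap_a[11:]
reads = {"r1": hap_a, "r2": hap_a, "r3": hap_b, "r4": hap_b}

scores = read_similarity_scores(reads, k=5)
kept = keep_in_phase(scores)
print(kept)
```

In this toy pile only the two within-haplotype pairs survive the filter; real piles would also have to cope with sequencing errors, partial overlaps, and repeats.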
It is amazing to see how clean a HiFi-based De Bruijn graph is (Figure 5). This is often a work of art in itself! After running Nighthawk, the overlaps can then be passed into string graph assemblers such as FALCON for assembly.
Figure 4. The Nighthawk workflow. Nighthawk builds a colored De Bruijn graph from read overlaps. Overlaps are scored by shared variants between two reads. Overlaps with negative RSS indicate cross-phase overlaps and are removed. The resulting overlaps are passed to a string graph assembler (such as FALCON) for phased assembly.
Figure 5. A HiFi De Bruijn graph for a pile of reads from Drosophila genome sequencing. Each dot represents a k-mer (k=23), the edges denote neighboring k-mers. The larger red dots mark the head of heterozygous bubbles.
Testing Nighthawk on a HiFi data set
We evaluated how well Nighthawk’s RSS could distinguish in-phase and cross-phase overlaps against three ground truth sets (Table 1). In all three data sets, Nighthawk’s RSS was able to distinguish in-phase read overlaps (true positives) from cross-phase read overlaps (true negatives) while having very few false positives and false negatives.
But what effect does Nighthawk’s graph cleaning have on the assembled genome? Our team patched Nighthawk into FALCON and assembled a heterozygous (0.6%) F1 Drosophila HiFi data set. The haploid genome size is 140 Mb, so a perfectly assembled diploid genome would consist of a total of 280 Mb in primary and associated contigs.
Our Nighthawk-FALCON assembly produced 247.1 Mb of primary contigs and 14.9 Mb of associated contigs, creating a diploid genome that’s a total of 262 Mb (93.9%). The phasing accuracy, as measured by parental k-mers, was much better using Nighthawk for both primary and associated contigs compared to other methods.
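One way to picture a parental k-mer check is sketched below. It is an assumption about the general approach rather than the team's actual evaluation code: k-mers found in only one parent are counted along a contig, and a contig that mixes both kinds has switched haplotypes somewhere. The sequences, the value of k, and the function names are all invented for illustration.

```python
def kmer_set(seq, k):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parental_kmer_counts(contig, maternal_only, paternal_only, k):
    """Count positions in the contig that hit maternal-only or paternal-only k-mers."""
    mat = pat = 0
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        if kmer in maternal_only:
            mat += 1
        elif kmer in paternal_only:
            pat += 1
    return mat, pat


def phase_purity(mat, pat):
    """Fraction of parent-specific k-mers that come from the majority parent."""
    total = mat + pat
    return max(mat, pat) / total if total else None


K = 5
# Toy parental haplotypes that differ at two SNVs (indices 10 and 22).
maternal = "ACGTTGCAGGCTAATCCGTAGACTTCAGAT"
paternal = maternal[:10] + "T" + maternal[11:22] + "G" + maternal[23:]

maternal_only = kmer_set(maternal, K) - kmer_set(paternal, K)
paternal_only = kmer_set(paternal, K) - kmer_set(maternal, K)

clean_contig = maternal                            # perfectly phased
switched_contig = maternal[:15] + paternal[15:]    # switches haplotype midway

clean = parental_kmer_counts(clean_contig, maternal_only, paternal_only, K)
switched = parental_kmer_counts(switched_contig, maternal_only, paternal_only, K)
print(clean, phase_purity(*clean))
print(switched, phase_purity(*switched))
```

The cleanly phased contig contains k-mers from one parent only, while the switched contig carries parent-specific k-mers from both, which lowers its purity.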
Toward a truly phased assembly
We have shown that HiFi data alone can be used to effectively phase a Drosophila genome. Our new tool, Nighthawk, is an assembly graph cleaner that uses the accuracy of HiFi reads for variation detection. The phasing of the primary and associate contigs improves compared to FALCON when Nighthawk is used to filter out cross-phase alignment overlaps.
Nighthawk is still a work in progress, and many challenges remain. One such challenge is the use of alignment identity as a filter to identify cross-phase overlaps. Setting the right identity threshold is a Goldilocks problem: a filter that’s too stringent would fragment the assembly, while a filter that’s too relaxed would not remove all the false overlaps. Another challenge is complex graph structures that may arise from repeat structures, homozygosity, lack of overlap coverage, etc.
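The identity trade-off can be illustrated with a small sketch. The overlap records and identity values below are invented toy numbers, not measurements, and the helper names are hypothetical; the point is only to show how moving the threshold changes what survives.

```python
def alignment_identity(aligned_a, aligned_b):
    """Fraction of matching columns in two equal-length aligned strings ('-' is a gap)."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must be non-empty and of equal length")
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


def summarize(overlaps, min_identity):
    """Return (in-phase overlaps kept, cross-phase overlaps kept) at a threshold."""
    kept = [o for o in overlaps if o["identity"] >= min_identity]
    in_phase = sum(o["in_phase"] for o in kept)
    return in_phase, len(kept) - in_phase


# Hypothetical overlaps with a made-up ground truth label.
overlaps = [
    {"identity": 0.9995, "in_phase": True},
    {"identity": 0.9990, "in_phase": True},
    {"identity": 0.9978, "in_phase": True},   # in phase, but with read errors
    {"identity": 0.9985, "in_phase": False},  # cross-phase in a low-variation region
    {"identity": 0.9940, "in_phase": False},
]

too_strict = summarize(overlaps, 0.9999)   # loses true overlaps
too_relaxed = summarize(overlaps, 0.9900)  # keeps false overlaps
middle = summarize(overlaps, 0.9980)       # still imperfect on this toy data
print(too_strict, too_relaxed, middle)
```

Even the middle threshold both drops a true overlap and keeps a false one here, which is why identity alone is an imperfect filter.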
Nighthawk is only the first piece in the overlap-layout-consensus assembly process. Our team is continuing to modify string-graph algorithms to recognize the graph structures Nighthawk generates. We are excited about the new possibility HiFi data brings and believe that fast, direct phased assemblies will be feasible in the not-too-distant future.
The PacBio assembly team would like to thank Tobias Marschall (@tobiasmarschal) for the inspiration to use De Bruijn graphs for variant calling (NCBI Hackathon 2019) and Mark Chaisson (@mjpchaisson) for technical guidance on avoiding common pitfalls.
[1] Wenger et al., “Accurate Circular Consensus Long-Read Sequencing Improves Variant Detection and Assembly of a Human Genome”, Nature Biotechnology (2019)
[2] Vollger et al., “Improved Assembly and Variant Detection of a Haploid Human Genome Using Single-Molecule, High-Fidelity Long Reads”, Annals of Human Genetics (2019)
[3] Vollger et al., “Long-Read Sequence and Assembly of Segmental Duplications”, Nature Methods (2019)
[4] Garg et al., “Efficient Chromosome-Scale Haplotype-Resolved Assembly of Human Genomes”, bioRxiv (2019)
[5] Porubsky et al., “A Fully Phased Accurate Assembly of an Individual Human Genome”, bioRxiv (2019)
[6] Chin et al., “Phased Diploid Genome Assembly with Single-Molecule Real-Time Sequencing”, Nature Methods (2016)
[7] Kronenberg et al., “High-quality Human Genomes Achieved through HiFi Sequence Data and FALCON-Unzip Assembly”, ASHG Poster (2019)
[8] Garg et al., “A Graph-Based Approach to Diploid Genome Assembly”, Bioinformatics (2018)
[9] Patterson et al., “WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads”, Lecture Notes in Computer Science (2014)
[10] Koren et al., “De Novo Assembly of Haplotype-Resolved Genomes with Trio Binning”, Nature Biotechnology (2018)
A hearty congratulations to Cleo van Diemen at the University Medical Center Groningen for winning the 2019 Neuroscience SMRT Grant!
Van Diemen’s impressive proposal involves using PacBio long-read sequencing to find new genetic mechanisms associated with spinocerebellar ataxia (SCA). While some 70% of SCA patients can get clear diagnostic and prognostic information because they carry a mutation in one of the ~37 genes known to be associated with this condition, 30% of patients have no such clarity. In this project, van Diemen and her colleagues will use their SMRT Grant award to generate highly accurate long reads for two SCA patients with unknown disease etiology.
As team leader of the research & development unit of the genome diagnostics section of the genetics department, van Diemen aims to introduce new technologies to help her colleagues achieve their research and diagnostic goals. In this case, she is working with a scientist focused on SCA patients to find a way to diagnose previously unsolvable cases.
So far, existing approaches have included standard linkage analysis, SNP arrays to look for some known structural variants, exome sequencing, and gene expression analysis. Now, van Diemen hopes that adding structural variant detection with SMRT Sequencing will provide some new answers. Repeat expansions are among the possible culprits. “Repeat genes have been identified in a lot of ataxias,” van Diemen says. With SMRT Sequencing, it will finally be possible “to do this genome-wide approach for new repeat genes.”
Structural variation is another potential source of causal mechanisms for the unexplained SCA cases. “There is some evidence that structural variants may play a role in ataxias,” van Diemen says. But SNP arrays lack the ability to discover new variants or to detect complex situations, such as inversions. And short-read sequencing often misses these large elements. “With long-read sequencing, it’s easier to identify them,” she adds.
Ultimately, the goal is to give all SCA patients the DNA-based information that will help them manage their condition. “There are some differences in the phenotypic spectrum, so knowing the genetic basis can help patients understand what they will face in the future and also makes it possible to consider genetic testing for family counseling,” van Diemen says. “That’s the clinical importance of having a genetic diagnosis.”
This SMRT Grant represents van Diemen and her team’s first use of PacBio sequencing. She believes it will be “a good starting point” that will help them understand how to apply long-read sequencing for larger-scale studies in the future. “We are looking forward to it,” she says. “It’s a great opportunity.”
We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the Center for Genomic Research at the University of Liverpool, for supporting the 2019 Neuroscience SMRT Grant Program.
Learn more about upcoming SMRT Grant Programs for a chance to win free sequencing.
A new preprint from lead authors David Porubsky and Peter Ebert, senior authors Evan Eichler and Tobias Marschall (@tobiasmarschal), and collaborators reports a method for generating fully phased, de novo human genome assemblies without parental data. The approach combines PacBio HiFi reads (>99% accuracy, 10-20 kb) with the short-read, single-cell Strand-seq technique. The authors provide a proof-of-principle through assembling the genome of a Puerto Rican female from the 1000 Genomes Project.
The work extends a recent publication from many of the same authors in which HiFi reads were used to produce an accurate and contiguous assembly of the human haploid genome, CHM13. To help assemble a phased diploid genome, the newer work adds Strand-seq, “a single-cell sequencing method able to preserve structural contiguity of individual homologs in every single cell.” The authors used Strand-seq to group HiFi reads by chromosome, order and orient contigs, and phase variants over long genomic distances. “Taken together, these features make Strand-seq the method of choice to be combined with high-accuracy long-read sequencing platforms to physically phase and assemble diploid genomes.”
The team generated 33.4-fold HiFi read coverage of the selected sample using the Sequel II System. They called single nucleotide variants in the HiFi reads with DeepVariant and phased variants using Strand-seq and HiFi reads. That “resulted in chromosome-length haplotypes with >95% … of all these heterozygous variants placed into a single haplotype block,” the scientists report. “With such global and complete haplotypes we assigned ~81% of the original PacBio HiFi reads to either parental haplotype 1 (H1) or haplotype 2 (H2).”
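Assigning reads to haplotypes from phased variants can be pictured as a simple vote, sketched below. This is a minimal illustration under strong assumptions (ungapped read alignments, SNVs only, hypothetical function and parameter names), not the procedure used in the preprint.

```python
def assign_haplotype(read_start, read_seq, phased_snvs, min_margin=1):
    """Vote on a read's haplotype using phased heterozygous SNVs.

    phased_snvs maps a reference position to (H1 allele, H2 allele).
    Assumes the read aligns without gaps starting at read_start.
    """
    h1 = h2 = 0
    for pos, (allele_h1, allele_h2) in phased_snvs.items():
        offset = pos - read_start
        if 0 <= offset < len(read_seq):
            base = read_seq[offset]
            if base == allele_h1:
                h1 += 1
            elif base == allele_h2:
                h2 += 1
    if h1 - h2 >= min_margin:
        return "H1"
    if h2 - h1 >= min_margin:
        return "H2"
    return "unassigned"  # no informative variant covered, or a tie


# Toy haplotypes differing at reference positions 10 and 22.
hap1 = "ACGTTGCAGGCTAATCCGTAGACTTCAGAT"
hap2 = hap1[:10] + "T" + hap1[11:22] + "G" + hap1[23:]
phased_snvs = {10: ("C", "T"), 22: ("C", "G")}

tag_a = assign_haplotype(5, hap1[5:28], phased_snvs)    # covers both SNVs
tag_b = assign_haplotype(8, hap2[8:30], phased_snvs)    # covers both SNVs
tag_c = assign_haplotype(12, hap1[12:20], phased_snvs)  # covers neither SNV
print(tag_a, tag_b, tag_c)
```

Reads that cover no heterozygous site cannot be placed, which is one reason some fraction of reads stays unassigned in any scheme of this kind.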
The team then used two tools, Canu and Peregrine, to assemble the haplotype-separated reads. A small number of chimeric contigs were corrected with Strand-seq data and the SaaRclust algorithm. The final contig N50s of the fully phased assemblies were 25.8 Mb and 28.9 Mb for each haplotype. Assemblies were found to be highly accurate, with base-pair quality scores higher than QV40; nearly all gene-disrupting indels in the sequence were found to be true biological events, not assembly artifacts. By titrating HiFi read coverage, the authors found that around 15-fold coverage of each haplotype is sufficient to produce an accurate, contiguous assembly.
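For readers unfamiliar with the metric, contig N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation, with an invented list of contig lengths and a hypothetical function name:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly: contig lengths in Mb (made-up values).
example = n50([50, 30, 20, 10, 5])
print(example)
```

Here the two largest contigs (50 and 30) together cover 80 of 115 bases, so the N50 is 30.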
“Our assembly strategies allow us to transition from ‘collapsed’ human assemblies of ~3 Gbp to fully phased assemblies of ~6 Gbp where all genetic variants, including [structural variants], are fully phased at the haplotype level,” the scientists report. In addition to the importance of using this method for assembling individual genomes, the authors note, “Fully phased, reference-free genomes are also the first step in constructing comprehensive human pangenome references that aim to reflect the full range of human genome variation.”
With the release of the award-winning Sequel II System, 2019 was an exciting year for the SMRT Sequencing community. We were inspired by our users’ significant contributions to science across a wide range of disciplines. As the year draws to a close, we have taken this opportunity to reflect on the many achievements made by members of our community, from newly sequenced plant and animal species to human disease breakthroughs.
“It has been another phenomenal year for science. The introduction of the Sequel II System will accelerate discovery even more, and I can’t wait to see what 2020 will hold.”
Jonas Korlach, Chief Scientific Officer
Human Biomedical Research
The year brought incredible insights into human genetics. Some researchers homed in on single mutations, while others zoomed out to explore variation on a population scale. PacBio technology was also selected for new large-scale sequencing projects, including the NHGRI Human Genome Reference Program and the All of Us program. Here are some of our favorite publications from the year:
- The mystery cause of progressive myoclonic epilepsy in a family that eluded detection in standard whole-exome sequencing was revealed with PacBio whole genome sequencing, as reported in Journal of Human Genetics and on our blog.
- New insights into specific human populations were revealed in several studies, including Melanesians, as reported in Science, and Tibetans, as reported in National Science Review.
- Double mutations in the PIK3CA oncogene were found to influence targeted therapy, as highlighted in Science and our blog.
- The importance of comprehensive variant detection was featured in several papers. University of Washington researchers Mitchell R. Vollger and Evan Eichler reported that “HiFi may be the most effective standalone technology for de novo assembly of human genomes” in their Annals of Human Genetics paper (read our blog), while members of the Human Genome Structural Variation Consortium reported “the most comprehensive assessment of SVs in human genomes to date” in Nature Communications. University of Michigan researchers Steve S. Ho and Ryan E. Mills shared their review entitled “Structural variation in the sequencing era.”
- A PLoS One paper by Mayo Clinic researchers demonstrated the use of No-Amp targeted sequencing to interrogate the sequence structure of expanded repeats in Fuchs Endothelial Corneal Dystrophy.
- The utility of the PacBio Iso-Seq method for studying disease risk genes was showcased in a Frontiers in Genetics paper by PacBio and Duke University researchers studying transcripts across synucleinopathies.
Plant & Animal Sciences
Commoner’s law of ecology states that “everything is connected to everything else,” and this was highlighted in several studies that showed the interdependence of microbes, plants, insects, and other animals. International consortia such as the Vertebrate Genomes Project, the Earth Biogenome Project, and the Sanger Institute’s 25 Genomes Project released many new reference genomes, which will only bolster our understanding of individual species as well as their interactions with their ecosystem cohabitants. Here are some of our favorite publications from the year:
- Korean scientists provided a great example of mutualistic interactions in their Nature Communications paper examining the relationships between Streptomyces bacteria, strawberry plants, and pollinating bees.
- A USDA project to sequence the spotted lanternfly showcased the power of SMRT Sequencing to rapidly generate high-quality genomes from the DNA of single insects to fight invasive species.
- The latest Nature publication from the Cantu Lab delved into a largely unexplored feature of plant genomes — structural variants — in a study of the population genetics in grapevine domestication.
- Pathologists interested in uncovering the secrets of plant immunity used PacBio targeted sequencing to create inventories of NLR genes, which are candidates for engineering new pathogen resistance (read our blog).
- For shrimp, which have notoriously hard genomes to sequence, an isoform-level transcriptome reference generated with the Iso-Seq method was reported in Fish and Shellfish Immunology and summarized in our blog.
Microbiology & Infectious Disease
From C. difficile to symbiotic defense systems, we were treated to new insights in the realm of microbiology. We also learned about a new way to use an old method to provide unprecedented taxonomic resolution at species and strain level and gained insight into intra-bacterial defense. Here are some of our favorite publications from the year:
- Not only has the Mount Sinai Pathogen Surveillance Program adopted SMRT Sequencing for continuous monitoring and disease control, the accumulated PacBio data has also inspired new research, including a paper published in Nature Microbiology on the discovery of a conserved orphan methyltransferase that drives C. difficile infection persistence (read our blog).
- A team of researchers at the Jackson Laboratory published a study in Nature Communications, also featured on our blog, that used HiFi sequencing to unlock the full potential of 16S rRNA sequencing and provide taxonomic resolution of the human gut microbiome at species and strain level.
- PacBio reference genomes enabled a groundbreaking study published in Nature of intra-bacterial defense genes in the human gut microbiome by researchers at the University of Washington.
- As published in Science, long reads were also used to reconstruct a tripartite symbiotic factory for a marine toxin, involving bacteria, algae and a sea slug.
Did we miss one of your favorite publications of 2019? Tweet your favorites to us @PacBio, using #PoweredbyPacBio. And check out our searchable publications database for more than 1300 examples of outstanding SMRT Science from 2019.
Neurexin genes, which have been associated with certain neuropsychiatric disorders, are known to make heavy use of alternative splicing. In a recent study, scientists used the Iso-Seq method with SMRT Sequencing to better understand splice variants in neurons derived from human induced pluripotent stem cells (hiPSCs).
The study, “Neuronal impact of patient-specific aberrant NRXN1α splicing,” was published in Nature Genetics. Lead authors Erin Flaherty (@erinkflaherty) and Shijia Zhu, senior author Kristen Brennand (@kristenbrennand), and collaborators at the Icahn School of Medicine at Mount Sinai and other institutions undertook the project to help shed light on disorders linked to exonic deletions in the neurexin-1 gene, including schizophrenia.
“Deletions occur non-recurrently (with different boundaries) between patients, and the mechanisms underlying variable penetrance and diverse clinical presentations remain unknown,” the scientists write, adding that mouse models have been of limited value in elucidating this biology. “To better understand the clinical impact of NRXN1+/− mutations, it is critical to evaluate how distinct patient-specific deletions alter the NRXN1 isoform repertoire and impact synaptic function in a human context.”
Central to the study are samples from four patients with severe psychosis disorders who carry rare heterozygous intragenic deletions in NRXN1. Two patients carried a 136 kb deletion in the 3’ region of NRXN1 (3’-NRXN1+/-), while the other two shared a 115 kb deletion in the 5’ region (5’-NRXN1+/-).
To understand how the deletions affect splicing, the authors incorporated long-read and short-read sequencing of the NRXN1 gene on hiPSC-derived cell types. In hiPSC and other human samples, the team identified more than 120 human NRXN1α isoforms that are predicted to be translated. A comparison showed that “hiPSC-neurons modeled well the NRXN1α alternative splicing diversity found in vivo, particularly the high-abundance isoforms,” the authors report.
They showed that patient-derived NRXN1+/- hiPSC-neurons have a >2-fold reduction in wild type NRXN1α isoforms and an increase in novel isoforms from the mutant allele. “Across the two 3’-NRXN1+/- cases, we observed reduced abundance of 50% of the wild-type isoforms,” they note. The authors add that they “further detected 31 mutant NRXN1α isoforms unique to the 3’-NRXN1+/- hiPSC-neurons that resulted from splicing across the three deleted exons not found in controls.”
This alteration of isoform expression may be affecting neuronal maturation and activity. Compared to control, the 5’-NRXN1+/- and 3’-NRXN1+/- hiPSC-neurons had fewer mature neurons and decreased neuronal activity. Interestingly, over-expression of certain wild-type isoforms increased neuronal activity in the 5’-NRXN1+/- hiPSC-neurons, while over-expression of other mutant NRXN1α isoforms decreased neuronal activity in control hiPSC-neurons. “Our data supports a model whereby functional deficits in 5’-NRXN1+/- neurons arise from NRXN1 haploinsufficiency and can therefore be rescued by overexpression of wild-type NRXN1α isoforms,” the authors write, “but unexpectedly, haploinsufficiency in 3’-NRXN1+/- neurons is exacerbated by novel dominant-negative activity of mutant splice isoforms, and so cannot be rescued by simply increasing wild-type NRXN1α levels.”
“Our report links patient-specific, heterozygous intragenic deletions in NRXN1 to isoform dysregulation and impaired neuronal maturation and activity in a human and disease-relevant context,” the authors note. “Mutant NRXN1α isoforms may be particularly biologically relevant as our experimental data demonstrated that overexpression of even a single mutant isoform was sufficient to perturb neuronal activity in control neurons.”
Ultimately, the scientists believe their findings from this project could have significant benefits for understanding and potentially treating schizophrenia. “Evaluating how loss and/or gain of specific NRXN1 isoforms impact neuronal fate, maturation and function in a cell-type-specific and activity-dependent manner represents a critical first step towards a more genetics-based form of precision medicine,” they conclude. “Understanding how NRXN1+/− deletions perturb the splice repertoire and alter neuronal function could ultimately improve genetic diagnosis, prognosis and/or lead to new therapeutic targets.”
Bat lovers and animal researchers have been waiting for insights into the evolution and remarkable genetic adaptations of our winged mammalian friends, ever since the global Bat1K initiative announced its quest to decode the genomes of all 1,300 species of bats using SMRT Sequencing and other technologies.
Now, the first six reference-quality genomes have been released on the Hiller Lab Genome Browser, and described in a pre-print by Sonja Vernes (@Sonja_Vernes), Michael Hiller (@hillermich) and Gene Myers (@TheGeneMyers) of the Max Planck Institute, Emma Teeling (@EmmaTeeling1) of University College Dublin, and 26 others.
What did the researchers find? Enough to excite evolutionary biologists, immunologists, and bat enthusiasts alike.
Bat Evolution – New Insights
The phylogeny of Laurasiatheria and, in particular, the position of bats, has been a long-standing, unresolved evolutionary question. Phylogenetic analyses of 12,931 protein coding-genes and 10,857 conserved non-coding elements identified across 48 mammalian genomes helped to resolve bats’ closest extant relatives within Laurasiatheria, supporting a basal position for bats within the clade Scrotifera.
Bats are suspected reservoirs for some of the deadliest viral diseases, including Ebola, SARS (severe acute respiratory syndrome), rabies, and MERS (Middle East respiratory syndrome). But they appear to remain asymptomatic and survive these infections. Figuring out why could increase our understanding of immune function and help prevent viral spillovers into humans.
A screen of the six new genomes — Greater horseshoe bat (Rhinolophus ferrumequinum); Egyptian rousette (Rousettus aegyptiacus); Pale spear-nosed bat (Phyllostomus discolor); Velvety free-tailed bat (Molossus molossus); Kuhl’s pipistrelle (Pipistrellus kuhlii); and Greater mouse-eared bat (Myotis myotis) — revealed selection on immunity-related genes which may underlie bats’ unique tolerance of pathogens, as well as several inactivated genes and expansion of APOBEC3 genes, which produce DNA and RNA editing enzymes with roles in lipoprotein regulation and somatic hypermutation.
“Together, genome-wide screens for gene loss and positive selection revealed several genes involved in NF-kB signalling, suggesting that altered NF-kB signalling may contribute to immune related adaptations in bats,” they wrote.
In order to further understand the bats’ viral responses, the researchers screened the genomes to ascertain the number and diversity of endogenous viral elements, considered as ‘molecular fossil’ evidence of ancient infections.
They found a surprising diversity of endogenous retroviruses, with some sequences never previously recorded in mammalian genomes, confirming interactions between bats and complex retroviruses, which endogenize exceptionally rarely.
“These integrations…can help us better predict potential zoonotic spillover events and direct routine viral monitoring in key species and populations,” the authors wrote.
Bats also exhibit extraordinary longevity—they can live up to 10 times longer than expected given their small body size and high metabolic rate. Only 19 mammalian species are known to live proportionately longer than humans given their body size, and 18 of these are bats.
“Bats show few signs of senescence and low to negligible rates of cancer, suggesting they have also evolved unique mechanisms to extend their health spans, rendering them excellent models to study extended mammalian longevity and ageing,” the team writes.
Within the six new genomes, the team found a loss of the oncogenic miR-374 gene, which promotes tumour progression and metastasis in diverse human cancers, and selection for PURB, a gene that plays a role in cell proliferation and regulates the oncogene MYC, and exhibits a unique anti-ageing transcriptomic profile in long-lived Myotis bats.
Studying the genetics of echolocation, vocal learning, and sensory perception in bats could shed light on human blindness, deafness, and speech disorders.
The Bat1K group found two genes, LRP2 (also called megalin) and SERPINB6, that are expressed in the cochlea, are associated with human disorders involving deafness, and seemed to be linked to echolocation. There were bat-specific substitutions in both genes; echolocating bats showed a specific asparagine to methionine substitution in LRP2, whereas Rousettus, which echolocates without using the larynx, had a threonine instead.
For sequencing and bioinformatics buffs, the pre-print also features detailed descriptions of new sequencing pipelines and assembly techniques, including a novel TOGA bioinformatics pipeline.
For each of the six bats, they generated PacBio long reads, 10x Genomics Illumina read clouds, Bionano optical maps, and Hi-C Illumina read pairs. The resulting contigs are ≥355 times more contiguous than the recent Miniopterus assembly generated from short read data, and ≥7 times more contiguous than a previous Rousettus assembly generated from a hybrid of short and long read data.
“This gene annotation completeness of our bats is higher than the Ensembl gene annotations of dog, cat, horse, cow and pig, and is only surpassed by the gene annotations of human and mouse, which have received extensive manual curation of gene models,” they noted.
They annotated between 19,122 and 21,303 coding genes using the PacBio Iso-Seq method of RNA sequencing. They also annotated non-coding RNAs and microRNAs, which can serve as developmental and evolutionary drivers of change, and identified important differences in ncRNAs between bats and other mammals.
This revealed extensive loss of ancestral miRNAs, gains of novel functional miRNA and a striking case of miRNA seed change that alters target specificity, pointing to a possible evolution of regulatory roles in cancer, development, and behaviour in bats.
“This is the first laboratory validation of novel bat microRNA function and highlights how Bat1K genome assemblies can enable the discovery of both non-coding and coding adaptations,” they wrote.
Learn more about the charter and progress of the Bat1K Project in this seminar featuring consortium director Sonja Vernes.
How do pernicious pathogens like Clostridioides difficile spread through hospitals and persist so tenaciously in the human gut, leading to about half a million infections and 30,000 deaths each year in the United States?
It’s a mystery scientists have been anxious to solve, and they’ve invested countless hours of research into the bacterium’s physiology, genetics and genomic evolution.
A team from Mount Sinai School of Medicine in New York City has uncovered an important new clue by studying an overlooked aspect of C. difficile’s biology: Epigenetics.
Using PacBio SMRT Sequencing and comparative epigenomics, Pedro H. Oliveira (@pholive81), Gang Fang (@iamfanggang), and colleagues mapped and characterized the DNA methylomes of 36 human C. difficile isolates.
As described in a recent Nature Microbiology paper, they observed substantial epigenomic diversity across C. difficile isolates, but one methyltransferase (MTase) was highly conserved across all of the isolates (and, they later discovered, in another ~300 published C. difficile genomes). This MTase, which they dubbed camA, targets a single methylation motif: CAAAAA, with the last adenine methylated at the N6 position (6mA).
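To make the motif concrete, here is a minimal Python sketch that scans a DNA string for CAAAAA on both strands and reports where the final adenine, the base that carries the 6mA mark, would sit. The function names and the toy sequence are illustrative assumptions and do not come from the paper. Scanning only finds candidate sites; whether a site is actually methylated is determined from SMRT Sequencing data.

```python
import re

MOTIF = "CAAAAA"  # motif described in the text; the last A is the methylated base

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif_sites(seq: str, motif: str = MOTIF):
    """List (strand, position) pairs for the candidate methylated adenine.

    Positions are 0-based coordinates on the forward strand.
    """
    seq = seq.upper()
    sites = []
    # Forward strand: a lookahead pattern also catches overlapping matches.
    for m in re.finditer(f"(?={motif})", seq):
        sites.append(("+", m.start() + len(motif) - 1))
    # Reverse strand: the motif shows up as its reverse complement (TTTTTG)
    # on the forward strand, and the target A pairs with the first T.
    for m in re.finditer(f"(?={reverse_complement(motif)})", seq):
        sites.append(("-", m.start()))
    return sorted(sites, key=lambda site: site[1])


# Hypothetical toy sequence with one site on each strand.
print(find_motif_sites("GGCAAAAATTTTTTGCC"))  # [('+', 7), ('-', 9)]
```

Reporting both strands matters because the motif is not palindromic, so a forward-only scan would miss roughly half of the candidate sites.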
“Despite the small sample size, I got excited wondering if this methylation pattern might be conserved in this critical pathogen and play important roles in regulating its physiology,” Fang wrote in a Behind The Paper feature.
That left the question, how does it work? The Mount Sinai team reached out to other experts in the field, Aimee Shen at Tufts University and Rita Tamayo at the University of North Carolina, to do some in vitro and in vivo studies.
They found that inactivation of the gene encoding this MTase compromises spore formation, a key step in both the transmission of C. difficile and its ability to persist in the intestinal tract.
“Further experimental and integrative transcriptomic analysis suggested that epigenetic regulation by DNA methylation also modulates the cell length, host colonization and biofilm formation of C. difficile,” the authors wrote.
The discovery could have a direct translational impact. The fact that camA is conserved across all of the C. difficile genomes but is present in just a few Clostridiales makes it a promising, highly specific drug target. Furthermore, as the MTase does not seem to impact the general fitness of C. difficile, a drug that specifically targets it might also carry a lower risk of resistance.
“These findings provide a unique epigenetic dimension to characterize medically relevant biological processes in this important pathogen,” the authors concluded.
The authors noted that such high-resolution mapping of bacterial DNA-methylation events has only recently become possible with the advent of PacBio’s single molecule, real-time sequencing.
“This technique enabled the characterization of the first bacterial methylomes and, since then, more than 2,200 (as of September 2019) have been mapped, heralding a new era of bacterial epigenomics,” they added.
Learn more about the methods and workflow for direct detection of epigenetics using PacBio sequencing.
It’s time to revisit the way scientists are using 16S rRNA gene sequencing to study microorganisms, according to a team of Jackson Laboratory researchers.
Popular targets for taxonomy and phylogeny studies because of their highly conserved nature, amplified sequences of the 16S ribosomal RNA genes can be compared with reference databases to determine the identity of the microorganisms that comprise a metagenomic sample. Sequences with a > 95% match are generally considered to represent the same genus, for example, while > 97% matches are considered the same species.
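As a rough illustration of how these cut-offs are applied, the Python sketch below computes percent identity between two sequences and labels the match using the >95% (genus) and >97% (species) thresholds quoted above. It assumes the two sequences are already aligned and of equal length; the function names and example sequences are hypothetical, and real pipelines first align each read against a reference database.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty, aligned, and equal in length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def classify_match(identity: float) -> str:
    """Apply the conventional 16S thresholds described in the text."""
    if identity > 97.0:
        return "same species"
    if identity > 95.0:
        return "same genus"
    return "no genus-level match"


# Hypothetical 100-nt reference and a query with 4 mismatches (96% identity).
reference = "ACGT" * 25
query = "TTTT" + reference[4:]  # positions 0, 1 and 2 differ; position 3 matches
query = query[:50] + "C" + query[51:]  # one more mismatch at position 50 (was G)

print(percent_identity(reference, query))  # 96.0
print(classify_match(percent_identity(reference, query)))  # same genus
```

Because the thresholds are percentages, a short sub-region leaves very little room: over a ~250 bp region, a 97% cut-off tolerates only about 7 mismatches, whereas the full ~1500 bp gene spreads the comparison over many more informative positions.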
However, these matches are often made by sequencing only part of the ~1500 bp 16S gene and its nine variable regions: either single regions like V4 or V6, or multi-region spans like V1–V3 or V3–V5, as done in the Human Microbiome Project. In a paper published recently in Nature Communications, Jethro S. Johnson, George M. Weinstock and colleagues argue that it is time to revisit this compromise, which arose only because of past technological limitations. Given recent advances in long-read sequencing accuracy, the entire 16S gene should now be interrogated, the authors suggest.
Circular consensus sequencing (CCS, the method used in PacBio HiFi Sequencing), in particular, combined with sophisticated denoising algorithms, means it is now possible to sequence the entire gene with sufficient accuracy to discriminate among millions of sequence reads that differ by as little as one nucleotide, they write.
“Together, these technological and methodological advances mean that for the first time, it is becoming possible to exploit the full discriminatory potential of 16S in a high-throughput manner,” the authors write.
Using an in-silico dataset of 16S sequences taken from the Human Microbiome Project database, the researchers demonstrated that commonly targeted sub-regions were unable to recapitulate the taxonomic information present in the full 16S gene.
“The V4 region performed worst, with 56% of in-silico amplicons failing to confidently match their sequence of origin at this taxonomic level,” they wrote.
“Our simple in-silico experiment demonstrates that it is not valid to assume that ever finer clustering of these sub-regions will result in the improved taxonomic resolution necessary to reflect species.”
They also found that different sub-regions showed bias in the bacterial taxa they were able to identify at the species level. For example, while V1-V3 gave good results for Escherichia and Shigella, good results for Klebsiella required the V3-V5 region, whereas Clostridium and Staphylococcus required V6-V9 sequencing. Since all of these genera may be present in the human gut, the only way to ensure good taxonomic identification of all species is to sequence the full gene from V1-V9.
However, the team points out that it may be possible to obtain even better taxonomic resolution, down to the strain level. Bacteria carry between 1 and 15 copies of the 16S gene. While the number of copies is consistent within a species, the intragenomic variation among the copies is strain specific. The Jackson Lab team believes this intragenomic variation presents an opportunity.
For example, sufficient nucleotide variation exists to distinguish E. coli strain K-12 MG1655 from the infection-causing O157 Sakai strain. The team provides proof of concept evidence for this approach with full-length PacBio 16S sequencing data from 381 isolates selected from the Human Microbiome Project sample bank. They show that the vast majority of these bacteria can be uniquely assigned to a specific strain using the intragenomic 16S variation revealed by PacBio 16S HiFi data.
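The idea can be sketched in a few lines of Python: if each strain carries a characteristic set of 16S copy variants, the set of distinct full-length sequences recovered from an isolate can be matched against known strain profiles. This is a toy illustration, not the authors' method. The strain names and 8-nt sequences are hypothetical stand-ins for ~1500 bp genes, and the reads are assumed to be error-free after denoising.

```python
def assign_strain(observed_reads, strain_profiles):
    """Return the one strain whose set of 16S copy variants equals the
    observed set, or None when there is no unique match."""
    observed = frozenset(observed_reads)
    matches = [
        name
        for name, variants in strain_profiles.items()
        if frozenset(variants) == observed
    ]
    return matches[0] if len(matches) == 1 else None


# Hypothetical profiles: both strains share one copy variant, so a single
# sequence cannot tell them apart, but the full set of variants can.
strain_profiles = {
    "strain_A": ["ACGTACGT", "ACGTACGA"],
    "strain_B": ["ACGTACGT", "ACGAACGT"],
}
isolate_reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]

print(assign_strain(isolate_reads, strain_profiles))  # strain_A
print(assign_strain(["ACGTACGT"], strain_profiles))   # None
```

The second call shows why single-nucleotide accuracy across the whole gene matters: seeing only the shared variant leaves the strain unresolved.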
“Thus, we argue that, when appropriately accounted for, multiple polymorphic 16S copies are not an inconvenience to be overlooked, rather they will enable the 16S gene to be used in strain-level microbiome analysis,” they add.
“Analysis of microbial communities at these taxonomic levels promises to provide a very different perspective to the one afforded by genus-level abundance estimates.”
Learn more about the methods and workflow for PacBio full-length 16S sequencing.
Two recent review articles discuss the idea that structural variants (SVs) — genetic differences that involve at least 50 base pairs — are numerous, important to human biology, and best detected with long reads. The authors review years of studies that have applied PacBio SMRT Sequencing to identify around 20,000 SVs per human genome. The reviews also report on cases in which SMRT Sequencing has helped scientists discover pathogenic variants that explain diseases for which there had previously been no clear genetic cause.
In Nature Reviews Genetics, Steve Ho, Alexander Urban, and Ryan Mills from the University of Michigan and Stanford University consider the algorithms and detection platforms that have enabled a wave of new discovery related to SVs. SVs cannot be reliably detected using short reads since many of the variants are significantly longer than those reads. “Because of this, the degree to which contemporary genomics has studied SNVs compared with SVs is significantly skewed,” Ho et al. write. “A recent analysis found that PacBio long reads were approximately three times more sensitive than a short-read ensemble maximized for sensitivity, implying that a large subset of SVs, many 50–2,000 bp in length, are unresolvable without long reads.”
Ho et al. also discuss the software tools that are useful for calling SVs in long reads, including Sniffles and pbsv. They summarize important projects that have used SMRT Sequencing to look for these variants — including Euan Ashley’s publication on Carney complex and Naomichi Matsumoto’s report on a large deletion that causes epilepsy.
The other review appears in Genome Biology, contributed by lead authors Medhat Mahmoud and Nastassia Gobet, senior author Fritz Sedlazeck, and collaborators at the University of Lausanne, Baylor College of Medicine, and other institutions. “Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution,” the authors state. “SVs are increasingly being recognized as an important class of variants, which need to be considered in evolutionary, population, and clinical genomics.”
The team reviews the value of long reads for finding SVs, noting that they “are advantageous for SV calling because they can span repetitive or other problematic regions.” The scientists also walk through the pros and cons of various alignment and SV-calling tools developed for long reads, including NGMLR, minimap2, Sniffles, and pbsv.
SMRT Sequencing provides high precision and recall for SVs in a human genome with just one SMRT Cell on the Sequel II System. With two samples multiplexed per SMRT Cell 8M, the approximate reagent cost of detecting structural variants is $670 per sample.
Every year since 2008, The Scientist has canvassed the life-science community to find out which newly released products are having the biggest impact on research. We were proud to have the Sequel System selected as one of the Top 10 Innovations of 2016. And now we’ve been honored again, with the Sequel II System making the Top 10 Innovations of 2019 list.
“Our goal is to identify those products and services that are poised to revolutionize research and advance scientific knowledge,” The Scientist editors wrote.
As part of the competition, a carefully selected panel of expert, independent judges was asked to rank the tools, techniques, methodologies, software, and products according to their potential to foster rapid advances or address specific problems in their respective fields.
The Sequel II System was chosen for its ability to generate longer reads with greater accuracy, at greater throughput, at a significantly lower cost.
“PacBio sets the standard for long-read sequencing and this upgrade of their instrument should have high impact on genomics sciences,” said judge H. Steven Wiley, Senior Research Scientist and Laboratory Fellow at Pacific Northwest National Laboratory.
The launch of the Sequel II System in April represented a significant improvement of our long-read sequencing technology. It contains updated hardware to process the new SMRT Cell 8M, which provides ~8x DNA sequencing data output, as well as reduced project costs and timelines compared to the prior version of the system. The Sequel II System also delivers highly accurate individual long reads (HiFi reads), which provide Sanger-quality accuracy (>99.9%).
As part of the early access program, customer Evan Eichler declared it the ‘most effective stand-alone technology for de novo assembly,’ and Shawn Levy, a geneticist at the nonprofit HudsonAlpha Institute for Biotechnology, noted that the Sequel II System is also good for analyzing highly repetitive or homologous regions of the genome. Long reads allow researchers to identify structural variants – including translocations in the genome, copy number variants, insertions and deletions – which are complementary to information generated using short read-based approaches.
Our customers have eagerly embraced the new technology and they are enjoying the improved throughput. Over 75 Sequel II Systems have been installed worldwide and they are averaging 160 Gb per SMRT Cell – a ~10-fold increase in yield over the previous Sequel System. The total amount of data generated on the Sequel II Systems this year has already surpassed the data generated on all installed Sequel Systems over the past four years.
The Sequel II System also recently received a Life Science Industry Award: the Gold Award for the Most Innovative New Product in Genomics. Since 2002, the Life Science Industry Awards have recognized manufacturers of the “tools of science” that help advance biological research and drug discovery.
We’d like to extend a sincere thanks to everyone who attended our two-day North America User Group Meeting, held this year at our Certified Service Provider, the University of Delaware Sequencing and Genotyping Center (@UD_DNAcore). With representation from 80+ organizations and over 160 attendees, the event was a great environment for sharing best practices and networking with the SMRT Sequencing community. Also, a big thanks to our host, Bruce Kingham (@bkingham) and team, as well as our partners: Agilent, Biosoft Integrators, Circulomics, Covaris, Diagenode, Perkin Elmer, Sage Science and Shoreline Biome. If you weren’t able to attend the meeting, we’ve summed up the highlights below and you can download several of the presentations and view the recordings.
Our CSO Jonas Korlach kicked off the meeting by describing the latest releases and performance metrics for the Sequel II System. The longest reads being generated on this system with the SMRT Cell 8M go beyond 175,000 bases, while maintaining extremely high consensus accuracy. HiFi mode, for example, uses circular consensus sequencing to achieve single-molecule accuracy of Q40 or even Q50. He also talked about the new, user-friendly no-amp protocol and the recent update for low-input samples. Eventually, the goal is to reduce sample requirements to as little as 1 nanogram, he noted; this protocol is currently in development at PacBio.
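For readers less familiar with Q scores, the short Python sketch below converts between Phred quality values and accuracy using the standard definition Q = -10 · log10(P_error). The function names are our own; only the Phred formula itself is standard.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (strictly below 1) to a Phred quality score."""
    return -10 * math.log10(1.0 - accuracy)


for q in (30, 40, 50):
    print(f"Q{q}: {phred_to_accuracy(q) * 100:.3f}% accurate")
# Q30: 99.900% accurate
# Q40: 99.990% accurate
# Q50: 99.999% accurate
```

In other words, Q40 corresponds to roughly one expected error per 10,000 bases and Q50 to one per 100,000.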
HiFi and Microbial Sequencing
HiFi sequencing was a central theme for a few talks. Mitchell Vollger (@mrvollger) from the University of Washington spoke about using this approach to study segmental duplications in the human genome. The technique significantly reduced the complexity of accurately mapping these nearly identical sequences throughout the genome; it also reduced the amount of compute power needed compared to a previous PacBio assembly using continuous long reads instead of HiFi reads. Despite generating less data with the HiFi assembly, the team still resolved 30% more segmental duplications with the new approach.
PacBio scientist Meredith Ashby (@AshbyMere) also spoke about the benefits of HiFi reads, focusing her talk on metagenomics and microbiome characterization. She presented several examples of analysis — from full-length 16S sequencing to shotgun sequencing — showing how SMRT Sequencing enables accurate representation of these complex communities, in some cases even without fully assembling genomes. New updates will provide users with a dedicated microbial assembly pipeline, optimized for all classes of bacteria, as well as increased multiplexing on the Sequel II System, now with 48 validated barcoded adapters. That throughput could reduce the cost of microbial analysis to just $70 per sample, she noted.
While we’re on the subject of microbial sequencing, two lightning talks offered nice illustrations of this as well. Masako Nakanishi from the University of Connecticut Health Center presented a study of how the gut microbiome alters an organism’s susceptibility to colonic ulceration; next, she plans to examine cause and effect by evaluating results of fecal transplants in mice. Shawn Polson from the University of Delaware spoke about viral metagenomes, which are more challenging to distinguish than their bacterial counterparts because viruses have no 16S equivalent. With SMRT Sequencing, his team has generated higher-resolution data about viral genomes and aims to use this information as a guide to how these genomes function.
We also saw great presentations with varied applications of Iso-Seq data. Shawn Trojahn (@trojahn_shawn) from Washington State University presented results from transcriptome sequencing of grizzly bears. The analysis focused on differential gene expression during hibernation and active cycles, potentially offering human-relevant information about muscle atrophy and insulin resistance. Thanks to SMRT Sequencing, the team was able to identify more unique isoforms just from liver tissue than had been previously characterized in the entire reference genome. Of particular interest: more than 2,000 transcripts differentially expressed between hibernation and active season, including 86 genes that have isoforms expressed in opposite directions.
From the University of Wisconsin-Madison, Nic Wheeler (@wheeler_worm) spoke about RNA sequencing for filarial nematodes associated with understudied tropical diseases. His team used Iso-Seq analysis to improve gene models and achieve better transcriptome coverage for these worms, which typically have poorly annotated and fragmented genome assemblies. While getting enough RNA to study is a technical challenge, the group still managed to generate full-length isoforms, many of which were novel or contained novel junctions.
Ana Conesa (@anaconesa) from the University of Florida spoke about Iso-Seq analysis tools developed by her group, which created the popular SQANTI tools for Iso-Seq data QC. They’re also working on IsoAnnot to perform functional annotation at isoform resolution; validation has already been done on various species. Currently it’s a set of scripts, but her team is working to produce a more user-friendly version. Finally, tappAS is for functional diversity analysis and for prioritizing genes for validation.
We also had two Iso-Seq-focused lightning talks. Vince Magrini from Nationwide Children’s Hospital spoke about using Iso-Seq analysis as part of a comprehensive profiling strategy for pediatric cancer research; in one example, full-length isoform sequencing provided a clear view of a challenging mutation associated with a drug-targetable pathway. Alexandra Pike (@amimspike) at MIT presented a study of TIN2, a telomere-binding protein, which is mutated in some short telomere syndromes. By pairing the Iso-Seq method with CRISPR, her team revealed a previously uncharacterized TIN2 isoform that may have a functional difference for individuals with these syndromes.
Finally, PacBio scientist Kristin Mars spoke about recent updates, such as the single-day library prep that’s now possible with the Iso-Seq Express workflow. She also noted that one SMRT Cell 8M is sufficient for Iso-Seq experiments; that means whole transcriptome sequencing is feasible for $1,300 per sample, while multiplexed, targeted Iso-Seq analyses can cost as little as $185 per tissue.
Two speakers offered attendees a clear view into potential future clinical use of SMRT Sequencing technology by showing how it’s performing in clinically oriented research labs. Melissa Smith (@lissagoingviral) from the Icahn School of Medicine at Mount Sinai shared results from using more than 1,300 SMRT Cells over the years — most of them for disease-focused research, but also covering microbial sequencing, immune profiling, epigenetics, ecology, and more. Her team has been working with the Sequel II System since January for applications ranging from honing targeted assays for disease-associated genes to performing targeted Iso-Seq for phasing drug targets with severity loci.
They’re also using SMRT Sequencing to detect structural elements — including extremely long and GC-rich repeat expansions — and to characterize diversity in the immunoglobulin loci. Going forward, she added, they aim to use the scIso-Seq method to resolve isoform diversity at the single-cell level.
In a separate talk, LabCorp’s Brian Krueger (@h2so4hurts) discussed the use of SMRT Sequencing for clinical research related to HLA typing, viral genome sequencing, high-throughput variant confirmations to reduce the need for Sanger sequencing, and more.
We also had several speakers focused on technology development or sample prep protocols. Erin Bernberg (@ErinBernberg) from the University of Delaware reported on using the Agilent Femto Pulse for high-resolution, highly sensitive fragment analysis and on the low-input protocol, which her team used for a recent study of ice worms.
Eugenio Daviso from Covaris talked about the use of adaptive focused acoustics for gentle cell lysis and extraction of high molecular weight DNA. Mount Sinai’s Ethan Ellis presented results from the HLS-CATCH method, which involves the use of the SageHLS instrument with CRISPR design methods to target and extract large genomic fragments for sequencing while avoiding pseudogenes and other confounding regions.
In a lightning talk, NEB’s Kelly Zatopek shared data from RADAR-seq, an amplification-free method for detecting and quantifying a wide variety of DNA damage types across a genome. Finally, Shana McDevitt from the California Institute for Quantitative Biosciences shared the core lab perspective as she discussed sample size, purity requirements, and extraction protocols for PacBio sequencing.
We enjoyed connecting with many of our users in Delaware and learning about their latest discoveries and ideas for future research. A huge thanks to our wonderful SMRT Sequencing community for an engaging and exciting meeting!
The most important creatures in a tropical rainforest aren’t necessarily the ones you can see. They work their magic underground, recycling organic matter and processing and transporting vital nutrients for their leafy neighbors above ground.
Microbiologist Joe Taylor wants to learn all about what they are and what they do. And now a grant from PacBio and Maryland Genomics will enable him to reveal some of the secrets of the soil in endangered South-East Asian rainforests.
Taylor, a Career Development Fellow in Microbiomes and Metagenomes at the University of Salford in the United Kingdom, was selected to receive the 2019 Metagenomics SMRT Grant to explore the influence of nutrient concentrations on microbial community composition and nutrient cycling capacity in the soils of an old-world tropical rainforest in Danum Valley, Borneo.
Despite worldwide attention on South-East Asian rainforests due to their charismatic megafauna of orangutans, clouded leopards and elephants, and the threat to the habitat by overlogging, oil palm plantations and climate change, little is known about their smaller, more abundant inhabitants.
“We don’t really have a good understanding of the below ground diversity in South-east Asian rainforests – particularly microbes, but also underground fauna like worms. All these groups have important interactions that will affect rainforest plants,” Taylor said. “Microbes, such as fungi and bacteria, interact with rainforest trees in very important ways. You could lose the majority of mammals and birds in a forest (not that we’d want to) and it would still survive in part. But take away the fungi and bacteria, and there would be no forest.”
Old world rainforests differ from their neotropical counterparts in the Americas, which tend to get much more research attention and investment. One major difference is that the diversity of trees in the rainforests of Borneo is dominated by a single family, the dipterocarps, which form ectomycorrhizas, a type of symbiotic relationship with fungi that is rare in neotropical rainforests. Unlike the fungi in other mycorrhizal relationships, such as arbuscular mycorrhiza and ericoid mycorrhiza, ectomycorrhizal fungi do not penetrate their host’s cell walls. Instead, they form an intercellular interface of highly branched hyphae that help the plant take up nutrients, including water and minerals, often helping the host plant survive adverse conditions. In exchange, the fungal symbiont is provided with access to carbohydrates.
Among the key nutrients Taylor and collaborators from the Universities of York, Manchester and Aberdeen in the UK and Universiti Malaysia Sabah and Sabah Forestry Department in Malaysia are studying is phosphorus, a limiting nutrient for plant growth in rainforests. In the rainforest, phosphorus transfer to trees is mediated by mycorrhizal fungi and bacteria.
Initial fieldwork conducted over several weeks in 2015, 2016 and 2017 has revealed different pools of phosphorus and other nutrients throughout the topographically diverse 50-hectare site. Experimental work carried out by the group has also shown that these different phosphorus pools affect the growth of arbuscular mycorrhizal and ectomycorrhizal trees differently.
In addition to DNA extracted from tree roots, Taylor and his colleagues have collected more than 200 DNA samples from soil. Short read sequencing has already generated a huge amount of taxonomic data that has provided insight into the soil composition, the spatial distribution of the different species, and how they interact. However, delineating the different metabolic pathways involved in the phosphorus cycle has been more challenging.
“It has already revealed a large amount of data on who is there, but we don’t know what they are doing,” Taylor said. “Nutrient concentrations throughout the plot are highly variable and show clear effects on above-ground plant diversity and root-associated fungi, but we do not know to what extent nutrients influence microbial function within the soil.”
They are hoping metagenomic profiling via SMRT Sequencing will provide the functional genomic information they need to elucidate the metabolic dynamics both within and between species.
“Metagenomic sequencing on the PacBio system will empower us to identify many phylogenetically distinct soil prokaryotes and eukaryotes, looking at evolutionary relationships, as well as identify full genes coding for enzymes involved in phosphorus and nitrogen cycling across a low to high phosphorus and nitrogen gradient,” Taylor said.
Taylor said he’s been eager to try PacBio sequencing for a long time.
“There’s an incredible use of PacBio in long-read amplicon sequencing in the microbial work I do,” he said. “For detailed phylogeny and taxonomy, long reads are essential.”
As many of the microbes will likely be “unculturable,” the long, highly accurate HiFi reads will enable the discovery of novel genes from unknown species that would be challenging to characterize with other approaches.
“That’s one of the reasons why this metagenomic data is so important. It will give information on all organisms in the samples, regardless of our ability to culture them,” Taylor said. “The ideal would be to get complete genomes from this metagenomic data. That would be amazing. But so little work has been done that anything we reveal is going to be novel and useful.”
Taylor’s work is taking place in a pristine forest that has been preserved and protected from logging. But, as in all habitats, the area is still threatened by climate change.
“We’re seeing more variable weather patterns,” Taylor said. “When we first visited the rainforests in 2015, it may sound strange, but the ‘rain’ forest was going through a period of ‘drought’ in the area that brought much less water than usual. This continued drought had a significant impact on functional traits of the rainforest trees.”
Research on ectomycorrhizas is increasingly important in ecosystem management and restoration, forestry and agriculture in areas like Borneo, which hosts more than 6,000 species found nowhere else on earth, thanks to its evolutionary isolation after the last ice age.
“PacBio sequencing will fundamentally change our understanding of microbial phylogeny and the capacity of soils to cycle nutrients in South-east Asian rainforests, providing many questions for further study,” Taylor said.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing.
Thank you to our co-sponsor and PacBio certified service provider, Maryland Genomics, for supporting the 2019 Metagenomics SMRT Grant Program.
Traditional RNA-Seq is done by fragmenting cDNA and then sequencing the fragments as short paired-end reads. The problem comes when trying to reconstruct full-length isoforms from those reads during assembly. This is computationally challenging, and sometimes intractable.
The solution? Long-read isoform sequencing, according to PacBio Principal Scientist Elizabeth Tseng and PacBio user Gloria Sheynkman, a research fellow at Dana-Farber Cancer Institute. The two recently participated in a webinar, sharing their experiences using PacBio’s Iso-Seq method.
Tseng started by explaining the method and some of its applications.
“In contrast to traditional RNA-Seq, the Iso-Seq method produces full-length cDNA, and using the PacBio long, accurate reads, can sequence the full transcript. No assembly is required,” Tseng said.
She also discussed some of the bioinformatics tools available, including SQANTI2, which can be used to classify full-length transcripts against annotations such as GENCODE, and as a quality control tool.
The Iso-Seq method in action
At Dana-Farber, Sheynkman uses the Iso-Seq method to characterize cancer cells and create complete, accurate transcriptomes.
In one example, she ran five breast cancer cell lines and eight melanoma samples as pooled libraries, each on a single SMRT Cell 8M, achieving around 6 million polymerase reads, with an average base yield of 300 Gb and an average polymerase read length of ~50 kb. Sequencing was performed by Maryland Genomics, a PacBio certified service provider.
With the improvements in the chemistry of the new Sequel II System, Sheynkman said she has been able to capture a much wider range of transcript lengths, without having to do size selection.
“Overall, we’re really detecting a much larger range, with cDNA molecules up to 6 and 7 kb,” she said.
For both breast cancer and melanoma, she obtained about 14,000 unique genes and 11,000 unique isoforms, around 30% of which were novel. And each SMRT Link Iso-Seq job was completed in just 6-9 hours, she noted.
Towards accurate isoform quantification
But the most promising application for Sheynkman is isoform quantification.
“I think this is a really important goal for the field,” she said. “To know how many copies of each transcript is expressed in the cell will really open up a lot of avenues in biomedical research, such as having consistent biomarkers, understanding disease mechanisms, and even just fundamental biological understanding.”
While short-read sequencing can achieve gene quantification with reliable results, the complexity of isoform structure requires more comprehensive coverage.
“Isoform quantification methods are very dependent on having accurate transcript models,” Sheynkman said.
The improved depth of coverage and reduced bias in sampling full-length reads on the new Sequel II System should mitigate these limitations, Sheynkman said. And the high technical reproducibility achieved by PacBio is another big strength, she added.
Targeted or whole transcriptome sequencing?
Sheynkman further discussed cases in which targeted isoform sequencing may be preferred over whole transcriptome sequencing. By targeting only genes of interest, rare isoforms could be identified with low to moderate sequencing effort. Multiplexing could further reduce the cost and increase sample size, which could be useful for applications such as biomarker discovery and validation.
Sheynkman also provided a new solution to a common problem when doing targeted sequencing: probes. ORF Capture-Seq is designed as an easy, versatile option to make capture probes directly from available clones/PCR product within a single day, using low-cost molecular biology reagents. It also allows for the generation of many complex probe sets tailored to different genes, Sheynkman said. The complexity of the probe sets allows you to target anywhere from 1 to 1,000 genes.
The webinar also included a lively Q&A that is well worth the listen.
We were delighted to host an educational workshop at last month’s annual meeting of the American Society of Human Genetics (ASHG), where we had the opportunity to feature talks from two customers as well as an overview of SMRT Sequencing. If you couldn’t attend, check out the videos or read the highlights below.
Emily Hatas, our director of business development, kicked things off with a look at how SMRT Sequencing has evolved over the years. Compared to the first instrument we offered, the Sequel II System represents a 100-fold improvement in read length and a 10,000-fold improvement in throughput. As of last month, customers were averaging about 160 Gb per SMRT Cell, a yield more than 10 times higher than the Sequel System.
Most of the presentation focused on applications in human genome analysis. High-throughput structural variant detection, which makes use of continuous long-read (CLR) sequencing, is well-suited to population studies and can be run at a cost of about $670 per sample when running two samples on each SMRT Cell. Comprehensive variant detection, which uses HiFi sequencing to make multiple passes around each molecule for optimal accuracy, is great for disease research, particularly for solving rare diseases, and costs about $2,600 per sample, assuming each library uses two SMRT Cells. Finally, de novo assembly of reference genomes should also be based on HiFi reads, Hatas told attendees, since it achieves comparable contiguity to CLR mode with about six times higher accuracy. In addition, HiFi data cuts analysis time in half and generates much smaller files, making de novo assembly more scalable.
Next up was Naomichi Matsumoto from Yokohama City University to speak about the use of SMRT Sequencing to solve Mendelian diseases. He shared the story of how his lab discovered a 12.4 kb structural variant that’s responsible for progressive myoclonic epilepsy in two siblings. The variant was in a repetitive, GC-rich region, which was why previous attempts to find it had failed. With low-coverage whole genome sequencing on the Sequel System, his team identified the variant and later confirmed that it was causal.
Matsumoto also reported progress in understanding repeat expansion disorders — many of which have neurological components — by pairing SMRT Sequencing with new analysis tools designed to highlight repetitive areas. In one example, his team was able to distinguish between the smaller number of repeats associated with healthy controls and the larger numbers associated with symptomatic patients.
The final talk came from Shawn Levy of the HudsonAlpha Institute for Biotechnology and the recently spun out services lab, now known as HudsonAlpha Discovery, which is a division of Discovery Life Sciences. He offered a look at his team’s early access experience with the Sequel II System, which was so successful that the research institute now has four of the instruments.
His data showed the increasing output of the system over time, as well as yield increases from the HiFi method. Levy noted that accuracy improves with each pass around the molecule, but reaches a plateau at the tenth pass or so. For Iso-Seq experiments, the team saw a significant improvement in yield from the Sequel System to the Sequel II System. Levy also shared hot-off-the-presses data from a project designed to determine the quality of Iso-Seq reads that can be gleaned from FFPE samples. The longer reads made possible with this approach don’t overcome the highly fragmented DNA and RNA coming from the samples, Levy said, but they definitely improve biological resolution and enable the characterization of higher molecular weight RNA that’s present in the samples. The project required a modified Iso-Seq protocol, which is still being optimized for best performance. While conventional approaches are evaluated based on how many 200-nucleotide reads they generate, the SMRT Sequencing method resulted in an average length of 435 bases.
Levy noted that his team also uses long-read sequencing for targeted sequencing applications associated with confoundingly homologous regions and for analyzing complex rearrangements in cancer. Going forward, they will also be sequencing about 7,000 genomes using long-read WGS for the All of Us Research Program to increase discovery of structural variants.
We’d like to thank all of the ASHG attendees who made our workshop such a success! If your research includes human genetics, please consider applying for our 2019 Human Genetics SMRT Grant Program. The winner will receive complimentary sequencing from the HudsonAlpha Genome Sequencing Center of up to 12 SMRT Cells. The deadline to apply is November 22, 2019.
There’s the genome, the transcriptome, the microbiome… and now the NLRome?
Breeders and pathologists have long been interested in uncovering the secrets of plant immunity, and much of their attention has been focused on receptors that can activate immune signalling: cell-surface proteins that recognize microbe-associated molecular patterns (MAMPs), and intracellular proteins that detect pathogen effectors, including nucleotide-binding leucine-rich repeat receptors (NLRs).
Hundreds of NLR genes can be found in the genomes of flowering plants. They are believed to form inflammasome-like structures, or resistosomes, that control cell death following pathogen recognition, and are being investigated as candidates for engineering new pathogen resistances.
For these reasons, scientists are keen to create inventories of NLR genes at different taxonomic levels. But their efforts have been hindered by the extraordinarily polymorphic nature of the gene family, patterns of allelic and structural variation, and clusters with extensive copy-number variation.
Two research teams have successfully overcome these challenges by combining resistance gene enrichment sequencing (RenSeq) with SMRT Sequencing.
In a recent paper in Cell, a multi-institutional team led by Felix Bemm and colleagues at the Max Planck Institute for Developmental Biology in Germany detailed their creation of a nearly complete species-wide pan-NLRome in Arabidopsis thaliana.
The sequences they obtained allowed them to define the core NLR complement, as well as to chart integrated domain diversity, describe new domain architectures, assess presence or absence of polymorphisms in non-core NLRs, and map uncharacterized NLRs onto the A. thaliana Col-0 reference genome.
“Reference genomes likely include only a fraction of distinct NLR genes within a species, which in turn has made it impossible to obtain a clear picture of NLR diversity based on resequencing efforts,” the authors wrote.
“Our work provides a foundation for the identification and functional study of disease-resistance genes in agronomically important species with more complex genomes,” they added.
Using RenSeq to trace NLR evolution in tomatoes
Another team from the University of California at Berkeley, led by Brian Staskawicz and first author Kyungyong Seong, explored the NLRome of the tomato plant Solanum lycopersicum.
This important crop is challenged by more than 200 diseases caused by diverse pathogens. Low genetic diversity and changes in the resistance (R) genes of the cultivated tomato during domestication have led to a heightened urgency for genetic improvement. Scientists are keen to draw upon the hereditary disease resistance of wild tomato species that have co-evolved with their pathogens in highly diverse habitats, and R genes in particular, which have shown more durable resistance against multiple pathogens.
Previous attempts to use RenSeq to selectively capture and sequence NLRs in tomato have been limited by the method’s inability to completely resolve highly repetitive sequences and physical clusters of NLRs, the team noted in a bioRxiv preprint, so they turned to SMRT Sequencing as well.
They employed SMRT RenSeq to identify NLRs from 18 Solanaceae accessions (S. lycopersicum Heinz, Nicotiana benthamiana, Capsicum annuum ECW20R, plus 15 wild tomato accessions belonging to five species).
They produced 264 to 332 high-quality NLR gene models in tomato, and annotated 314 NLRs alongside the reference genome of S. lycopersicum Heinz.
“Our RenSeq results improved the annotation of 128 NLRs, including 13 existing annotations which were incomplete because of mis-assembly or unfilled gaps,” the authors wrote.
“We demonstrated that SMRT RenSeq is a cost-effective, efficient alternative to the whole genome sequencing. We also verified that SMRT RenSeq was capable of…resolving the complexity of NLRs and their clusters.”
The team used the gene models and annotations to explore NLR evolution, but noted that larger scale comparative studies including evolutionarily distant wild tomato species should be done to provide more comprehensive insights, and to expand the scope of NLRome for genome engineering and breeding of tomatoes.
“Our study provides high quality gene models of NLRs that can serve as resources for future studies for crop engineering and elucidates greater evolutionary dynamics of the extended NLRs than previously assumed,” the authors wrote.
The PIK3CA oncogene has been the target of intense research scrutiny for decades. Remarkably, though, a new paper in Science today reports completely novel findings about compound mutations that are associated with patients who respond extremely well to targeted therapies. While more studies are needed, this work has important implications for delivering treatment to patients with breast cancer and other common cancers.
“Double PIK3CA mutations in cis increase oncogenicity and sensitivity to PI3Kα inhibitors” comes from lead author Neil Vasan, senior authors Maurizio Scaltriti and José Baselga, and collaborators at Memorial Sloan Kettering Cancer Center, the Icahn School of Medicine at Mount Sinai, one pharma company, and other institutions. The project emerged from follow-up studies of a breast cancer patient previously identified as a super-responder to the targeted PI3Kα inhibitor alpelisib. It turned out the patient had double PIK3CA mutations, so scientists embarked on an effort to find out whether that mattered for response to treatment — and what they learned could change how oncologists approach therapy selection.
As it turns out, multiple PIK3CA mutations are far more common than expected; the vast majority are double mutations. Researchers detected them across a wide variety of cohorts in 12% to 15% of breast cancers and other types of cancer. Previously, that number was believed to be less than 1%.
According to the authors, that discrepancy can be attributed to the approaches traditionally used to analyze PIK3CA mutations in cancer genomes. “The common practice of sequencing only certain single-nucleotide variants or some but not all exons across a gene likely underestimates the frequency of multiple mutations in PIK3CA mutant cancers,” Vasan et al. write.
For this project, scientists deployed SMRT Sequencing to analyze the full PIK3CA gene, which not only gave them more complete information, but also allowed them to phase mutations and determine when the double mutations occurred in the same allele. “Establishing their allelic configuration is important because cis mutations would result in a single protein with two mutations, whereas trans mutations would result in two proteins with separate individual mutations, and these could have different functional consequences,” the team notes.
But proving the mutations occurred in cis was no easy task. “To study the allelic configuration of double mutations, we faced several technical hurdles based on our observation that the most frequent double PIK3CA mutants are located far apart in genomic DNA,” the authors explain. This meant that short-read and even Sanger sequencing could not span the distance. It also meant that degraded DNA from FFPE samples could not yield the full-length sequence that was required. Researchers went back to patients and collected new samples that were frozen prior to sequencing. With those samples and long PacBio reads, scientists were able to distinguish patients with cis versus trans mutations.
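To make the cis versus trans distinction concrete, here is a minimal sketch of how reads spanning both mutation sites could be used to phase a pair of mutations. It is an illustration only, not the authors' pipeline: the function name, the input format, and the `min_fraction` threshold are all hypothetical, and real tumor samples (with normal-cell contamination and subclones) would need a more careful model.

```python
from collections import Counter


def classify_configuration(reads, min_fraction=0.1):
    """Classify two mutations as 'cis', 'trans', or 'undetermined'.

    reads: iterable of (has_mut_a, has_mut_b) booleans, one tuple per
        long read that spans BOTH mutation sites.
    min_fraction: hypothetical threshold; a haplotype counts as present
        only if at least this fraction of spanning reads supports it.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return "undetermined"

    def present(haplotype):
        return counts[haplotype] / total >= min_fraction

    both = present((True, True))
    only_a = present((True, False))
    only_b = present((False, True))

    # cis: the two mutations travel together on the same molecule
    if both and not (only_a or only_b):
        return "cis"
    # trans: each mutation sits on a different copy of the gene
    if only_a and only_b and not both:
        return "trans"
    return "undetermined"


# Toy read sets (made-up counts, not data from the paper)
cis_reads = [(True, True)] * 12 + [(False, False)] * 10
trans_reads = [(True, False)] * 9 + [(False, True)] * 11

print(classify_configuration(cis_reads))
print(classify_configuration(trans_reads))
```

The key requirement is that each read covers both sites, which is why reads long enough to span widely separated mutations matter here.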
“The overall consequence of these cis mutations is a phenotype of enhanced oncogenicity and greater sensitivity to PI3Kα inhibitors,” the team writes. “Our findings provide a rationale for testing whether patients with multiple–PIK3CA-mutant tumors are markedly sensitive to PI3Kα inhibitors.”
We caught up with Vasan shortly before this publication was released and asked him about the implications of these findings. “This gene has been so well studied for decades, it’s humbling that we found something new,” he told us. “In the cancer sequencing field, I think that we’ve hit a plateau in terms of single nucleotide variation. From a discovery point of view, we need to focus on higher-order interactions such as these double mutations.”
The cultivation and conservation of one of the most important commercial fishes in the world may come down to sex determination — how can you successfully breed a species without knowing the sex of your stock?
A Japanese research team has come up with a solution, thanks to a new Pacific bluefin tuna reference genome and the male-specific DNA markers they were able to identify as a result.
In a study published recently in the Nature journal Scientific Reports, first author Ayako Suda and lead author Atsushi Fujiwara of the Japan Fisheries Research and Education Agency in Yokohama described how they developed a PCR assay to accurately identify male tuna, based on a new high-quality PacBio and Illumina assembly.
Wild populations of Thunnus orientalis have been in drastic decline due to overfishing, leading Japan and other nations to develop full-life-cycle tuna aquaculture systems as early as the 1970s. They have identified optimum rearing conditions for the species, but these conditions are difficult to achieve. Spawning is strongly influenced by environmental factors such as water temperature, for example, and only some females spawn in cultivation conditions, reducing genetic diversity. Controlling the sex ratio in sea cages could help increase the production of fertilized eggs, but the Pacific bluefin tuna lacks morphological sexual dimorphism, making it difficult to identify and remove males. Furthermore, identifying males through gonad inspection can be lethal and inconclusive in young fish.
So the researchers set out to improve the T. orientalis genome, first assembled in 2013. Despite being used as a reference, that assembly is highly fragmented, with a large number of gaps in its scaffolding.
By combining sequence data from PacBio long reads and Illumina short reads, the Japanese team created a 787 Mb genome assembly in only 444 scaffolds with a contig N50 of 3 Mb. This represents a 376-fold increase in contiguity and a 148-fold reduction in the number of gaps compared to the existing reference.
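For readers unfamiliar with the N50 statistic used above, the following short sketch shows how it is conventionally computed: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly size. The contig lengths below are made-up toy values, not figures from the tuna assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in bp (illustrative only)
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # total is 300, and 80 + 70 reaches 150, so this prints 70
```

A higher N50 means more of the genome sits in long, uninterrupted pieces, which is why the statistic is a common shorthand for assembly contiguity.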
By analyzing resequencing data from several males and females, the team identified 250 male-specific SNPs among more than 30 million polymorphisms and pinpointed seven distinct regions. They then focused on one in particular: a 3,174 bp section of a single scaffold that contained 51 male-specific variants. They created a PCR-based sex identification assay targeting this stretch of DNA and achieved high accuracy in testing across 115 fish.
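As a rough illustration of what "male-specific" means in a screen like this, the sketch below keeps only those variant sites where every male carries the alternate allele and no female does. This is a simplified assumption about the filtering logic, not the authors' published method; the function name, the input layout, and the toy genotypes are all hypothetical.

```python
def male_specific_sites(genotypes, sexes):
    """Return site IDs where all males carry the alt allele and no female does.

    genotypes: dict mapping site ID -> dict of sample -> alt-allele count (0, 1 or 2)
    sexes: dict mapping sample -> "M" or "F"
    """
    males = [s for s, sex in sexes.items() if sex == "M"]
    females = [s for s, sex in sexes.items() if sex == "F"]
    if not males or not females:
        raise ValueError("need at least one male and one female sample")

    hits = []
    for site, calls in genotypes.items():
        in_all_males = all(calls[m] > 0 for m in males)
        in_no_females = all(calls[f] == 0 for f in females)
        if in_all_males and in_no_females:
            hits.append(site)
    return hits


# Toy data (made-up sample names, sites and genotypes)
sexes = {"m1": "M", "m2": "M", "f1": "F", "f2": "F"}
genotypes = {
    "scaf1:100": {"m1": 1, "m2": 1, "f1": 0, "f2": 0},  # male-specific
    "scaf1:250": {"m1": 1, "m2": 0, "f1": 0, "f2": 0},  # missing in one male
    "scaf2:40": {"m1": 1, "m2": 1, "f1": 1, "f2": 0},   # also seen in a female
}
print(male_specific_sites(genotypes, sexes))
```

Sites that survive such a filter and cluster within a short stretch of sequence are natural targets for a PCR assay.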
“Sex identification using our PCR assay is easy, requiring minimal handling of individuals. Moreover, sex of juveniles can be identified using our method, allowing the sex ratio in cages to be adjusted at an early stage, which could enhance breeding programs,” the authors state.
They also note that the approach might be less stressful to tuna, and requires less effort than sampling approaches based on fin clips or muscle tissues from live individuals.
“It could also be used to obtain data from wild populations, providing useful information for the management and conservation of these natural stocks,” they add.
The assay could be used in surveys that evaluate sex ratios, for example, rather than waiting for fish to reach sexual maturity. Incorporating the sex of fish while tracking their migration patterns, a strategy used by Barbara Block and colleagues in California, could also provide valuable information for the management of wild tuna fisheries.
“Our improved draft genome provides a solid foundation for future population and resource management studies of Pacific bluefin tuna,” the authors conclude.
At the annual meeting of the American Society of Human Genetics in Houston, PacBio scientists presented how our Sequel II System performs for structural variant (SV) detection and for whole transcriptome sequencing. The educational workshop focused on experiments that can be done using a single SMRT Cell 8M on the Sequel II System.
The event kicked off with Aaron Wenger walking through SV analysis, which he said has mirrored the development path of single nucleotide variants, from proof-of-concept to individual rare disease studies and now to large cohort studies like SOLVE-RD in Europe and All of Us in the United States.
Wenger showed a variety of SV types that can be detected with highly accurate SMRT Sequencing, such as insertions, deletions, translocations, and inversions. He also showed the standard SV discovery and analysis workflow, and the precision and recall performance of this method. He noted that a single SMRT Cell is now sufficient to achieve high recall for SV discovery for two human samples.
Next up, Elizabeth Tseng (@magdoll) spoke about the Iso-Seq method for full-length transcript sequencing that eliminates the need for bioinformatics-driven isoform assembly and enables direct ORF prediction even without a reference genome. With the Iso-Seq Express Kit, she noted, customers can generate reliable results from as little as 60 nanograms of total RNA.
Using an Alzheimer’s brain Iso-Seq dataset released for ASHG, Tseng demonstrated the comprehensive, highly accurate results for full-length isoform detection using SMRT Sequencing. More than 99% of transcripts reported by the Iso-Seq method are more than 99% accurate, she added. She also looked at single-cell Iso-Seq experiments, showing examples of SMRT Sequencing paired with DropSeq or 10x Genomics workflows, and at how Iso-Seq analysis can be used to solve rare diseases or characterize cancer fusion genes.
Thanks to all the attendees who took the time to listen to our presentations. We hope you enjoyed the meeting as much as we did!
In a new Science publication, researchers from the University of Washington and other institutions report detailed analyses revealing the adaptive importance of copy number variants (CNVs) acquired from Denisovan and Neanderthal ancestors, the closest relatives of modern humans, in the modern-day Melanesian population. The team used PacBio long-read sequencing to study these complex stretches of DNA and the Iso-Seq method to generate full-length transcript data.
“Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes” comes from lead author PingHsun Hsieh (@phhBenson), senior author Evan Eichler, and collaborators. For the project, they focused on the Melanesians, an oceanic population that is known to have more Denisovan and Neanderthal ancestry than other groups. This made an excellent foundation for studying the role of CNVs in adaptation and archaic introgression.
“Relatively little is known about the extent to which CNVs contribute to the genetic basis of local adaptation and, more importantly, whether CNVs introgressed from other hominins may have been targets of adaptive selection,” the authors write.
As part of this project, scientists focused on “two of the largest and most complex” CNVs found in the Melanesian genome — a 5 kb duplication and a 73.5 kb duplication, both on chromosome 16 — for a deeper investigation. “Both events are largely restricted to Melanesians and the Denisovan archaic genome F2 and are thought to be involved in a single >225-kb complex duplication (DUP16p12) introgressed from the Denisovan genome,” they report. “This region has been difficult to correctly sequence and assemble, and only recently has the sequence structure of the ancestral locus (>1.1 Mb) been correctly resolved.”
To better understand the original duplication, the team generated 75-fold whole-genome coverage of a Melanesian individual using SMRT Sequencing. This allowed them to narrow down the insertion location to a 200 kb region enriched in segmental duplications, a feature that “predisposes the region to recurrent structural rearrangements associated with autism and developmental delay,” Hsieh et al. write.
By applying the Segmental Duplication Assembler, a methodology recently published in Nature Methods, they wound up with a 1.8 Mb contig including the correctly assembled Melanesian duplication. “Notably, the sequence-resolved assembly shows that the actual length of DUP16p12 duplication polymorphism is ~383 kb, which is longer than previously thought,” the authors report. “Sequence and phylogenetic analyses suggest that the variant originated from a series of complex structural changes involving duplication, deletion, and inversion events ~0.5 to 2.5 million years ago within the Denisovan ancestral lineage.” That duplication was inserted into the Denisovan genome within the last 200,000 to 500,000 years and subsequently introgressed into the ancestors of Melanesians between 60,000 and 170,000 years ago, the authors conclude.
The team performed Iso-Seq with hybridization capture probes toward this region to produce full-length gene models and better characterize the functional effects of CNVs in the Melanesian genome. Based on their results — including a comparison to gene models from other humans and the chimpanzee — the scientists found that the 383 kb duplication is likely adaptive. “This helps to explain why this polymorphism has become nearly fixed within the Melanesian populations (>80%) despite its large size, which is typically regarded as selectively disadvantageous,” they note. “Notably, the Melanesian-specific gene NPIPB shows ~3% amino acid divergence and evidence of positive selection despite its recent origin.” The scientists predict that the proximity of this duplication to a genomic region associated with autism (chr16p11.2) will have an impact on the frequency of autism-associated rearrangements in the Melanesian population.
Based on these results and other data confirming Neanderthal-origin CNVs in the Melanesian genome, the scientists were able to “reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations,” they write. “This study highlights the substantial large-scale genetic variation that remains to be characterized in the human population and the need for development of additional reference genomes that better capture the diversity of our species and complete our understanding of human genes.”
We caught up with Hsieh at ASHG 2019, where he was presenting a poster on this research. He summarized the project by stating, “The high-quality, long-read sequencing data opens up an unprecedented venue to study variants in complex genomic regions. The ability to access these new variants helps us advance our understanding of the biology and evolution of our own species.”