This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
The genome of the rose is almost as complicated as its connotations when given as a gift on Valentine’s Day or other special occasions.
Although the rose genome is relatively small, at 400-750 Mb across seven chromosomes, rose cells carry multiple copies of that basic chromosome set, and the number of copies varies widely between commercial varieties. Some are diploids, with two homologous copies of each chromosome (like humans, with one from the mother and one from the father), while others have as many as five sets (pentaploids). Most are tetraploids, with four sets of chromosomes.
To further complicate things, many roses are “segmental allotetraploids,” which means that part of the genome is behaving like an allotetraploid (with four chromosome sets from two distinct species, which occurs during hybridization) – and part of the genome is behaving like an autotetraploid (with four sets of homologous chromosomes).
Needless to say, parsing all of this out is challenging. But researchers from the Netherlands recently presented their solution, using HiFi reads generated by the Sequel II System.
In a workshop discussion at PAG XXVIII, Bart Nijland (@bart3601) of Genetwister Technologies (@genetwister), explained how his team set out to make a haplotype-aware assembly of Rosa x hybrida L. in order to capture its full range of genetic variation, rather than rely on more traditional assemblies which collapse the haplotypes into single sequences that could be missing critical information.
“For a highly heterozygous, highly complex, commercially important species like the rose, there is a huge benefit to making a haplotype-aware assembly,” Nijland said. “A lot of the existing technologies don’t perform very well in doing this. So we were very happy when PacBio released its HiFi protocol. Due to the high accuracy of the reads, we thought this could really help us in solving this challenge.”
The next challenge was isolating DNA from the leaf tissue of a tetraploid rose variety, which is notoriously difficult because of secondary metabolites. Once that was overcome and the sample was processed to create a HiFi SMRT library, speedy sequencing of four SMRT Cells 8M was performed on the Sequel II System at Radboud UMC. The result was more than two terabases of raw polymerase data, with an average yield of more than 500 Gb per SMRT Cell.
“We did a k-mer analysis to investigate the heterozygosity of the sample. Due to the high accuracy of the reads, we could nicely see four distinct peaks, which you would expect in a heterozygous, tetraploid sample,” Nijland said. “And when mapping the HiFi reads, we could already distinguish four haplotypes. So we were very happy to see this.”
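For readers unfamiliar with k-mer analysis, here is a minimal Python sketch of the idea, not the tool Nijland's team used. It counts every k-length substring in a set of reads and then tallies how many distinct k-mers occur at each multiplicity. In a heterozygous tetraploid sample, k-mers found on one, two, three, or all four haplotypes would be expected to pile up at roughly one, two, three, and four times the per-haplotype coverage, which is where the four peaks come from. The function names and toy reads are assumptions made for illustration, and the sketch skips refinements such as merging reverse-complement k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Toy reads, made up for illustration (not rose data).
reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
spectrum = kmer_spectrum(count_kmers(reads, k=4))
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Plotting multiplicity against the number of distinct k-mers gives the histogram in which the peaks appear. Real datasets are usually handled with a dedicated k-mer counter rather than an in-memory dictionary.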
In order to get an even better picture of the variation between the diploid and tetraploid varieties, Nijland and colleagues, including Henri van de Geest (@geesthc) and Mark de Heer, performed a de novo assembly using FALCON and Canu.
“Our assembly is very much improved and we were able to separate many of the haplotypes,” Nijland said.
The next step is to improve the assemblies even further by using Bionano or HiC technologies, which Nijland is hoping will help separate some of the alleles that were extremely similar due to being a segmental allotetraploid.
“We managed to assemble a heterozygous, polyploid genome, without the need for ultra high molecular weight DNA, which is required for a lot of other long-read sequencing,” Nijland said. “Also, the sequence coverage which is required in the assembly is lower, and because of the high accuracy, the computation of the assemblies is much less.”
“Most importantly, we’re getting a better representation and better overview of genomic content in the assembly. This provides a very valuable tool for molecular breeding efforts in rose.”
Catch up on other PAG presentations in a recent blog post and watch Nijland’s full PAG talk here:
We hear a lot about the growing crisis of antibiotic resistance in human health, but it turns out this is just the most visible place it appears as it moves through our complex modern environment. For example, when intensive farming is used to feed large urban populations, antibiotic resistance can first emerge on farms and gain access to human communities through the food system.
One of the key groups on the front lines of monitoring antibiotic resistance from farm to fork in the United States is the National Antimicrobial Resistance Monitoring System (NARMS). NARMS was launched in 1996 as an interagency partnership among the USDA, the FDA, the CDC, and state and local health departments to protect public health by tracking changes in the antimicrobial susceptibility of bacteria in food animals, retail meat and ill people.
Nationwide, public health labs submit Salmonella, Campylobacter, Shigella, Escherichia coli O157, and Vibrio isolates from clinical specimens and outbreaks to the CDC for testing. In addition, 19 states collect samples of retail chicken, ground beef and pork chops every month for culturing, serotyping, antimicrobial susceptibility testing, and genome sequencing by the FDA. Finally, the USDA conducts similar tests on bacteria isolated from food animals at randomly sampled, nationally representative slaughter and processing plants throughout the country.
By combining information from all these sources, NARMS can detect emerging trends in resistance, understand the genetic mechanisms of resistance, link illnesses to specific sources or practices, educate consumers, and develop data-driven recommendations for improving antibiotic stewardship.
One of the tools NARMS uses for bacterial whole genome sequencing is PacBio long-read sequencing, prized for its ability to assemble not only chromosomes but also plasmids and other accessory genome elements that frequently carry drug resistance genes. Over the years, scientists at NARMS have used PacBio reference genomes to facilitate numerous comparative genomics analyses of Salmonella, Campylobacter and Enterococcus strains, examining how virulence and resistance to β-lactams, ciprofloxacin, linezolid and other antibiotics evolve at the molecular level.
Adding to this body of work, NARMS Director Patrick McDermott and collaborators recently reported applying SMRT Sequencing to 11 E. coli isolates collected from retail meats. One of the explicit goals of the study was to add more closed plasmids carrying quinolone resistance to their reference database, expanding our understanding of this emerging challenge to the treatment of Gram-negative infections. All the E. coli strains selected for this study were resistant to ciprofloxacin and known to carry plasmid-mediated quinolone resistance (PMQR) elements. The team generated “closed, circular chromosomes and plasmids from each isolate,” they write.
One key finding of the study was that seven of the plasmids analyzed did not match any existing sequences in GenBank. The authors commented, “This demonstrates the importance of increased sequencing of plasmids even in well-studied bacteria such as E. coli, since completely new plasmids are still being discovered.” They also note that while the prevalence of PMQR genes in the US food supply is currently low and the E. coli strains from this study are unlikely to cause disease in humans, the identification of numerous novel plasmids suggests the potential for further spread to other strains or genera.
Furthermore, the authors emphasized that “This work shows the value of long-read sequencing in de novo characterization of [antimicrobial-resistant] plasmids.” More specifically, “Using only short-read sequencing data makes it difficult to accurately identify plasmids or fully characterize them.”
For example, closed plasmids are required to identify when multiple resistance genes or multiple copies of the same gene are co-located in one plasmid. This more complete information is important for uncovering the potential for co-selection of resistance. Another key finding from the paper is that while fluoroquinolones are not commonly used in food animal production, seven out of the 11 PMQR plasmids sequenced in this study also carried genes for resistance to tetracycline, “the highest selling antimicrobial for food animals in the United States.” Continued use of tetracycline in food animals could therefore drive co-selection for fluoroquinolone resistance in E. coli.
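To make the co-location idea concrete, here is a small Python sketch of the kind of lookup that closed plasmids make possible. It assumes each plasmid has already been annotated with the antibiotic classes its resistance genes cover. The plasmid names, the annotations, and the co_located helper are all invented for illustration and do not come from the study.

```python
# Hypothetical annotations: plasmid name -> antibiotic classes covered by
# the resistance genes found on that closed plasmid. All values are invented.
plasmid_resistance = {
    "plasmid_A": {"quinolone", "tetracycline"},
    "plasmid_B": {"quinolone"},
    "plasmid_C": {"quinolone", "tetracycline", "sulfonamide"},
}


def co_located(annotations, class_a, class_b):
    """Return the plasmids that carry resistance to both drug classes."""
    return sorted(
        name
        for name, classes in annotations.items()
        if class_a in classes and class_b in classes
    )


print(co_located(plasmid_resistance, "quinolone", "tetracycline"))
# prints ['plasmid_A', 'plasmid_C']
```

A plasmid that appears in this list is a candidate for co-selection: use of either drug would favor bacteria carrying it, and resistance to the other drug comes along for the ride. With fragmented short-read contigs, the two genes could land on separate contigs and the link would be missed.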
Learn more about the methods and workflow for bacterial whole genome sequencing.
People of Japanese descent just moved a little closer toward the promise of precision medicine thanks to a population-specific reference genome based on the de novo genome assembly of three Japanese individuals. A new preprint describing the work shows that SMRT Sequencing was instrumental in the achievement.
Scientists from Tohoku University, led by Jun Takayama (@jntkym), Kengo Kinoshita (@kk824), Masayuki Yamamoto, and Gen Tamiya, aimed to create an improved reference genome resource that would better represent the genetic background of a Japanese population than the current human reference genome. “Some ethnic ancestries are under-represented in the international human reference genome (e.g., GRCh37), especially Asian populations, due to a strong bias toward European and African ancestries in a single mosaic haploid genome consisting chiefly of a single donor,” they write.
To address that challenge, they sequenced the genomes of three Japanese individuals to more than 100-fold coverage with PacBio SMRT Sequencing. The contig N50 value for each genome was approximately 20 Mb. Bionano optical maps were used to perform hybrid scaffolding to boost contiguity even further. “These and other assembly statistics were better than or comparable to other published de novo assemblies,” the authors report.
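For readers new to the metric, contig N50 is the length L such that contigs of length L or longer hold at least half of the assembled bases. The Python sketch below shows the standard calculation on made-up contig lengths. The function name and the numbers are illustrative only and are not taken from the preprint.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units), made up for illustration.
print(n50([50, 40, 30, 20, 10]))  # prints 40
```

In the toy example the total is 150, so the running sum first reaches the halfway mark of 75 at the second-longest contig.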
Fig 1a. Construction of JG1: PCA plot showing that the three sample donors are within the Japanese population cluster.
Next, the team had to merge all three of these assemblies to “construct a reference-quality haploid genome sequence,” they write. “We integrated the genomes using the major allele for consensus, and anchored the scaffolds using sequence-tagged site markers from conventional genetic and radiation hybrid maps to reconstruct each chromosome sequence.” The meta-assembly was designed to avoid the inclusion of rare variants and unresolved sequences for broadest possible applicability.
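The "major allele for consensus" step can be pictured as a per-position majority vote. The Python sketch below is a simplified illustration, not the authors' pipeline. It assumes the three assemblies have already been aligned column by column to the same length, and the rule of writing N when no base has a majority is an assumption made here for the example.

```python
from collections import Counter


def major_allele_consensus(aligned, tie_char="N"):
    """Per-column majority vote over equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        ranked = Counter(column).most_common()
        top_base, top_count = ranked[0]
        # A tie for first place has no major allele; mark it (assumed rule).
        if len(ranked) > 1 and ranked[1][1] == top_count:
            consensus.append(tie_char)
        else:
            consensus.append(top_base)
    return "".join(consensus)


# Three toy "assemblies", made up for illustration.
print(major_allele_consensus(["ACGT", "ACGA", "TCGA"]))  # prints ACGA
```

With three donors, a variant carried by only one of them is outvoted at each site, which matches the stated goal of keeping rare variants out of the reference.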
Takayama et al. validated the utility of this new reference genome — known as JG1 — by analyzing its representation of common variants among Japanese people and its ability to home in on causal variants for rare disease from seven Japanese families. In all cases, the population-specific reference performed at least as well as or better than other assemblies in detecting relevant variation; for example, in the rare disease case, JG1 reduced the number of false-positive variant calls from an exome analysis.
JG1 “is highly contiguous, accurate, and carries the major allele in the majority of single nucleotide variant sites for a Japanese population,” the scientists report. “We expect that population-specific reference genome such as JG1 will prove to be practical and beneficial options for genome analyses of individuals originated from the population.”
PacBio long-read sequencing is being used to develop population-specific reference genomes as part of several international research efforts. Learn more about these projects and explore detailed assembly information in our interactive map.
We are pleased to announce the winner of the 2019 Human Genetics SMRT Grant: Tychele Turner, an assistant professor who recently joined the Washington University in St. Louis School of Medicine.
Turner’s research focuses on neurodevelopmental disorders, particularly on finding answers to unsolved cases. Her project aims to sequence members of a family affected with autism, using long reads and the high accuracy of HiFi sequencing to try to identify a causal genetic variant. We spoke with her to learn more about this winning proposal.
Q: How did you get involved in studying neurodevelopmental disorders?
A: My interest in neurodevelopmental disorders goes back to my graduate school days. I worked in Aravinda Chakravarti’s lab, where I focused on studying autism, especially in families with multiple affected girls. In autism, there is a sex bias; about 80% of all cases are male. When you have a female with autism, that’s pretty rare, and when you have multiple affected females in a family, that’s even rarer. The prevailing thought is that it might just take a more severe mutation for a girl to become affected with autism. That’s where I started my research career.
Then I moved to Evan Eichler’s lab for my postdoc, where I did large-scale assessment of children with neurodevelopmental disorders — looking at thousands to tens of thousands of individuals using microarrays, whole-exome sequencing, and short-read whole-genome sequencing. We were able to find a lot of new genetic components, particularly from de novo mutations.
Q: Why is your lab focused on uncovering new genetic components associated with autism?
A: We can only explain about 30% of all cases today. It seems low because the heritability of autism is high, so we think there is more to discover. To find these things we haven’t been able to see before, I think part of the issue may be technology. That’s why I was really excited when the opportunity came for the PacBio grant. That’s just the kind of thing we might need to find the variation we can’t explore with the older technologies.
Q: How is the research community trying to get answers for the remaining 70% of cases?
A: One approach is adding more samples. As we sequence more and more people, we’re able to find more and more of those genes with statistical significance. I think the future is very bright on that front. But the other approach is using new technologies to find the types of variation that we’ve missed. We could implicate new genes and also go back to known genes and identify new mutations. I would call this completing the allelic series within the genes that reach significance. If we can get to that point, we can be very clear about all the contributing elements. It’s not fun to be limited by your technology.
Q: Why do you think HiFi sequencing could make a difference?
A: I think it’s really important because it will allow us to find structural variants — such as small deletions and duplications — that we never can see otherwise. I’ve worked a lot with whole genome sequencing from short-read data, but we’re limited with that technology. We can detect really big structural variants. If someone has a deletion that’s a megabase, we will see it. But if that person has a deletion that removes one exon of a gene, and that deletion is 200 base pairs, we have a really hard time finding that in our data. And if we do find it, we have a hard time pulling it out from the noise.
But with PacBio’s long reads, a 200 base pair deletion is no problem because you’ll see it within the actual read. You just map it to the genome, and you have your answer. That’s what I’m really excited about. It also lets you get into the GC-rich regions of the genome, which is important for repeat expansions like the one associated with fragile X syndrome.
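To illustrate what "you'll see it within the actual read" can look like in practice, here is a minimal Python sketch that scans the CIGAR string of a single aligned read for deletions. It is a generic illustration, not the Turner lab's workflow. The find_deletions helper, the 50 bp threshold, and the example alignment are assumptions, while the CIGAR operation letters follow the SAM format.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def find_deletions(ref_start, cigar, min_len=50):
    """Return (ref_position, length) for each deletion of at least min_len.

    ref_start is the 0-based reference position where the alignment begins.
    """
    deletions = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "D" and length >= min_len:
            deletions.append((ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return deletions


# Hypothetical 11.8 kb read that spans a 200 bp deletion.
print(find_deletions(1_000_000, "6000M200D5800M"))
# prints [(1006000, 200)]
```

Because the read covers thousands of bases on either side of the event, the deletion shows up as a single D operation inside one alignment. Production variant callers combine evidence from many reads rather than trusting one.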
Q: How do you foresee PacBio sequencing helping the family you are working with?
A: I’m working with John Constantino in the autism clinic here at WashU — he did a deep clinical workup on this family, which has two girls with autism who have a fairly severe phenotype. They have previously been tested with arrays and exome sequencing and so far, there have been no answers. We think the reason we haven’t been able to find a genetic event yet is probably a technology issue. Our plan is to do PacBio sequencing with the SMRT Grant and also to generate some data with complementary technologies. We’re going to go all in for this family.
Q: In your proposal, you described this approach as a “pathway for discovery.” What did you mean by that?
A: As a new lab, this is really exciting because we’re going to have the first opportunity to look at this kind of data and we think that it’s going to be important to use for other families in the future. In addition to getting an answer for this family, we can use it as a platform to show people how to solve these cases. I’m really interested in going back to the whole collection of families with autism where we don’t have an answer and figure out what’s happening. Some percentage of cases should be explained with this new approach. Having this grant will help us to do that.
We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the HudsonAlpha Genome Sequencing Center, for supporting the 2019 Human Genetics SMRT Grant Program. Explore the 2020 SMRT Grant Programs to apply to have your project funded.
What better way to start the year than a gathering of thousands of stellar scientists? We were excited, once again, to attend the Plant and Animal Genome (PAG) Conference in sunny San Diego and to showcase some of the achievements of our customers at our well-attended workshop.
For those who missed it – or just want to relive the excitement – here is an overview, and recordings of the presentations.
The workshop kicked off with our CSO Jonas Korlach looking back at the evolution of SMRT Sequencing over the last decade, and concluded with an update on the latest PacBio developments, including reduced analysis time with HiFi Reads and an ultra-low DNA input protocol, by Michelle Vierra (@the_mvierra), strategic marketing manager for plant and animal sciences.
Watch Korlach’s introductory remarks:
Watch Vierra’s full workshop talk, PacBio Update on Products and HiFi Applications
Expanding the Tree of Life
First up at the PAG XXVIII PacBio workshop was Mark Blaxter (@blaxterlab), project lead for the Sanger Institute’s Darwin Tree of Life, a position he described as his ‘dream job’. The project aims to sequence all 60,000 species believed to live in the British Isles over the next 12 years, starting with species representing 4,000 families.
“After that, we’ll move on to the genera and after that we’ll do the rest,” Blaxter said. “This requires us ramping up to do 5,000 genomes a year. If you divide that by the number of working days in a year, that’s 20 genomes a day. That’s five before coffee, another five before… It’s terrifying. But I think actually we can get there.”
The Sanger team has already generated data for 94 species, including 44 new moth and butterfly (Lepidoptera) PacBio assemblies. Combined with HiC data, they have been able to generate chromosomal, telomere-to-telomere assemblies from the HiFi reads.
“Having spent years sequencing other butterflies, this is truly transformational,” Blaxter said. “We hope we can spread this across the whole of the Tree of Life.”
Watch Blaxter’s full workshop talk: Endless Forms – Genomes from the Darwin Tree of Life Project
The Fungus Among… Plants
In a talk that might just inspire you to take on mycology, Jana U’Ren (@you_wren) of the University of Arizona discussed the fungi that live inside of plants and her studies of their biology and evolution.
U’Ren’s studies focus on symbiotic fungi found in the photosynthetic tissue of plant leaves. A single leaf can harbor dozens to hundreds of species of fungi. They live asymptomatically within their host species, and are grouped together functionally as endophytes.
Prior sequencing efforts of these endophytes were limited to ~300 base pair fragments of hypervariable regions. While that was useful for community analysis to understand where the species overlapped, it couldn’t be used for phylogenetic analyses.
“What we’re trying to do now is to answer both Where and Who these (fungi) are using PacBio sequencing.”
At the Arizona Genomics Institute, U’Ren sequenced ribosomal DNA amplicons from 25 plant species from boreal regions, yielding more than a million high-quality reads and a treasure trove of data that she is now sifting through.
“It was a beautiful dataset. We have very high-quality data that has been validated with all the culturing that we’ve been doing over the last 12 years,” U’Ren said. “We recovered a high richness of fungal OTU (operational taxonomic unit), which was what we were looking for, and what we found was this higher phylogenetic diversity.”
Watch U’Ren’s full workshop talk: Phylogenetic Insights into the Endophyte Symbiosis using PacBio Ribosomal DNA Sequencing
A Rose is a Rose
The genome of the rose is almost as complicated as its connotations when given as a gift on Valentine’s Day or other special occasions.
Many roses are “segmental allotetraploids,” which means that part of the genome is behaving like an allotetraploid (with four chromosome sets from two distinct species, which occurs during hybridization) – and part of the genome is behaving like an autotetraploid (with four sets of homologous chromosomes).
Needless to say, parsing all of this out is challenging. Bart Nijland (@bart3601) of Genetwister Technologies explained how his team set out to make a haplotype-aware assembly of Rosa x hybrida L. in order to capture its full range of genetic variation, rather than rely on more traditional assemblies which collapse the haplotypes into single sequences that could be missing critical information.
“A lot of the existing technologies don’t perform very well in doing this. So we were very happy when PacBio released its HiFi protocol. Due to the high accuracy of the reads, we thought this could really help us in solving this challenge,” Nijland said.
A k-mer analysis of their sequenced samples revealed four distinct peaks, exactly what they were expecting in their heterozygous, tetraploid samples. Further de novo assembly of diploid and tetraploid varieties by Nijland and colleagues, including Henri van de Geest (@geesthc) and Mark de Heer, provided an even better picture of the variation between them.
“This provides a very valuable tool for molecular breeding efforts in rose,” Nijland said.
Watch Nijland’s full workshop talk: The Impact of Highly Accurate PacBio Sequence Data on the Assembly of a Tetraploid Rose
Going Ape Over Iso-Seq Analysis
The work of Zev Kronenberg (@zevkronenberg) and team made headlines, and the cover of Science, when they reported a high-resolution comparative analysis of great ape genomes. During the workshop, he shared how transcriptome analysis via the Iso-Seq method led to further discoveries.
Kronenberg, then a post-doc in the lab of Evan Eichler and now a senior bioinformatics engineer at PacBio, used PacBio’s RNA sequencing method, Iso-Seq, to annotate the great ape genomes his team created, detangle several complicated loci, and enrich our biological understanding of the differences between us and our closest relatives.
“We spent a lot of effort not only ensuring that the genome assembly turned out well, but we made sure that the de novo genome annotation was done correctly and that we were able to trust the genes that we find.”
Mapping the transcriptome data of the great apes against human transcriptome data, Kronenberg and his colleagues looked for areas where they differed. He showed several examples, including a human-specific 60 kb intronic deletion that, with a bit of digging, a graduate student was able to associate with a region linked in other studies to human diet.
“Without the Iso-Seq data, that probably would have been the end of the story. But with Iso-Seq data, we were able to identify how this non-coding variant could potentially have a phenotypic effect.”
The project proved the power of combining genome and transcriptome data, Kronenberg said.
“No story is really complete with just a genome. The Iso-Seq data was absolutely central to us discovering really interesting biological candidates.”
The higher capacity and speed of the Sequel II System has made even more possible, Kronenberg added.
“We sequenced I think well over 100 SMRT Cells for only the Iso-Seq data. Today, a single SMRT Cell would do that whole project. And that, to me, is mind blowing.”
Watch Kronenberg’s full workshop talk: Characterizing Genetic Differences between Great Apes using Iso-Seq Data
HiFi Data Assemble!
Several other presentations throughout the conference demonstrated how highly accurate HiFi reads on the Sequel II System are improving results, including “HiCanu: Resolving repeats and haplotypes” by Sergey Koren (@sergekoren) of NHGRI, slides of which are available here. In addition to HiCanu, two other genome assemblers built for HiFi data also made their debut: Nighthawk from Zev Kronenberg and Hifiasm from Heng Li (@lh3lh3).
Our four poster presentations from the PAG conference are available to view:
- A Complete Solution for High-Quality Genome Annotation using the PacBio Iso-Seq Method – Elizabeth Tseng, et al.
- Beyond Contiguity: Evaluating the Accuracy of De Novo Genome Assemblies – Sarah Kingan, et al.
- Every Species Can be a Model: Reference-quality PacBio Genomes from Single Insects – Sarah Kingan, et al.
- A High-Quality PacBio Insect Genome from 5 ng of Input DNA – Jonas Korlach, et al.
Snake milking, horse blood harvesting and brewing — antivenom production is still more medieval art than modern science. But a new high-quality snake genome may finally pull it into the 21st century.
As recently reported in Nature Genetics, a team of scientists led by Somasekar Seshagiri, a former staff scientist at Genentech and now president of the nonprofit SciGenom Research Foundation (@SGRF_Science) in India, assembled the genome and transcriptome of the lethal Indian Cobra (Naja naja) using PacBio long-read sequencing and other genomic technologies.
They also created a “venom-ome,” a catalog of venom-gland-specific toxin genes they hope can be used for the development of synthetic antivenom of defined composition using recombinant technologies.
The new cobra genome is one of only a few snake genomes ever published. Previous assemblies were generated primarily using short-read sequencing, resulting in highly fragmented assemblies, “thus limiting their utility for creating a complete catalog of venom-relevant toxin genes,” the authors noted. Compared with the king cobra genome, the Indian cobra genome contains far fewer scaffolds (1,897 versus 296,399) and has 929-fold better contiguity.
“This high-quality genome allowed us to study various aspects of snake venom biology, including venom gene genomic organization, genetic variability, evolution and expression of key venom genes,” the authors wrote.
A team of scientists from AgriGenome (@agrigenome), a PacBio certified service provider, was instrumental in generating long-read PacBio whole genome and venom gland Iso-Seq data. Their bioinformatics team helped build a functional annotation pipeline that leveraged 101,761 Iso-Seq transcript isoforms to identify and correctly annotate 139 toxin genes out of the 12,346 genes expressed in the venom gland, the ‘venom-ome’. Of the 139 toxin genes, 19 were expressed primarily in the venom gland.
Targeting these core toxins — which are responsible for a wide range of symptoms in humans, including heart-function problems, paralysis, nausea, blurred vision, internal bleeding and 100,000 deaths per year worldwide — could lead to the development of a safe and effective humanized antivenom, as well as drugs to treat hypertension, pain and other disorders, the authors suggest.
“The genome and the associated predicted proteome will also serve as a powerful platform for evolutionary studies of venomous organisms,” the authors wrote.
Learn more about the methods and workflow for PacBio whole genome sequencing.
Maize researchers have been rejoicing over a New Year’s gift delivered by a group of 33 scientists: A 26-line “pangenome” reference collection.
The multi-institutional consortium of researchers used the Sequel System and Bionano Genomics optical mapping to create the assemblies and high-confidence annotations. They released the results on January 9 and presented them in several talks at the Plant and Animal Genome XXVIII Conference, less than two years after the ambitious project was funded by a $2.8 million National Science Foundation grant.
The collection includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines — the most extensively researched maize lines that represent a broad cross section of modern maize diversity — as well as an additional line containing abnormal chromosome 10.
Scientists can download the project’s raw whole genome sequencing data, RNA sequencing data, optical map data, gene annotations and gene models at MaizeGDB. The site also features browsing and data visualization tools.
Led by faculty investigators R. Kelly Dawe (@corncolors), Distinguished Research Professor at the University of Georgia, Matt Hufford (@mbhufford), associate professor at Iowa State University, and Doreen Ware, a computational biologist at USDA and Cold Spring Harbor Laboratory, the NAM Consortium also included scientists from Corteva Agriscience, who are conducting their own large-scale sequencing of the company’s maize lines.
“People have been using these particular lines for years, so everybody has been really excited to get these new references as a resource,” Hufford said. “The assemblies that have come out are better than anything else that’s out in maize.”
Maize has been extremely challenging to sequence because the vast majority of its 2.3 Gb genome — a staggering 85 percent — is made up of highly repetitive transposable elements. It is also amazingly diverse. A study comparing genome segments associated with kernel color from two inbred lines revealed that 12 percent of the gene content was not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit more than 98 percent sequence similarity.
The 26 varieties were prepped at the Arizona Genomics Institute, sequenced at the University of Georgia, Oregon State University, and Brigham Young University, and assembled by the NAM Consortium using PacBio long reads. Scaffolds were validated by Bionano optical mapping, and ordered and oriented using linkage and pan-genome marker data. RNA-seq data from multiple tissues were used to annotate each genome using a pipeline that included BRAKER, Mikado and PASA.
“We spent a lot of time on gene model annotation, validation and benchmarking against B73 (the first reference genome annotations for maize, created by Ware’s lab in 2009, and updated in 2017) and other maize genes that have been manually curated by the community,” Hufford said.
Now comes the fun part: Peering into all the data and seeing what secrets it will reveal.
“For the last few months, we have started to see the cool biology emerging,” Hufford said. “What we are seeing is a lot of structural variation linked to phenotypic traits we haven’t been able to explain before.”
In addition to answering questions about basic biology and agronomic variation, the data is shedding light on the evolution of the different maize lines.
“We’re learning about the tempo of gene loss following a genome doubling event several million years ago. It appears to be ongoing, and still in flux,” Hufford said.
Next steps for the consortium include additional functional annotations for the NAM gene models, such as transposable elements, SNPs and insertions, as well as methylome and ATAC-Seq data.
“These data will help the maize community assess the role of variation in the determination of agronomic traits,” Hufford said.
Hufford will also be using SMRT Sequencing on the Sequel II System for two other large assembly projects, on teosinte (a wild relative of maize) and on other grass species.
“I think it’s really going to help with some of these complex varieties,” he said.
Learn more about the methods and workflow for PacBio whole genome sequencing.
By Zev Kronenberg, Senior Engineer of Bioinformatics at PacBio
Since the introduction of HiFi reads, the community has embraced these long and highly accurate reads for human genome assembly and paralog resolution [1-5]. At PacBio, the assembly team (Figure 1) is working to build on the accuracy of HiFi data for direct phasing during assembly.
In diploid organisms, phasing an assembly means separating the maternally and paternally inherited copies of each chromosome, known as haplotypes. Each phased contig, or haplotig, is made up of reads from the same parental chromosome (Figure 2). Phased genomes give better quality than collapsed genomes; they provide allelic information, which can be important for studying human diseases, crop improvement, evolution, and more.
Figure 2. Phased de novo assembly. A collapsed haploid assembly meshes contigs from different haplotypes (unphased assembly), while a partially phased assembly may still switch between the two haplotypes in its primary contigs. A fully phased assembly would cleanly separate the two haplotigs.
FALCON-Unzip is a diploid-aware genome assembler that has been used to assemble and phase many PacBio genomes. It first creates a collapsed assembly, then uses heterozygous single nucleotide variants to partition the reads by haplotype and reassemble them into haplotigs. The assembly outputs are primary contigs with associated haplotigs (Figure 3).
Figure 3. FALCON-Unzip phasing and haplotig assembly steps. In the first stage, primary contigs and associated contigs are produced, and reads are aligned to the primary contigs and phased. The phase is then re-introduced to the assembly graph, followed by re-assembly.
While FALCON-Unzip has consistently given our users excellent results, it was built for long reads with higher error rates and does not take advantage of the high accuracy of the HiFi reads. In 2019, FALCON-Unzip was adapted for HiFi data, producing high-quality results. However, the current implementation still requires iterative assembly, and does not use indels for phasing. Therefore, we have started working on a new graph cleaner called Nighthawk that simplifies the assembly graph by removing cross-haplotype alignment overlaps, which can significantly speed up and improve assembly. While still a work in progress, the preliminary results are promising.
Nighthawk: A smart, efficient assembly graph cleaner
Nighthawk uses that classical bioinformatics data structure, the De Bruijn graph, to identify genetic variants (substitutions, insertions, and deletions) and remove cross-haplotype overlaps in the assembly string graph.
Most long-read genome assemblers follow the overlap-layout-consensus (OLC) workflow. The overlap stage begins with a pairwise alignment of all reads (Figure 4A). For each read, a pile of alignments to all other reads is generated. The goal of Nighthawk is to detect and remove cross-haplotype overlaps, that is, alignments between reads that come from different haplotypes. It also needs to remove other false alignments that come from paralogs, repeats, etc.
Given a pile of reads, Nighthawk builds a read-colored k-mer De Bruijn graph, where each node represents a k-mer; node colors denote a unique set of reads (Figure 4B). For each read overlap, Nighthawk calculates a read similarity score (RSS). The RSS is the number of shared variants between two reads. A positive RSS indicates that the reads are in phase with one another, while a negative RSS suggests the read overlap is cross-haplotype and should be removed (Figure 4C). Nighthawk removes overlaps with a negative RSS. The remaining overlaps are then passed on for the layout and consensus stage of assembly (Figure 4D).
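To make the scoring idea concrete, here is a minimal sketch of an RSS-style overlap filter. It is not Nighthawk's code: the input format (a per-read table of alleles at variant sites), the function names, and the exact scoring rule (plus one for each shared site where two reads agree, minus one where they disagree) are all assumptions chosen so that the score can go negative, as the description above implies.

```python
from typing import Dict, List, Tuple

# Hypothetical input: for each read, the allele it carries at each variant site
# (site id -> allele). Sites a read does not cover are simply absent.
Alleles = Dict[int, str]


def read_similarity_score(a: Alleles, b: Alleles) -> int:
    """Assumed scoring rule: +1 for every shared site with the same allele,
    -1 for every shared site with different alleles."""
    score = 0
    for site in a.keys() & b.keys():
        score += 1 if a[site] == b[site] else -1
    return score


def filter_overlaps(
    overlaps: List[Tuple[str, str]], alleles: Dict[str, Alleles]
) -> List[Tuple[str, str]]:
    """Drop overlaps with a negative score; keep everything else."""
    return [
        (x, y)
        for x, y in overlaps
        if read_similarity_score(alleles[x], alleles[y]) >= 0
    ]


# Toy data: r1 and r2 come from one haplotype, r3 from the other.
alleles = {
    "r1": {101: "A", 250: "T", 400: "-"},
    "r2": {101: "A", 250: "T"},
    "r3": {101: "G", 250: "C", 400: "AC"},
}
overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
kept = filter_overlaps(overlaps, alleles)
print(kept)
```

In this toy pile only the r1/r2 overlap survives, because both overlaps involving r3 disagree at every shared site. Substitutions and indels are treated alike here: each is just an allele string at a site.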
It is amazing to see how clean a HiFi-based De Bruijn graph is (Figure 5). This is often a work of art in itself! After running Nighthawk, the overlaps can then be passed into string graph assemblers such as FALCON for assembly.
Figure 4. The Nighthawk workflow. Nighthawk builds a colored De Bruijn graph from read overlaps. Overlaps are scored by shared variants between two reads. Overlaps with negative RSS indicate cross-phase overlaps and are removed. The resulting overlaps are passed to a string graph assembler (such as FALCON) for phased assembly.
Figure 5. A HiFi De Bruijn graph for a pile of reads from Drosophila genome sequencing. Each dot represents a k-mer (k=23), the edges denote neighboring k-mers. The larger red dots mark the head of heterozygous bubbles.
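The structure in Figure 5 can be illustrated with a small sketch of a read-colored k-mer graph. This is a toy under stated assumptions, not the Nighthawk implementation: the function names are hypothetical, reverse complements are ignored, and k is set to 4 so the example stays readable (the figure uses k=23). Nodes with more than one successor are reported as candidate bubble heads.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_colored_dbg(
    reads: Dict[str, str], k: int
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (colors, edges) for a pile of reads.

    colors maps each k-mer to the set of read names that contain it.
    edges maps each k-mer to the k-mers that directly follow it in some read.
    """
    colors: Dict[str, Set[str]] = defaultdict(set)
    edges: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in reads.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            colors[kmer].add(name)
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return colors, edges


def bubble_heads(edges: Dict[str, Set[str]]) -> List[str]:
    """k-mers with more than one successor: candidate starts of variant bubbles."""
    return sorted(kmer for kmer, nxt in edges.items() if len(nxt) > 1)


# Two toy haplotypes that differ by a single base (A vs T).
reads = {
    "r1": "GATTACAGGCT",
    "r2": "GATTACAGGCT",
    "r3": "GATTACTGGCT",
}
colors, edges = build_colored_dbg(reads, k=4)
heads = bubble_heads(edges)
print(heads)
for branch in sorted(edges[heads[0]]):
    print(branch, sorted(colors[branch]))
```

The single-base difference opens one bubble, and the read colors on its two branches split the pile into r1 and r2 on one side and r3 on the other. That partition is the kind of signal a variant-based overlap score can use.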
Testing Nighthawk on a HiFi data set
We evaluated how well Nighthawk’s RSS could distinguish in-phase and cross-phase overlaps against three ground truth sets (Table 1). In all three data sets, Nighthawk’s RSS was able to distinguish in-phase read overlaps (true positives) from cross-phase read overlaps (true negatives) while having very few false positives and false negatives.
But what effect does Nighthawk’s graph cleaning have on the assembled genome? Our team patched Nighthawk into FALCON and assembled a heterozygous (0.6%) F1 Drosophila HiFi data set. The haploid genome size is 140 Mb, so a perfectly assembled diploid genome would consist of a total of 280 Mb in primary and associated contigs.
Our Nighthawk-FALCON assembly produced 247.1 Mb of primary contigs and 14.9 Mb of associated contigs, creating a diploid genome that’s a total of 262 Mb (about 93.6% of the expected 280 Mb). The phasing accuracy, as measured by parental k-mers, was much better using Nighthawk for both primary and associated contigs compared to other methods.
Toward a truly phased assembly
We have shown that HiFi data alone can be used to effectively phase a Drosophila genome. Our new tool, Nighthawk, is an assembly graph cleaner that uses the accuracy of HiFi reads for variation detection. The phasing of the primary and associated contigs improves compared to FALCON when Nighthawk is used to filter out cross-phase alignment overlaps.
Nighthawk is still a work in progress, and many challenges remain. One such challenge is the use of alignment identity as a filter to identify cross-phase overlaps. Setting the right identity threshold is a Goldilocks problem: a filter that’s too stringent would fragment the assembly, while a filter that’s too relaxed would not remove all the false overlaps. Another challenge is complex graph structures that may arise from repeat structures, homozygosity, lack of overlap coverage, etc.
Nighthawk is only the first piece in the overlap-layout-consensus assembly process. Our team is continuing to modify string-graph algorithms to recognize the graph structures Nighthawk generates. We are excited about the new possibility HiFi data brings and believe that fast, direct phased assemblies will be feasible in the not-too-distant future.
The PacBio assembly team would like to thank Tobias Marschall (@tobiasmarschal) for the inspiration to use De Bruijn graphs for variant calling (NCBI Hackathon 2019) and Mark Chaisson (@mjpchaisson) for technical guidance on avoiding common pitfalls.
[1] Wenger et al., “Accurate Circular Consensus Long-Read Sequencing Improves Variant Detection and Assembly of a Human Genome”, Nature Biotechnology (2019)
[2] Vollger et al., “Improved Assembly and Variant Detection of a Haploid Human Genome Using Single-Molecule, High-Fidelity Long Reads”, Annals of Human Genetics (2019)
[3] Vollger et al., “Long-Read Sequence and Assembly of Segmental Duplications”, Nature Methods (2019)
[4] Garg et al., “Efficient Chromosome-Scale Haplotype-Resolved Assembly of Human Genomes”, bioRxiv (2019)
[5] Porubsky et al., “A Fully Phased Accurate Assembly of an Individual Human Genome”, bioRxiv (2019)
[6] Chin et al., “Phased Diploid Genome Assembly with Single-Molecule Real-Time Sequencing”, Nature Methods (2016)
[7] Kronenberg et al., “High-quality Human Genomes Achieved through HiFi Sequence Data and FALCON-Unzip Assembly”, ASHG Poster (2019)
[8] Garg et al., “A Graph-Based Approach to Diploid Genome Assembly”, Bioinformatics (2018)
[9] Patterson et al., “WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads”, Lecture Notes in Computer Science (2014)
[10] Koren et al., “De Novo Assembly of Haplotype-Resolved Genomes with Trio Binning”, Nature Biotechnology (2018)
A hearty congratulations to Cleo van Diemen at the University Medical Center Groningen for winning the 2019 Neuroscience SMRT Grant!
Van Diemen’s impressive proposal involves using PacBio long-read sequencing to find new genetic mechanisms associated with spinocerebellar ataxia (SCA). While some 70% of SCA patients can get clear diagnostic and prognostic information because they have one of the ~37 genes known to be associated with this condition, 30% of patients have no such clarity. In this project, van Diemen and her colleagues will use their SMRT Grant award to generate highly accurate long reads for two SCA patients with unknown disease etiology.
As team-leader of the research & development unit of the genome diagnostics section of the genetics department, van Diemen aims to introduce new technologies to help her colleagues achieve their research and diagnostic goals. In this case, she is working with a scientist focused on SCA patients to find a way to diagnose previously unsolvable cases.
So far, existing approaches have included standard linkage analysis, SNP arrays to look for some known structural variants, exome sequencing, and gene expression analysis. Now, van Diemen hopes that adding structural variant detection with SMRT Sequencing will provide some new answers. Repeat expansions are among the possible culprits. “Repeat genes have been identified in a lot of ataxias,” van Diemen says. With SMRT Sequencing, it will finally be possible “to do this genome-wide approach for new repeat genes.”
Structural variation is another potential source of causal mechanisms for the unexplained SCA cases. “There is some evidence that structural variants may play a role in ataxias,” van Diemen says. But SNP arrays lack the ability to discover new variants or to detect complex situations, such as inversions. And short-read sequencing often misses these large elements. “With long-read sequencing, it’s easier to identify them,” she adds.
Ultimately, the goal is to give all SCA patients the DNA-based information that will help them manage their condition. “There are some differences in the phenotypic spectrum, so knowing the genetic basis can help patients understand what they will face in the future and also makes it possible to consider genetic testing for family counseling,” van Diemen says. “That’s the clinical importance of having a genetic diagnosis.”
This SMRT Grant represents van Diemen and her team’s first use of PacBio sequencing. She believes it will be “a good starting point” that will help them understand how to apply long-read sequencing for larger-scale studies in the future. “We are looking forward to it,” she says. “It’s a great opportunity.”
We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the Center for Genomic Research at the University of Liverpool, for supporting the 2019 Neuroscience SMRT Grant Program.
Learn more about upcoming SMRT Grant Programs for a chance to win free sequencing.
A new preprint from lead authors David Porubsky and Peter Ebert, senior authors Evan Eichler and Tobias Marschall (@tobiasmarschal), and collaborators reports a method for generating fully phased, de novo human genome assemblies without parental data. The approach combines PacBio HiFi reads (>99% accuracy, 10-20 kb) with the short-read, single-cell Strand-seq technique. The authors provide a proof-of-principle through assembling the genome of a Puerto Rican female from the 1000 Genomes Project.
The work extends a recent publication from many of the same authors in which HiFi reads were used to produce an accurate and contiguous assembly of the human haploid genome, CHM13. To help assemble a phased diploid genome, the newer work adds Strand-seq, “a single-cell sequencing method able to preserve structural contiguity of individual homologs in every single cell.” The authors used Strand-seq to group HiFi reads by chromosome, order and orient contigs, and phase variants over long genomic distances. “Taken together, these features make Strand-seq the method of choice to be combined with high-accuracy long-read sequencing platforms to physically phase and assemble diploid genomes.”
The team generated 33.4-fold HiFi read coverage of the selected sample using the Sequel II System. They called single nucleotide variants in the HiFi reads with DeepVariant and phased variants using Strand-seq and HiFi reads. That “resulted in chromosome-length haplotypes with >95% … of all these heterozygous variants placed into a single haplotype block,” the scientists report. “With such global and complete haplotypes we assigned ~81% of the original PacBio HiFi reads to either parental haplotype 1 (H1) or haplotype 2 (H2).”
The team then used two tools, Canu and Peregrine, to assemble the haplotype-separated reads. A small number of chimeric contigs were corrected with Strand-seq data and the SaaRclust algorithm. The final contig N50s of the fully phased assemblies were 25.8 Mb and 28.9 Mb for the two haplotypes. Assemblies were found to be highly accurate, with basepair quality scores higher than QV40; nearly all gene-disrupting indels in the sequence were found to be true biological events, not assembly artifacts. By titrating HiFi read coverage, the authors found that around 15-fold coverage of each haplotype is sufficient to produce an accurate, contiguous assembly.
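For readers unfamiliar with the contiguity metric quoted above, contig N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size. The sketch below uses that standard definition; the function name and the toy contig lengths are illustrative and do not come from the preprint.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative sum, taken from the
    longest contig down, first reaches half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy contig lengths in Mb (not real data).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

For the toy lengths the total is 290, and the two longest contigs together (150) already pass the halfway mark, so the N50 is 70.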
“Our assembly strategies allow us to transition from ‘collapsed’ human assemblies of ~3 Gbp to fully phased assemblies of ~6 Gbp where all genetic variants, including [structural variants], are fully phased at the haplotype level,” the scientists report. In addition to the importance of using this method for assembling individual genomes, the authors note, “Fully phased, reference-free genomes are also the first step in constructing comprehensive human pangenome references that aim to reflect the full range of human genome variation.”
With the release of the award-winning Sequel II System, 2019 was an exciting year for the SMRT Sequencing community. We were inspired by our users’ significant contributions to science across a wide range of disciplines. As the year draws to a close, we have taken this opportunity to reflect on the many achievements made by members of our community, from newly sequenced plant and animal species to human disease breakthroughs.
“It has been another phenomenal year for science. The introduction of the Sequel II System will accelerate discovery even more, and I can’t wait to see what 2020 will hold.”
Jonas Korlach, Chief Scientific Officer
Human Biomedical Research
The year brought incredible insights into human genetics. Some researchers homed in on single mutations, while others zoomed out to explore variation on a population scale. PacBio technology was also selected for new large-scale sequencing projects, including the NHGRI Human Genome Reference Program and the All of Us program. Here are some of our favorite publications from the year:
- The mystery cause of progressive myoclonic epilepsy in a family that eluded detection in standard whole-exome sequencing was revealed with PacBio whole genome sequencing, as reported in Journal of Human Genetics and on our blog.
- New insights into specific human populations were revealed in several studies, including Melanesians, as reported in Science, and Tibetans, as reported in National Science Review.
- Double mutations in the PIK3CA oncogene were found to influence targeted therapy, as highlighted in Science and our blog.
- The importance of comprehensive variant detection was featured in several papers. University of Washington researchers Mitchell R. Vollger and Evan Eichler reported that “HiFi may be the most effective standalone technology for de novo assembly of human genomes” in their Annals of Human Genetics paper (read our blog), while members of the Human Genome Structural Variation Consortium reported “the most comprehensive assessment of SVs in human genomes to date” in Nature Communications. University of Michigan researchers Steve S. Ho and Ryan E. Mills shared their review entitled “Structural variation in the sequencing era.”
- A PLoS One paper by Mayo Clinic researchers demonstrated the use of No-Amp targeted sequencing to interrogate the sequence structure of expanded repeats in Fuchs Endothelial Corneal Dystrophy.
- The utility of the PacBio Iso-Seq method for studying disease risk genes was showcased in a Frontiers in Genetics paper by PacBio and Duke University researchers studying transcripts across synucleinopathies.
Plant & Animal Sciences
Commoner’s law of ecology states that “everything is connected to everything else,” and this was highlighted in several studies that showed the interdependence of microbes, plants, insects, and other animals. International consortia such as the Vertebrate Genomes Project, the Earth Biogenome Project, and the Sanger Institute’s 25 Genomes Project released many new reference genomes, which will only bolster our understanding of individual species as well as their interactions with their ecosystem cohabitants. Here are some of our favorite publications from the year:
- Korean scientists provided a great example of mutualistic interactions in their Nature Communications paper examining the relationships between Streptomyces bacteria, strawberry plants, and pollinating bees.
- A USDA project to sequence the spotted lanternfly showcased the power of SMRT Sequencing to rapidly generate high-quality genomes from the DNA of single insects to fight invasive species.
- The latest Nature publication from the Cantu Lab delved into a largely unexplored feature of plant genomes — structural variants — in a study of the population genetics in grapevine domestication.
- Pathologists interested in uncovering the secrets of plant immunity used PacBio targeted sequencing to create inventories of NLR genes, which are candidates for engineering new pathogen resistance (read our blog).
- For shrimp, which have notoriously hard-to-sequence genomes, an isoform-level transcriptome reference generated with the Iso-Seq method was reported in Fish and Shellfish Immunology and summarized in our blog.
Microbiology & Infectious Disease
From C. difficile to symbiotic defense systems, we were treated to new insights in the realm of microbiology. We also learned about a new way to use an old method to provide unprecedented taxonomic resolution at species and strain level and gained insight into intra-bacterial defense. Here are some of our favorite publications from the year:
- Not only has the Mount Sinai Pathogen Surveillance Program adopted SMRT Sequencing for continuous monitoring and disease control, the accumulated PacBio data has also inspired new research, including a paper published in Nature Microbiology on the discovery of a conserved orphan methyltransferase that drives C. difficile infection persistence (read our blog).
- A team of researchers at the Jackson Laboratory published a study in Nature Communications, also featured on our blog, that used HiFi sequencing to unlock the full potential of 16S rRNA sequencing, providing taxonomic resolution of the human gut microbiome at species and strain level.
- PacBio reference genomes enabled a groundbreaking study published in Nature of intra-bacterial defense genes in the human gut microbiome by researchers at the University of Washington.
- As published in Science, long reads were also used to reconstruct a tripartite symbiotic factory for a marine toxin, involving bacteria, algae and a sea slug.
Did we miss one of your favorite publications of 2019? Tweet your favorites to us @PacBio, using #PoweredbyPacBio. And check out our searchable publications database for more than 1300 examples of outstanding SMRT Science from 2019.
Neurexin genes, which have been associated with certain neuropsychiatric disorders, are known to make heavy use of alternative splicing. In a recent study, scientists used the Iso-Seq method with SMRT Sequencing to better understand splice variants in neurons derived from human induced pluripotent stem cells (hiPSCs).
The study, “Neuronal impact of patient-specific aberrant NRXN1α splicing,” was published in Nature Genetics. Lead authors Erin Flaherty (@erinkflaherty) and Shijia Zhu, senior author Kristen Brennand (@kristenbrennand), and collaborators at the Icahn School of Medicine at Mount Sinai and other institutions undertook the project to help shed light on disorders linked to exonic deletions in the neurexin-1 gene, including schizophrenia.
“Deletions occur non-recurrently (with different boundaries) between patients, and the mechanisms underlying variable penetrance and diverse clinical presentations remain unknown,” the scientists write, adding that mouse models have been of limited value in elucidating this biology. “To better understand the clinical impact of NRXN1+/− mutations, it is critical to evaluate how distinct patient-specific deletions alter the NRXN1 isoform repertoire and impact synaptic function in a human context.”
Central to the study are samples from four patients with a severe psychosis disorder who carry rare heterozygous intragenic deletions in NRXN1. Two patients carried a 136 kb deletion in the 3’ region of NRXN1 (3’-NRXN1+/-), while the other two shared a 115 kb deletion in the 5’ region (5’-NRXN1+/-).
To understand how the deletions affect splicing, the authors incorporated long-read and short-read sequencing of the NRXN1 gene on hiPSC-derived cell types. In hiPSC and other human samples, the team identified more than 120 human NRXN1α isoforms that are predicted to be translated. A comparison showed that “hiPSC-neurons modeled well the NRXN1α alternative splicing diversity found in vivo, particularly the high-abundance isoforms,” the authors report.
They showed that patient-derived NRXN1+/- hiPSC-neurons have a >2-fold reduction in wild type NRXN1α isoforms and an increase in novel isoforms from the mutant allele. “Across the two 3’-NRXN1+/- cases, we observed reduced abundance of 50% of the wild-type isoforms,” they note. The authors add that they “further detected 31 mutant NRXN1α isoforms unique to the 3’-NRXN1+/- hiPSC-neurons that resulted from splicing across the three deleted exons not found in controls.”
This alteration of isoform expression may be affecting neuronal maturation and activity. Compared to control, the 5’-NRXN1+/- and 3’-NRXN1+/- hiPSC-neurons had fewer mature neurons and decreased neuronal activity. Interestingly, over-expression of certain wild-type isoforms increased neuronal activity in the 5’-NRXN1+/- hiPSC-neurons, while over-expression of mutant NRXN1α isoforms decreased neuronal activity in control hiPSC-neurons. “Our data supports a model whereby functional deficits in 5’-NRXN1+/- neurons arise from NRXN1 haploinsufficiency and can therefore be rescued by overexpression of wild-type NRXN1α isoforms,” the authors write, “but unexpectedly, haploinsufficiency in 3’-NRXN1+/- neurons is exacerbated by novel dominant-negative activity of mutant splice isoforms, and so cannot be rescued by simply increasing wild-type NRXN1α levels.”
“Our report links patient-specific, heterozygous intragenic deletions in NRXN1 to isoform dysregulation and impaired neuronal maturation and activity in a human and disease-relevant context,” the authors note. “Mutant NRXN1α isoforms may be particularly biologically relevant as our experimental data demonstrated that overexpression of even a single mutant isoform was sufficient to perturb neuronal activity in control neurons.”
Ultimately, the scientists believe their findings from this project could have significant benefits for understanding and potentially treating schizophrenia. “Evaluating how loss and/or gain of specific NRXN1 isoforms impact neuronal fate, maturation and function in a cell-type-specific and activity-dependent manner represents a critical first step towards a more genetics-based form of precision medicine,” they conclude. “Understanding how NRXN1+/− deletions perturb the splice repertoire and alter neuronal function could ultimately improve genetic diagnosis, prognosis and/or lead to new therapeutic targets.”
Bat lovers and animal researchers have been waiting for insights into the evolution and remarkable genetic adaptations of our winged mammalian friends, ever since the global Bat1K initiative announced its quest to decode the genomes of all 1,300 species of bats using SMRT Sequencing and other technologies.
Now, the first six reference-quality genomes have been released on the Hiller Lab Genome Browser, and described in a pre-print by Sonja Vernes (@Sonja_Vernes), Michael Hiller (@hillermich) and Gene Myers (@TheGeneMyers) of the Max Planck Institute, Emma Teeling (@EmmaTeeling1) of University College Dublin, and 26 others.
What did the researchers find? Enough to excite evolutionary biologists, immunologists, and bat enthusiasts alike.
Bat Evolution – New Insights
The phylogeny of Laurasiatheria and, in particular, the position of bats, has been a long-standing, unresolved evolutionary question. Phylogenetic analyses of 12,931 protein coding-genes and 10,857 conserved non-coding elements identified across 48 mammalian genomes helped to resolve bats’ closest extant relatives within Laurasiatheria, supporting a basal position for bats within the clade Scrotifera.
Bats are suspected reservoirs for some of the deadliest viral diseases, including Ebola, SARS (severe acute respiratory syndrome), rabies, and MERS (Middle East respiratory syndrome). Yet they appear to remain asymptomatic and survive these infections. Figuring out why could increase our understanding of immune function and help prevent viral spillovers into humans.
A screen of the six new genomes — Greater horseshoe bat (Rhinolophus ferrumequinum); Egyptian rousette (Rousettus aegyptiacus); Pale spear-nosed bat (Phyllostomus discolor); Velvety free-tailed bat (Molossus molossus); Kuhl’s pipistrelle (Pipistrellus kuhlii); and Greater mouse-eared bat (Myotis myotis) — revealed selection on immunity-related genes which may underlie bats’ unique tolerance of pathogens, as well as several inactivated genes and expansion of APOBEC3 genes, which produce DNA and RNA editing enzymes with roles in lipoprotein regulation and somatic hypermutation.
“Together, genome-wide screens for gene loss and positive selection revealed several genes involved in NF-kB signalling, suggesting that altered NF-kB signalling may contribute to immune related adaptations in bats,” they wrote.
In order to further understand the bats’ viral responses, the researchers screened the genomes to ascertain the number and diversity of endogenous viral elements, considered as ‘molecular fossil’ evidence of ancient infections.
They found a surprising diversity of endogenous retroviruses, with some sequences never previously recorded in mammalian genomes, confirming interactions between bats and complex retroviruses, which endogenize exceptionally rarely.
“These integrations…can help us better predict potential zoonotic spillover events and direct routine viral monitoring in key species and populations,” the authors wrote.
Bats also exhibit extraordinary longevity—they can live up to 10 times longer than expected given their small body size and high metabolic rate. Only 19 mammalian species are known to live proportionately longer than humans given their body size, and 18 of these are bats.
“Bats show few signs of senescence and low to negligible rates of cancer, suggesting they have also evolved unique mechanisms to extend their health spans, rendering them excellent models to study extended mammalian longevity and ageing,” the team writes.
Within the six new genomes, the team found a loss of the oncogenic miR-374 gene, which promotes tumour progression and metastasis in diverse human cancers, and selection for PURB, a gene that plays a role in cell proliferation, regulates the oncogene MYC, and exhibits a unique anti-ageing transcriptomic profile in long-lived Myotis bats.
Studying the genetics of echolocation, vocal learning, and sensory perception in bats could shed light on human blindness, deafness, and speech disorders.
The Bat1K group found two genes expressed in the cochlea and associated with human disorders involving deafness, LRP2 (also called megalin) and SERPINB6, which seemed to be linked to echolocation. There were bat-specific substitutions in both genes; echolocating bats showed a specific asparagine to methionine substitution in LRP2, whereas the non-laryngeal echolocator Rousettus had a threonine at the same position.
For sequencing and bioinformatics buffs, the pre-print also features detailed descriptions of new sequencing pipelines and assembly techniques, including a novel TOGA bioinformatics pipeline.
For each of the six bats, they generated PacBio long reads, 10x Genomics Illumina read clouds, Bionano optical maps, and Hi-C Illumina read pairs. The resulting contigs are ≥355 times more contiguous than the recent Miniopterus assembly generated from short read data, and ≥7 times more contiguous than a previous Rousettus assembly generated from a hybrid of short and long read data.
“This gene annotation completeness of our bats is higher than the Ensembl gene annotations of dog, cat, horse, cow and pig, and is only surpassed by the gene annotations of human and mouse, which have received extensive manual curation of gene models,” they noted.
They annotated between 19,122 and 21,303 coding genes using the PacBio Iso-Seq method of RNA sequencing. They also annotated non-coding RNAs and microRNAs, which can serve as developmental and evolutionary drivers of change, and identified important differences in ncRNAs between bats and other mammals.
This revealed extensive loss of ancestral miRNAs, gains of novel functional miRNA and a striking case of miRNA seed change that alters target specificity, pointing to a possible evolution of regulatory roles in cancer, development, and behaviour in bats.
“This is the first laboratory validation of novel bat microRNA function and highlights how Bat1K genome assemblies can enable the discovery of both non-coding and coding adaptations,” they wrote.
Learn more about the charter and progress of the Bat1K Project in this seminar featuring consortium director Sonja Vernes.
How do pernicious pathogens like Clostridioides difficile spread through hospitals and persist so tenaciously in the human gut, leading to about half a million infections and 30,000 deaths each year?
It’s a mystery scientists have been anxious to solve, and they’ve invested countless hours of research into the bacteria’s physiology, genetics and genomic evolution.
A team from Mount Sinai School of Medicine in New York City has uncovered an important new clue by studying an overlooked aspect of C. difficile’s biology: epigenetics.
Using PacBio SMRT Sequencing and comparative epigenomics, Pedro H. Oliveira (@pholive81), Gang Fang (@iamfanggang), and colleagues mapped and characterized the DNA methylomes of 36 human C. difficile isolates.
As described in a recent Nature Microbiology paper, while they observed substantial epigenomic diversity across C. difficile isolates, they noticed one methyltransferase (MTase) was highly conserved across all of the isolates (and, they later discovered, in another ~300 published C. difficile genomes). This MTase, which they dubbed camA, shared a common methylation motif — CAAAAA, with the last adenine methylated at the N6 position, namely 6mA.
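As a rough illustration of what the motif description means in sequence terms, the sketch below scans a DNA string for CAAAAA on both strands and reports where the final adenine of each occurrence falls. It only locates candidate sites; it does not detect methylation, which the study read out from SMRT Sequencing data. The function names, the 0-based forward-strand coordinate convention, and the toy sequence are assumptions made for the example.

```python
import re
from typing import List, Tuple

MOTIF = "CAAAAA"  # motif from the paper; the last A is the 6mA position


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif_sites(seq: str, motif: str = MOTIF) -> List[Tuple[str, int]]:
    """Return (strand, position) for each motif occurrence.

    position is the 0-based forward-strand coordinate of the base pair
    that holds the final base of the motif (the candidate 6mA site).
    """
    sites = []
    # Forward strand: the final motif base is at start + len(motif) - 1.
    for m in re.finditer(f"(?={motif})", seq):
        sites.append(("+", m.start() + len(motif) - 1))
    # Reverse strand: the motif appears on the forward strand as its reverse
    # complement, and the final motif base pairs with the first matched base.
    for m in re.finditer(f"(?={revcomp(motif)})", seq):
        sites.append(("-", m.start()))
    return sorted(sites, key=lambda site: site[1])


toy_sequence = "GGCAAAAATCCTTTTTGAC"
sites = find_motif_sites(toy_sequence)
print(sites)
```

The toy sequence carries one occurrence on each strand: the forward-strand site puts the candidate adenine at position 7, and the reverse-strand site, seen as TTTTTG on the forward strand, puts it opposite position 11.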
“Despite the small sample size, I got excited wondering if this methylation pattern might be conserved in this critical pathogen and play important roles in regulating its physiology,” Fang wrote in a Behind The Paper feature.
That left the question: how does it work? The Mount Sinai team reached out to other experts in the field, Aimee Shen at Tufts University and Rita Tamayo at the University of North Carolina, to do some in vitro and in vivo studies.
They found that inactivation of the gene encoding this MTase compromises spore formation, a key step in both the transmission of C. difficile and its ability to persist in the intestinal tract.
“Further experimental and integrative transcriptomic analysis suggested that epigenetic regulation by DNA methylation also modulates the cell length, host colonization and biofilm formation of C. difficile,” the authors wrote.
The discovery could have a direct translational impact. The fact that camA is conserved across all of the C. difficile genomes but is present in just a few Clostridiales makes it a promising, highly specific drug target. Furthermore, as the MTase does not seem to impact the general fitness of C. difficile, a drug that specifically targets it might also have a lower chance of resistance.
“These findings provide a unique epigenetic dimension to characterize medically relevant biological processes in this important pathogen,” the authors concluded.
The authors noted that such high-resolution mapping of bacterial DNA-methylation events has only recently become possible with the advent of PacBio’s single-molecule, real-time (SMRT) sequencing.
“This technique enabled the characterization of the first bacterial methylomes and, since then, more than 2,200 (as of September 2019) have been mapped, heralding a new era of bacterial epigenomics,” they added.
Learn more about the methods and workflow for direct detection of epigenetics using PacBio sequencing.
It’s time to revisit the way scientists are using 16S rRNA gene sequencing to study microorganisms, according to a team of Jackson Laboratory researchers.
Popular targets for taxonomy and phylogeny studies because of their highly conserved nature, amplified sequences of the 16S ribosomal RNA genes can be compared with reference databases to determine the identity of the microorganisms that comprise a metagenomic sample. Sequences with a > 95% match are generally considered to represent the same genus, for example, while > 97% matches are considered the same species.
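The identity thresholds above can be sketched in a few lines of Python. This is a simplified illustration that assumes the two sequences are already aligned to equal length; the function names are hypothetical, and real pipelines align reads against reference databases instead of comparing strings position by position.

```python
GENUS_THRESHOLD = 95.0    # "> 95% match" from the text
SPECIES_THRESHOLD = 97.0  # "> 97% match" from the text


def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def rank_from_identity(identity):
    """Map a percent identity to the finest rank the thresholds support."""
    if identity > SPECIES_THRESHOLD:
        return "species"
    if identity > GENUS_THRESHOLD:
        return "genus"
    return "unassigned"


if __name__ == "__main__":
    reference = "A" * 100            # toy sequences, not real 16S data
    query = "C" * 4 + "A" * 96       # four mismatches
    identity = percent_identity(reference, query)
    print(identity, rank_from_identity(identity))
```

Note that both comparisons are strict, so a match of exactly 97% is treated as genus-level here; that boundary choice is an assumption drawn from the ">" signs in the text.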
However, these matches are often made by sequencing only part of the nine-region, ~1,500 bp 16S gene, either single regions like V4 or V6, or spans of adjacent variable regions like V1–V3 or V3–V5, as done in the Human Microbiome Project. In a paper published recently in Nature Communications, Jethro S. Johnson, George M. Weinstock and colleagues point out that it is time to revisit this compromise, which arose only because of past technological limitations. Given recent advances in long-read sequencing accuracy, the entire 16S gene should now be interrogated, the authors suggest.
Circular consensus sequencing (CCS, the method used in PacBio HiFi Sequencing), in particular, combined with sophisticated denoising algorithms, means it is now possible to sequence the entire gene with sufficient accuracy to discriminate among millions of sequence reads that differ by as little as one nucleotide, they write.
“Together, these technological and methodological advances mean that for the first time, it is becoming possible to exploit the full discriminatory potential of 16S in a high-throughput manner,” the authors write.
Using an in-silico dataset of 16S sequences taken from the Human Microbiome Project database, the researchers demonstrated that commonly targeted sub-regions were unable to recapitulate the taxonomic information present in the full 16S gene.
“The V4 region performed worst, with 56% of in-silico amplicons failing to confidently match their sequence of origin at this taxonomic level,” they wrote.
“Our simple in-silico experiment demonstrates that it is not valid to assume that ever finer clustering of these sub-regions will result in the improved taxonomic resolution necessary to reflect species.”
They also found that different sub-regions showed bias in the bacterial taxa they were able to identify at the species level. For example, while V1-V3 gave good results for Escherichia and Shigella, good results for Klebsiella required the V3-V5 region, whereas Clostridium and Staphylococcus required V6-V9 sequencing. Since all of these genera may be present in the human gut, the only way to ensure good taxonomic identification of all species is to sequence the full gene from V1-V9.
However, the team points out that it may be possible to obtain even better taxonomic resolution, down to the strain level. Bacteria have between 1 and 15 copies of the 16S gene. While the number of copies is consistent within a species, the intragenomic variation among the copies is strain specific. The Jackson Lab team believes this intragenomic variation presents an opportunity.
For example, sufficient nucleotide variation exists to distinguish E. coli strain K-12 MG1655 from the infection-causing O157 Sakai strain. The team provides proof of concept evidence for this approach with full-length PacBio 16S sequencing data from 381 isolates selected from the Human Microbiome Project sample bank. They show that the vast majority of these bacteria can be uniquely assigned to a specific strain using the intragenomic 16S variation revealed by PacBio 16S HiFi data.
“Thus, we argue that, when appropriately accounted for, multiple polymorphic 16S copies are not an inconvenience to be overlooked, rather they will enable the 16S gene to be used in strain-level microbiome analysis,” they add.
“Analysis of microbial communities at these taxonomic levels promises to provide a very different perspective to the one afforded by genus-level abundance estimates.”
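One way to picture the strain-level idea is to treat each genome's set of 16S copies as a profile and compare profiles between strains. The Python sketch below does this with exact sequence counts; it is an illustrative simplification with hypothetical function names and toy sequences, not the method used in the paper.

```python
from collections import Counter


def copy_profile(copies):
    """Count how many times each distinct 16S copy sequence occurs."""
    return Counter(seq.upper() for seq in copies)


def distinguishable(copies_a, copies_b):
    """True if two genomes differ in their 16S copy profiles."""
    return copy_profile(copies_a) != copy_profile(copies_b)


if __name__ == "__main__":
    # Toy stand-ins for full-length 16S copies from two strains.
    strain_a = ["ACGT", "ACGT", "ACGA"]
    strain_b = ["ACGT", "ACGA", "ACGA"]
    print(distinguishable(strain_a, strain_b))
```

Both toy strains carry three copies, matching the point that copy number is consistent within a species, yet the mix of variants differs, so the profiles are not equal.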
Learn more about the methods and workflow for PacBio full-length 16S sequencing.
Two recent review articles discuss the idea that structural variants (SVs) — genetic differences that involve at least 50 base pairs — are numerous, important to human biology, and best detected with long reads. The authors review years of studies that have applied PacBio SMRT Sequencing to identify around 20,000 SVs per human genome. The reviews also report on cases in which SMRT Sequencing has helped scientists discover pathogenic variants that explain diseases for which there had previously been no clear genetic cause.
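The 50 bp cutoff is easy to express in code. The following Python sketch classifies a variant from its reference and alternate alleles; it assumes simple insertions and deletions written as full allele strings, and the function name and labels are hypothetical. Real SV callers such as those named in these reviews handle many more variant types.

```python
SV_MIN_BP = 50  # size threshold for a structural variant, from the text


def classify_variant(ref, alt):
    """Label a variant by the length difference between its alleles."""
    delta = len(alt) - len(ref)
    if delta == 0:
        return "substitution"
    kind = "insertion" if delta > 0 else "deletion"
    size = abs(delta)
    size_class = "structural variant" if size >= SV_MIN_BP else "small indel"
    return f"{size_class} ({kind}, {size} bp)"


if __name__ == "__main__":
    print(classify_variant("A", "A" + "T" * 50))
    print(classify_variant("A" + "C" * 49, "A"))
```

The first call is labelled a structural variant (a 50 bp insertion) and the second a small indel (a 49 bp deletion), which shows where the boundary falls.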
In Nature Reviews Genetics, Steve Ho, Alexander Urban, and Ryan Mills from the University of Michigan and Stanford University consider the algorithms and detection platforms that have enabled a wave of new discovery related to SVs. SVs cannot be reliably detected using short reads since many of the variants are significantly longer than those reads. “Because of this, the degree to which contemporary genomics has studied SNVs compared with SVs is significantly skewed,” Ho et al. write. “A recent analysis found that PacBio long reads were approximately three times more sensitive than a short-read ensemble maximized for sensitivity, implying that a large subset of SVs, many 50–2,000 bp in length, are unresolvable without long reads.”
Ho et al. also discuss the software tools that are useful for calling SVs in long reads, including Sniffles and pbsv. They summarize important projects that have used SMRT Sequencing to look for these variants — including Euan Ashley’s publication on Carney complex and Naomichi Matsumoto’s report on a large deletion that causes epilepsy.
The other review appears in Genome Biology, contributed by lead authors Medhat Mahmoud and Nastassia Gobet, senior author Fritz Sedlazeck, and collaborators at the University of Lausanne, Baylor College of Medicine, and other institutions. “Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution,” the authors state. “SVs are increasingly being recognized as an important class of variants, which need to be considered in evolutionary, population, and clinical genomics.”
The team reviews the value of long reads for finding SVs, noting that they “are advantageous for SV calling because they can span repetitive or other problematic regions.” The scientists also walk through the pros and cons of various alignment and SV-calling tools developed for long reads, including NGMLR, minimap2, Sniffles, and pbsv.
SMRT Sequencing provides high precision and recall for SVs in a human genome with just one SMRT Cell on the Sequel II System. By multiplexing two samples per SMRT Cell 8M, the approximate reagent cost is $670 per sample to detect structural variants.
Every year since 2008, The Scientist has canvassed the life-science community to find out which newly released products are having the biggest impact on research. We were proud to have the Sequel System selected as one of the Top 10 Innovations of 2016. And now we’ve been honored again, with the Sequel II System making the Top 10 Innovations of 2019 list.
“Our goal is to identify those products and services that are poised to revolutionize research and advance scientific knowledge,” the editors of The Scientist wrote.
As part of the competition, a carefully selected panel of expert, independent judges was asked to rank the tools, techniques, methodologies, software, and products according to their potential to foster rapid advances or address specific problems in their respective fields.
The Sequel II System was chosen for its ability to generate longer reads with greater accuracy, at greater throughput, and at a significantly lower cost.
“PacBio sets the standard for long-read sequencing and this upgrade of their instrument should have high impact on genomics sciences,” said judge H. Steven Wiley, Senior Research Scientist and Laboratory Fellow at Pacific Northwest National Laboratory.
The launch of the Sequel II System in April represented a significant improvement of our long-read sequencing technology. It contains updated hardware to process the new SMRT Cell 8M, which provides ~8x the DNA sequencing data output, as well as reduced project costs and timelines compared to the prior version of the system. The Sequel II System also delivers highly accurate individual long reads (HiFi reads), which provide Sanger-quality accuracy (>99.9%).
As part of the early access program, customer Evan Eichler declared it the ‘most effective stand-alone technology for de novo assembly,’ and Shawn Levy, a geneticist at the nonprofit HudsonAlpha Institute for Biotechnology, noted that the Sequel II System is also good for analyzing highly repetitive or homologous regions of the genome. Long reads allow researchers to identify structural variants – including translocations in the genome, copy number variants, insertions and deletions – which are complementary to information generated using short read-based approaches.
Our customers have eagerly embraced the new technology and they are enjoying the improved throughput. Over 75 Sequel II Systems have been installed worldwide and they are averaging 160 Gb per SMRT Cell – a ~10-fold increase in yield over the previous Sequel System. The total amount of data generated on the Sequel II Systems this year has already surpassed the data generated on all installed Sequel Systems over the past four years.
The Sequel II System also recently received the Gold Award for the Most Innovative New Product in Genomics – a Life Sciences Industry Award. Since 2002, the Life Science Industry Awards have recognized manufacturers of the “tools of science” that help advance biological research and drug discovery.
We’d like to extend a sincere thanks to everyone who attended our two-day North America User Group Meeting, held this year at our Certified Service Provider, the University of Delaware Sequencing and Genotyping Center (@UD_DNAcore). With representation from 80+ organizations and over 160 attendees, the event was a great environment for sharing best practices and networking with the SMRT Sequencing community. Also, a big thanks to our host, Bruce Kingham (@bkingham) and team, as well as our partners: Agilent, Biosoft Integrators, Circulomics, Covaris, Diagenode, Perkin Elmer, Sage Science and Shoreline Biome. If you weren’t able to attend the meeting, we’ve summed up the highlights below and you can download several of the presentations and view the recordings.
Our CSO Jonas Korlach kicked off the meeting by describing the latest releases and performance metrics for the Sequel II System. The longest reads being generated on this system with the SMRT Cell 8M go beyond 175,000 bases, while maintaining extremely high consensus accuracy. HiFi mode, for example, uses circular consensus sequencing to achieve single-molecule accuracy of Q40 or even Q50. He also talked about the new, user-friendly no-amp protocol and the recent update for low-input samples. Eventually, the goal is to reduce sample requirements to as little as 1 nanogram, he noted; this protocol is currently in development at PacBio.
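For readers less familiar with Q scores, the standard Phred relationship is accuracy = 1 - 10^(-Q/10). A minimal Python sketch of the conversion in both directions follows; the function names are our own.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (below 1) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    for q in (30, 40, 50):
        print(f"Q{q}: {phred_to_accuracy(q):.5%}")
```

By this formula Q40 corresponds to 99.99% accuracy and Q50 to 99.999%, while the 99.9% figure often quoted for Sanger-quality reads corresponds to Q30.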
HiFi and Microbial Sequencing
HiFi sequencing was a central theme for a few talks. Mitchell Vollger (@mrvollger) from the University of Washington spoke about using this approach to study segmental duplications in the human genome. The technique significantly reduced the complexity of accurately mapping these nearly identical sequences throughout the genome; it also reduced the amount of compute power needed compared to a previous PacBio assembly using continuous long reads instead of HiFi reads. Despite generating less data with the HiFi assembly, the team still resolved 30% more segmental duplications with the new approach.
PacBio scientist Meredith Ashby (@AshbyMere) also spoke about the benefits of HiFi reads, focusing her talk on metagenomics and microbiome characterization. She presented several examples of analysis — from full-length 16S sequencing to shotgun sequencing — showing how SMRT Sequencing enables accurate representation of these complex communities, in some cases even without fully assembling genomes. New updates will provide users with a dedicated microbial assembly pipeline, optimized for all classes of bacteria, as well as increased multiplexing on the Sequel II System, now with 48 validated barcoded adapters. That throughput could reduce the cost of microbial analysis to just $70 per sample, she noted.
While we’re on the subject of microbial sequencing, two lightning talks offered nice illustrations of this as well. Masako Nakanishi from the University of Connecticut Health Center presented a study of how the gut microbiome alters an organism’s susceptibility to colonic ulceration; next, she plans to examine cause and effect by evaluating results of fecal transplants in mice. Shawn Polson from the University of Delaware spoke about viral metagenomes, which are more challenging to distinguish than their bacterial counterparts because viruses have no 16S equivalent. With SMRT Sequencing, his team has generated higher-resolution data about viral genomes and aims to use this information as a guide to how these genomes function.
We also saw great presentations with varied applications of Iso-Seq data. Shawn Trojahn (@trojahn_shawn) from Washington State University presented results from transcriptome sequencing of grizzly bears. The analysis focused on differential gene expression during hibernation and active cycles, potentially offering human-relevant information about muscle atrophy and insulin resistance. Thanks to SMRT Sequencing, the team was able to identify more unique isoforms just from liver tissue than had been previously characterized in the entire reference genome. Of particular interest: more than 2,000 transcripts differentially expressed between hibernation and active season, including 86 genes that have isoforms expressed in opposite directions.
From the University of Wisconsin-Madison, Nic Wheeler (@wheeler_worm) spoke about RNA sequencing for filarial nematodes associated with understudied tropical diseases. His team used Iso-Seq analysis to improve gene models and achieve better transcriptome coverage for these worms, which typically have poorly annotated and fragmented genome assemblies. While getting enough RNA to study is a technical challenge, the group still managed to generate full-length isoforms, many of which were novel or contained novel junctions.
Ana Conesa (@anaconesa) from the University of Florida spoke about Iso-Seq analysis tools developed by her group, which created the popular SQANTI tools for Iso-Seq data QC. They’re also working on IsoAnnot to perform functional annotation at isoform resolution; validation has already been done on various species. Currently it’s a set of scripts, but her team is working to produce a more user-friendly version. Finally, tappAS is for functional diversity analysis and for prioritizing genes for validation.
We also had two Iso-Seq-focused lightning talks. Vince Magrini from Nationwide Children’s Hospital spoke about using Iso-Seq analysis as part of a comprehensive profiling strategy for pediatric cancer research; in one example, full-length isoform sequencing provided a clear view of a challenging mutation associated with a drug-targetable pathway. Alexandra Pike (@amimspike) at MIT presented a study of TIN2, a telomere-binding protein, which is mutated in some short telomere syndromes. By pairing the Iso-Seq method with CRISPR, her team revealed a previously uncharacterized TIN2 isoform that may have a functional difference for individuals with these syndromes.
Finally, PacBio scientist Kristin Mars spoke about recent updates, such as the single-day library prep that’s now possible with the Iso-Seq Express workflow. She also noted that one SMRT Cell 8M is sufficient for Iso-Seq experiments; that means whole transcriptome sequencing is feasible for $1,300 per sample, while multiplexed, targeted Iso-Seq analyses can cost as little as $185 per tissue.
Two speakers offered attendees a clear view into potential future clinical use of SMRT Sequencing technology by showing how it’s performing in clinically oriented research labs. Melissa Smith (@lissagoingviral) from the Icahn School of Medicine at Mount Sinai shared results from using more than 1,300 SMRT Cells over the years — most of them for disease-focused research, but also covering microbial sequencing, immune profiling, epigenetics, ecology, and more. Her team has been working with the Sequel II System since January for applications ranging from honing targeted assays for disease-associated genes to performing targeted Iso-Seq for phasing drug targets with severity loci.
They’re also using SMRT Sequencing to detect structural elements — including extremely long and GC-rich repeat expansions — and to characterize diversity in the immunoglobulin loci. Going forward, she added, they aim to use the scIso-Seq method to resolve isoform diversity at the single-cell level.
In a separate talk, LabCorp’s Brian Krueger (@h2so4hurts) discussed the use of SMRT Sequencing for clinical research related to HLA typing, viral genome sequencing, high-throughput variant confirmations to reduce the need for Sanger sequencing, and more.
We also had several speakers focused on technology development or sample prep protocols. Erin Bernberg (@ErinBernberg) from the University of Delaware reported on using the Agilent Femto Pulse for high-resolution, highly sensitive fragment analysis and on the low-input protocol, which her team used for a recent study of ice worms.
Eugenio Daviso from Covaris talked about the use of adaptive focused acoustics for gentle cell lysis and extraction of high molecular weight DNA. Mount Sinai’s Ethan Ellis presented results from the HLS-CATCH method, which involves the use of the SageHLS instrument with CRISPR design methods to target and extract large genomic fragments for sequencing while avoiding pseudogenes and other confounding regions.
In a lightning talk, NEB’s Kelly Zatopek shared data from RADAR-seq, an amplification-free method for detecting and quantifying a wide variety of DNA damage types across a genome. Finally, Shana McDevitt from the California Institute for Quantitative Biosciences shared the core lab perspective as she discussed sample size, purity requirements, and extraction protocols for PacBio sequencing.
We enjoyed connecting with many of our users in Delaware and learning about their latest discoveries and ideas for future research. A huge thanks to our wonderful SMRT Sequencing community for an engaging and exciting meeting!
The most important creatures in a tropical rainforest aren’t necessarily the ones you can see. They work their magic underground, recycling organic matter and processing and transporting vital nutrients for their leafy neighbors above ground.
Microbiologist Joe Taylor wants to learn all about what they are and what they do. And now a grant from PacBio and Maryland Genomics will enable him to reveal some of the secrets of the soil in endangered South-East Asian rainforests.
Taylor, a Career Development Fellow in Microbiomes and Metagenomes at the University of Salford in the United Kingdom, was selected to receive the 2019 Metagenomics SMRT Grant to explore the influence of nutrient concentrations on microbial community composition and nutrient cycling capacity in the soils of an old-world tropical rainforest in Danum Valley, Borneo.
Despite worldwide attention on South-East Asian rainforests due to their charismatic megafauna of orangutans, clouded leopards and elephants, and the threat to the habitat by overlogging, oil palm plantations and climate change, little is known about their smaller, more abundant inhabitants.
“We don’t really have a good understanding of the below ground diversity in South-east Asian rainforests – particularly microbes, but also underground fauna like worms. All these groups have important interactions that will affect rainforest plants,” Taylor said. “Microbes, such as fungi and bacteria, interact with rainforest trees in very important ways. You could lose the majority of mammals and birds in a forest (not that we’d want to) and it would still survive in part. But take away the fungi and bacteria, and there would be no forest.”
Old world rainforests differ from their neotropical counterparts in the Americas, which tend to get much more research attention and investment. One major difference is that the diversity of trees in the rainforests of Borneo is dominated by a single family, the dipterocarps, which form ectomycorrhizas, a symbiotic relationship with fungi that is rare in neotropical rainforests. Unlike the fungi in other mycorrhizal relationships, such as arbuscular mycorrhiza and ericoid mycorrhiza, ectomycorrhizal fungi do not penetrate their host’s cell walls. Instead, they form an intercellular interface of highly branched hyphae that help the plant take up nutrients, including water and minerals, often helping the host plant survive adverse conditions. In exchange, the fungal symbiont is provided with access to carbohydrates.
Phosphorus, a limiting nutrient for plant growth in rainforests, is among the key nutrients being studied by Taylor and his collaborators from the Universities of York, Manchester and Aberdeen in the UK and from Universiti Malaysia Sabah and the Sabah Forestry Department in Malaysia. In the rainforest, phosphorus transfer to trees is mediated by mycorrhizal fungi and bacteria.
Initial fieldwork conducted over several weeks in 2015, 2016 and 2017 has revealed different pools of phosphorus and other nutrients throughout the topographically diverse 50 hectare site. Experimental work carried out by the group has also shown that these different phosphorus pools impact the growth of the arbuscular and ectomycorrhizal trees differently.
In addition to DNA extracted from tree roots, Taylor and his colleagues have collected more than 200 DNA samples from soil. Short read sequencing has already generated a huge amount of taxonomic data that has provided insight into the soil composition, the spatial distribution of the different species, and how they interact. However, delineating the different metabolic pathways involved in the phosphorus cycle has been more challenging.
“It has already revealed a large amount of data on who is there, but we don’t know what they are doing,” Taylor said. “Nutrient concentrations throughout the plot are highly variable and show clear effects on above-ground plant diversity and root-associated fungi, but we do not know to what extent nutrients influence microbial function within the soil.”
They are hoping metagenomic profiling via SMRT Sequencing will provide the functional genomic information they need to elucidate the metabolic dynamics both within and between species.
“Metagenomic sequencing on the PacBio system will empower us to identify many phylogenetically distinct soil prokaryotes and eukaryotes, looking at evolutionary relationships, as well as identify full genes coding for enzymes involved in phosphorus and nitrogen cycling across a low to high phosphorus and nitrogen gradient,” Taylor said.
Taylor said he’s been eager to try PacBio sequencing for a long time.
“There’s an incredible use of PacBio in long-read amplicon sequencing in the microbial work I do,” he said. “For detailed phylogeny and taxonomy, long reads are essential.”
As many of the microbes will likely be “unculturable,” the long, highly accurate HiFi reads will enable the discovery of novel genes from unknown species that would be challenging to characterize with other approaches.
“That’s one of the reasons why this metagenomic data is so important. It will give information on all organisms in the samples, regardless of our ability to culture them,” Taylor said. “The ideal would be to get complete genomes from this metagenomic data. That would be amazing. But so little work has been done that anything we reveal is going to be novel and useful.”
Taylor’s work is taking place in a pristine forest that has been preserved and protected from logging. But, as in all habitats, the area is still threatened by climate change.
“We’re seeing more variable weather patterns,” Taylor said. “When we first visited the rainforests in 2015, it may sound strange, but the ‘rain’ forest was going through a period of ‘drought’ in the area that brought much less water than usual. This continued drought had a significant impact on functional traits of the rainforest trees.”
Research on ectomycorrhizas is increasingly important in ecosystem management and restoration, forestry and agriculture in areas like Borneo, which hosts more than 6,000 species found nowhere else on earth, thanks to its evolutionary isolation after the last ice age.
“PacBio sequencing will fundamentally change our understanding of microbial phylogeny and the capacity of soils to cycle nutrients in South-east Asian rainforests, providing many questions for further study,” Taylor said.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing.
Thank you to our co-sponsor and PacBio certified service provider, Maryland Genomics, for supporting the 2019 Metagenomics SMRT Grant Program.
Traditional RNA-Seq is done by fragmenting cDNA, and then sequencing the fragmented reads with paired-end sequencing. The problem comes when trying to identify the full-length isoform during assembly. This is computationally challenging, and sometimes intractable.
The solution? Long-read isoform sequencing, according to PacBio Principal Scientist Elizabeth Tseng and PacBio user Gloria Sheynkman, a research fellow at Dana-Farber Cancer Institute. The two recently participated in a webinar, sharing their experiences using PacBio’s Iso-Seq method.
Tseng started by explaining the method and some of its applications.
“In contrast to traditional RNA-Seq, the Iso-Seq method produces full-length cDNA, and using the PacBio long, accurate reads, can sequence the full transcript. No assembly is required,” Tseng said.
She also discussed some of the bioinformatics tools available, including SQANTI2, which can be used to classify full-length transcripts against annotations such as GENCODE, and as a quality control tool.
The Iso-Seq method in action
At Dana-Farber, Sheynkman uses the Iso-Seq method to characterize cancer cells and create complete, accurate transcriptomes.
In one example, she ran five breast cancer cell lines and eight melanoma samples, with each sample set prepared as one pooled library and run on its own SMRT Cell 8M, achieving around 6 million polymerase reads, with an average base yield of 300 Gb and an average polymerase read length of ~50 kb. Sequencing was performed by Maryland Genomics, a PacBio certified service provider.
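The three run metrics quoted here are linked by simple arithmetic: base yield is roughly the number of reads multiplied by the mean read length. A quick Python check using the rounded figures from the text (the function name is our own):

```python
def base_yield_gb(n_reads, mean_read_length_bp):
    """Approximate base yield in gigabases from read count and mean length."""
    return n_reads * mean_read_length_bp / 1e9


if __name__ == "__main__":
    # ~6 million polymerase reads at ~50 kb mean polymerase read length
    print(base_yield_gb(6_000_000, 50_000))
```

With those rounded inputs the result is 300 Gb, consistent with the average yield reported.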
With the improvements in the chemistry of the new Sequel II System, Sheynkman said she has been able to capture a much wider range of transcript lengths, without having to do size selection.
“Overall, we’re really detecting a much larger range, with cDNA molecules up to 6 and 7 kb,” she said.
For both breast cancer and melanoma, she obtained about 14,000 unique genes and 11,000 unique isoforms, around 30% of which were novel. And each SMRT Link Iso-Seq job was completed in just 6-9 hours, she noted.
Towards accurate isoform quantification
But the most promising application for Sheynkman is isoform quantification.
“I think this is a really important goal for the field,” she said. “To know how many copies of each transcript is expressed in the cell will really open up a lot of avenues in biomedical research, such as having consistent biomarkers, understanding disease mechanisms, and even just fundamental biological understanding.”
While short-read sequencing can achieve gene quantification with reliable results, the complexity of isoform structure requires more comprehensive coverage.
“Isoform quantification methods are very dependent on having accurate transcript models,” Sheynkman said.
The improved depth of coverage and reduced bias in sampling full-length reads on the new Sequel II System should mitigate these limitations, Sheynkman said. And the high technical reproducibility achieved by PacBio is another big strength, she added.
Targeted or whole transcriptome sequencing?
Sheynkman further discussed cases in which targeted isoform sequencing may be preferred over whole transcriptome sequencing. By targeting only genes of interest, rare isoforms could be identified with low to moderate sequencing effort. Multiplexing could further reduce the cost and increase sample size, which could be useful for applications such as biomarker discovery and validation.
Sheynkman also provided a new solution to a common problem when doing targeted sequencing: probes. ORF Capture-Seq is designed as an easy, versatile option to make capture probes directly from available clones/PCR product within a single day, using low-cost molecular biology reagents. It also allows for the generation of many complex probe sets tailored to different genes, Sheynkman said. The complexity of the probe sets allows you to target anywhere from 1 to 1,000 genes.
The webinar also included a lively Q&A that is well worth the listen.