This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Cotton crops the world over have benefited from the pest-killing protein from Bacillus thuringiensis (Bt), first used in sprays and then, in 1996, transgenic crops, resulting in reduced insecticide use, enhanced biological control, and increased farmer profits. But the precious plants are under threat once again by a tiny but mighty pest: pink bollworm (Pectinophora gossypiella). In India, where more than 7 million farmers have planted 10.8 million hectares of transgenic Bt cotton, the lepidopteran pest has developed resistance to two different forms of the toxin that made the transgenic crops so effective, creating catastrophic economic losses.
Scientists have been studying the genetics of this insecticide resistance in the lab to address the issue, but do their subjects truly represent the realities of rapid resistance evolution in the field?
An international team of scientists from Arizona, India and Australia set out to answer that question, using targeted SMRT Sequencing to compare Bt-resistant lab samples with those collected from cotton fields.
As described in Scientific Reports, they used SMRT Sequencing to analyze barcoded cDNA of 22 larvae from Arizona laboratory-selected strains and Indian field-selected populations of pink bollworms. The PacBio sequencing, conducted by our certified service provider the Arizona Genomics Institute, expanded upon knowledge gleaned from previous allele-specific PCR genotyping and revealed five previously unidentified transcript variants.
The research team, led in the United States by Jeffrey A. Fabrick and Lolita G. Mathew of the U.S. Department of Agriculture’s Arid Land Agricultural Research Center in Maricopa, AZ, was specifically looking for clues to resistance to the Bt toxin Cry2Ab.
Initial Bt cotton crops produced a single toxin from the Cry1 family, Cry1Ac, but at least eight major lepidopteran pests evolved resistance to the toxin in the field, so most Bt crops grown now also produce Cry2Ab. So far, only Indian pink bollworm and Helicoverpa zea in the United States have evolved resistance to Cry2Ab, and data on the genetic basis of Cry2Ab resistance are relatively scarce and limited to laboratory-selected strains.
The researchers found both similarities and differences in the lab- and field-selected strains. They discovered that mutations disrupting the ABC transporter gene PgABCA2 are associated with resistance to Cry2Ab in both groups. But only one specific mutation, mis-splicing that omits exon 6 and introduces a stop codon at amino acid 373, was shared between the Arizona and India strains. Most of the other disruptions came from splice-site mutations that lead to mis-splicing of PgABCA2, and these were more diverse in India.
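For readers wondering why omitting an exon can introduce a premature stop codon, here is a minimal Python sketch. The exon sequences are invented toy strings, not the real PgABCA2 sequence, and the helper names are hypothetical. It only illustrates the general mechanism: when the skipped exon's length is not a multiple of three, the downstream reading frame shifts and a stop codon can appear early.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(exons, skip=()):
    """Join exon sequences in order, omitting any 1-based exon numbers in `skip`."""
    return "".join(seq for number, seq in enumerate(exons, start=1)
                   if number not in skip)

def first_stop_codon(cds):
    """Return the 1-based codon position of the first in-frame stop, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3 + 1
    return None

# Hypothetical toy exons: exon 2 is 4 nt long, so skipping it shifts the frame.
toy_exons = ["ATGGCC", "AAGC", "TGTTGACCGGCTAA"]

normal = splice(toy_exons)             # ATG GCC AAG CTG TTG ACC GGC TAA
skipped = splice(toy_exons, skip={2})  # ATG GCC TGT TGA ...

print(first_stop_codon(normal))   # stop only at the end of the toy transcript
print(first_stop_codon(skipped))  # stop appears earlier after the frameshift
```

In this toy case the normally spliced transcript reads through to its final codon, while the exon-skipped version hits a stop codon halfway through, which is the same kind of truncation the study describes for the exon 6 mis-splicing.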
“The differences in PgABCA2 resistance mutations between India and Arizona could reflect the difference in geographic origin, laboratory versus field selection, or both,” wrote the authors. “The results suggest that focusing on ABCA2 may help to accelerate progress in monitoring and managing field-evolved resistance to Cry2Ab.”
The researchers also noted the benefits of PacBio sequencing for monitoring variants associated with resistance in the field.
Traditional PCR, cloning, and Sanger sequencing methods are laborious and not practical for monitoring the diverse mutations in the field, they wrote, but “this method allowed us to multiplex cDNA samples from 22 individuals and obtain sequencing information from essentially single molecules of full-length PgABCA2 cDNA without post-sequencing assembly.”
“Given that PgABCA2-mediated resistance to Cry2Ab occurred in pink bollworm populations from Arizona and India, long-read sequencing focusing on this gene could provide a valuable alternative to the F1 screen for monitoring resistance to Cry2Ab in this cosmopolitan pest,” they added.
Its reliable return to the same spot year after year has made the barn swallow a beloved symbol of Spring and safe passage, for mariners and landlubbers alike. But our changing climate is altering the birds’ migratory behavior, and Italian ecologists are turning to genetics to figure out how.
As reported previously in this blog, scientists at the University of Milan joined forces with researchers from the University of Pavia and California State Polytechnic University to create the first high-quality reference genome for the European barn swallow (Hirundo rustica rustica), using SMRT Sequencing and newly available Bionano Genomics optical mapping at the Functional Genomics Center in Zurich.
To mark the publication of the work in the journal GigaScience, we spoke with Giulio Formenti, co-first author with Matteo Chiara, about their pioneering use of long-read sequencing in Italy and the technology’s potential to change the field of behavioral ecology.
Why the barn swallow?
The barn swallow is a very important species from a scientific and ecological perspective. It has been a model species in behavioral ecology, with around 1,600 studies published since 1985 about the birds’ migration, reproductive behavior and variability, but there have been very few genetic studies.
In our lab (PI Prof. Nicola Saino), we have been studying this species for a very long time, but we were hindered by a lack of a reference genome. We had been relying on genomes such as that of the flycatcher to design probes, primers and other experiments. The flycatcher is a relatively closely related species, but it has often turned out not to be close enough. It was preventing us from designing probes correctly or, even worse, led us to design probes that seemed OK, but ended up not working, ruining experiments and causing people to waste a lot of time. This is a serious issue, and the reason why it is so important to have top quality reference genomes to work with.
Why did you choose PacBio sequencing?
One of my previous projects concerned Huntington’s disease, which is caused by expanded C-A-G repeats. It was very hard to study these repeats due to limitations of short-read technology, but a few years ago it became possible to do it with long-read technology. So I had already explored the realm of long-read technology and proposed it for this project because I knew it was much better in terms of results. I was also inspired by the preliminary results of genomes generated by the Vertebrate Genomes Project.
When we started drafting the proposal in November 2017, I knew very little about the technical aspects of long reads. But I went to the PacBio UGM (User Group Meeting) in Barcelona, met people from the Functional Genomics Center in Zurich, and started collaborating with them. By January, they had started the sequencing, which took a few months, then we spent a few more weeks assembling the reads. By July we had completed the assembly, aligned, annotated and analyzed the results, and submitted our paper. Now, almost exactly one year later, we have a publication; for this kind of project, I think it’s quite impressive in terms of speed.
“Long reads, and in particular PacBio reads, appear to be the key to 21st century genomics.”
What challenges did you face?
Long-read technologies are not so well-known in Italy. In fact, the very first PacBio Sequel machine in Italy was purchased just a few months ago by the Department of Biology at University of Florence. So we had to send our samples to Zurich, where we also had the opportunity to use the latest optical mapping technology, which had just been released by Bionano in February. This added a whole new element to the study, and resulted in very high level, near chromosome-level scaffolding.
What are the potential impacts of this research?
Barn swallows are famous for settling into a particular spot and returning there after migration for several years. Here in Italy, people wait for the barn swallow to appear as a sign that Spring has started. But this is something that is changing as the climate changes. They are not coming back to the same place. They are now dispersing — partly to avoid inbreeding, but also because they seem to be sensing changes in the ecological conditions at their destination.
We had remarkable findings in a paper we recently published in Scientific Reports, which showed that the birds were able to predict the climate conditions of their destinations several weeks in advance, and that they were changing the timing or location of spring migration accordingly. It’s important to understand what’s going on here, and the availability of a high quality genome sequence will accelerate our efforts to learn how these adaptive processes could help populations respond to changing environmental conditions.
I hope that now that the paper is out, there will be many more people joining such efforts. We desperately need more people in the field who know how to use this technology. Most ecologists working with non-model species are still not incorporating genomics. They tend to look at phenotypes, and are not so familiar with changes at the genomic level. Studies of the effects of radiation from the Fukushima nuclear accident on butterflies, for example, have been based on phenotypic abnormalities, but no genetic evidence. This is one of the fastest developing fields in science, and we need to be able to understand genetic and bioinformatic data in order to deal with new challenges, and those that are already here. This is how modern biology is going to be.
As a scientific community, we have started to realize since the Human Genome Project that we need many genomes in addition to reference genomes, to understand variation across populations. With this project we wanted to generate a resource that we and others could then build upon.
As soon as we got the sequence, we started new projects to study the genetic impacts of environmental changes in swallows from different parts of the world, such as radiation fall-out locations. We are also re-visiting our work on migration phenotypes to study their associations with variations in genes, such as those dealing with circadian rhythm.
Their bodies are big, bony and… warm?
Unique among bony fish, Atlantic, Pacific and Southern bluefin tuna have a rare endothermic physiology that has garnered great interest among scientists. Like birds, mammals and some sharks, these kings of the sea are capable of conserving internally generated metabolic heat produced from their swimming muscles and viscera, and maintaining tissue temperatures above that of the environment.
The fish are also renowned among sushi enthusiasts for their delectable, fat-laden muscle, and prized by fishermen because of the high prices they command.
So the preservation of these species is paramount to many, and researchers are keen to monitor and manage their populations, which have declined precipitously and are now at the lowest spawning biomass levels in recorded history. But progress is being hindered by a lack of knowledge about the evolutionary and genomic processes that have driven the physiological and ecological diversification of the bluefin tunas.
Conservation genomics using SMRT Sequencing could help.
In a recent webinar hosted by Nature, Barbara Block, the Charles and Elizabeth Prothro Professor in Marine Sciences at Stanford University, joined PacBio scientist Paul Peluso to describe a project to protect Pacific and Atlantic bluefin tuna by assembling their genomes and transcriptomes.
At the Monterey Tuna Research and Conservation Center, one of the world’s only captive bluefin centers, Block and colleagues are studying the physiology, energetics, hydrodynamics and transcriptomics of the fish. But tracking the activity of the fish in their natural habitat is also vital.
Among the questions they want to answer: How do these animals adapt to their ocean realms, and what is it about the bluefins that makes them different from all other tunas in their clade? What limits their performance in a warming world? How will they adapt to hypoxia, increased CO2 and ocean acidity?
“We’re interested in monitoring their genes and transcriptomes to help us understand the health of these tunas in an ocean, but that’s not easy,” Block said. “It’s not easy because the ocean is not transparent. When tunas slip beneath the surface, it becomes hard to follow them and to monitor their populations, their transcriptomics, their genomics, and where it is they go.”
Block said her lab uses a “fish and chips” approach. “We put computers on these animals that record their journeys beneath the sea along with the environmental conditions surrounding them,” she said. “By mapping the tunas on the globe, we are able to show visually, and spatially, how these animals use our planet.”
They’ve discovered that the fish travel far, able to go from Iceland to the Gulf of Mexico, or cross from North America to the Mediterranean, in just a few months. It is not so easy to tell populations apart, but genetics has helped.
As Peluso explained, the team generated approximately 118 Gb of sequence from just under 7 million reads for the Atlantic tuna (Thunnus thynnus), and 15 million reads, yielding just over 208 Gb of sequencing for the Pacific bluefin (Thunnus orientalis). Using FALCON-Unzip, they resolved haplotypes and identified structural variants along diploid assemblies of 1.6 Gb and 1.24 Gb, respectively.
Compared with an existing Pacific tuna (T. orientalis) genome from Japan assembled with short-read technology, the new PacBio assemblies contained far fewer fragments: around 2,000 contigs, compared to 16,802 for the Japanese assembly.
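As a rough back-of-the-envelope check on these figures, the short Python sketch below derives approximate mean read lengths and the fold reduction in contig count. It uses the rounded numbers quoted in the talk ("just under 7 million" reads is treated as 7 million, and "around 2,000" contigs as 2,000), so the results are estimates only. The function names are hypothetical.

```python
def mean_read_length_kb(total_gb, reads_millions):
    """Approximate mean read length in kb from total yield and read count."""
    total_bases = total_gb * 1e9
    read_count = reads_millions * 1e6
    return total_bases / read_count / 1e3

def fold_reduction(old_contigs, new_contigs):
    """How many times fewer contigs the new assembly has."""
    return old_contigs / new_contigs

# Rounded figures quoted above (assumed exact for the purpose of the estimate).
atlantic_kb = mean_read_length_kb(total_gb=118, reads_millions=7)
pacific_kb = mean_read_length_kb(total_gb=208, reads_millions=15)
contig_fold = fold_reduction(old_contigs=16802, new_contigs=2000)

print(f"Atlantic bluefin mean read length: ~{atlantic_kb:.1f} kb")
print(f"Pacific bluefin mean read length:  ~{pacific_kb:.1f} kb")
print(f"Contig count reduced ~{contig_fold:.1f}-fold")
```

Under these rounding assumptions, the reads average well over 10 kb for both species, and the PacBio assemblies have roughly eight times fewer contigs than the short-read assembly.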
“It helped us to identify some genomic differences between these two species, as well as to develop a set of probes, or markers, that could be used to profile these species in a population scale across the globe,” Peluso said.
Further study could involve deeper dives into the assemblies to compare structural variants with gene models, such as correlations between the presence or absence of genes, as well as downstream implications of enhancer or promoter regions on gene expression.
“Having highly contiguous assemblies will help address these questions,” Peluso said.
Last month’s annual meeting of the American Society of Human Genetics in San Diego was a terrific reminder of how much progress is being made in this field — both in our basic understanding of human biology and in our ability to rapidly translate discoveries into clinical utility.
The PacBio team had the privilege of hosting an educational workshop about the value of long-read SMRT Sequencing for human genetic applications. Customers from Mount Sinai and Stanford University offered their perspectives, while PacBio scientists presented data and the technology roadmap. Here, we recap the highlights and provide recordings for anyone who could not attend.
From the Icahn School of Medicine at Mount Sinai, Assistant Professor Stuart Scott gave a talk about using the PacBio system for amplicon sequencing in pharmacogenomics and clinical genomics workflows. Accurate, phased amplicons for the CYP2D6 gene, for example, have allowed his team to reclassify up to 20% of samples, providing data that’s critical for drug metabolism and dosing. In clinical genomics, Scott presented several case studies illustrating the utility of highly accurate, long-read sequencing for assessing copy number variants and for confirming a suspected medical diagnosis in rare disease patients. He noted that the latest Sequel System chemistry improved throughput, read length, and the error profile, and increased the capacity for multiplexing.
Watch Stuart Scott’s presentation
In a separate talk, Janet Song from Stanford School of Medicine spoke about resolving a tandem repeat array implicated in bipolar disorder and schizophrenia. These psychiatric diseases share a number of associated genomic regions, she noted, but scientists continue to search for a specific causal risk variant in the CACNA1C gene suggested by previous genome-wide association studies. SMRT Sequencing of this region in 16 individuals identified a series of 30-mer repeats, containing a total of about 50 variants. Analysis showed that 10 variants were linked to protective or risk haplotypes. Song said she hopes to study the function of these variants in mouse models or human brain organoid models in the future.
Watch Janet Song’s presentation
Our Principal Scientist Elizabeth Tseng (@Magdoll) showed how the Iso-Seq method can be used to discover disease-associated alternative splicing. This approach to isoform sequencing yields accurate, full-length transcripts requiring no assembly, and is therefore ideal for disease studies that need a more comprehensive picture of alternative splicing activity. Tseng offered several published examples of how the Iso-Seq method has been used for everything from single-gene studies to whole-transcriptome studies, and also detailed how the latest Sequel System chemistry recovers more genes and produces more usable reads.
Watch Elizabeth Tseng’s presentation
Finally, our CSO Jonas Korlach walked attendees through recent product updates and the coming technology roadmap. The Sequel System 6.0 release offered major improvements to accuracy, throughput, structural variant calling, and large-insert libraries, he said, showing examples of 35 kb libraries. Looking ahead, Korlach said that the V2 express library preparation product should be available early in 2019, with the new 8M SMRT Cell being introduced sometime later.
Watch Jonas Korlach’s presentation
In addition to the workshop, we also presented several posters during the event:
- A Simple Segue from Sanger to High-throughput SMRT Sequencing with an M13 Barcoding System – Lori Aro, PacBio
- FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733) – Sarah Kingan, PacBio
- No-amp Targeted SMRT Sequencing using a CRISPR-Cas9 Enrichment Method – Jenny Ekholm, PacBio
- Joint Calling and PacBio SMRT Sequencing for Indel and Structural Variant Detection in Populations – Aaron Wenger, PacBio
We’d like to thank our speakers and all the ASHG scientists who took time out of a busy conference to attend our workshop and stop by our posters. Stay tuned for further coverage of the event.
The new reference genome for Aedes aegypti, just published in Nature, famously got its start through a crowdsourced effort on social media, beginning with a tweet from Rockefeller University scientist Leslie Vosshall pleading for a better mosquito resource. The insect expert has been studying mosquitoes since 2008 but for most of that time did not have access to a high-quality, highly contiguous assembly.
We chatted with her to learn more about mosquitoes, what’s possible with the new reference genome, and how this new assembly has changed the landscape for understanding mosquito biology and its implications in viral transmission.
What made the mosquito genome so challenging to sequence?
It was sequenced 10 years ago but the technology then made it impossible to piece together. It’s extremely repetitive. I like to think of it as a series of blah, blah, blah: many copies of blah, blah, blah, and you cannot figure out where it fits in the overall sequence of the genome. What was available in the decade-old genome was thousands and thousands of little pieces. That made it impossible to make any progress in studying the mosquito.
How does SMRT Sequencing fit into the story?
The only way we were able to piece this together is because PacBio [sequencing] allowed us to get really long reads that would bridge all the blah, blah, blah and be able to link the whole thing together.
Now that you have this genome, what’s possible?
Now we know how many genes there are in this deadly insect, and we know where they are on chromosomes. That enables everything that comes after. Until you know where the genes are and how many there are, you can’t figure out where insecticide resistance lies. Now with this new genome we can go in with great precision and find the genes. Then you can try to understand how those animals are becoming resistant and develop insecticides that overcome that resistance.
Another example is that viruses like dengue can replicate in some mosquito strains but not others. There are some strains that are resistant to dengue, and that’s a really cool thing to try to figure out. Again, people had a vague idea that there were resistance genes somewhere and now we can really understand what makes some mosquitoes susceptible to dengue and what makes others resistant.
What spurred you to find a way to develop a new genome assembly?
I came from the field of Drosophila. The fly genome was sequenced in 2000 and it’s an incredible work of art. When I started working on the mosquito I thought, are you kidding me? I thought surely someone was going to do something about this genome. Eventually Ben Matthews and I realized nobody is dealing with this, so we pulled together this huge group and took care of it. None of us had reserves of money to put together a new genome project, so it was amazing that PacBio and other corporate sponsors and academics pulled together to get this done.
Did you ever imagine this kind of project could be launched by a tweet?
I still can’t believe we did it. It was so unlikely — I actually know nothing about genome sequencing. The genome has been out there for the last year and a half, and people have already gotten enormous use out of it. It’s really gratifying.
What’s next for your team, and how do you hope to use this new reference genome?
The genome is powering every single project in my lab. We study how the mosquito hunts people. The genome is seeping into everything — it’s helping us identify genes that allow mosquitoes to smell people, knock the genes out, and develop genetic tools. It’s so inspiring to be able to do all these things we couldn’t do before the genome came online.
Many thanks to all the PacBio users who attended our annual user group meeting, hosted in St. Louis for the first time. It was great to see so many people sharing best practices and project ideas. If you couldn’t attend, this recap will give you a sense of the highlights from the two-day meeting on Wash U’s campus and the exciting networking event at the City Museum. You can also download several of the presentations and view video recordings.
Tina Graves-Lindsay from the McDonnell Genome Institute and the Genome Reference Consortium spoke about the importance of phasing human reference genomes. Her team is now working on its fifteenth human genome assembly — part of a major effort to improve genomic representation of ethnic diversity — with a pipeline that generates 60-fold PacBio coverage for a de novo assembly, followed by scaffolding with 10x Genomics or Bionano Genomics technology. They are also using FALCON-Unzip to separate haplotypes, leading to reference-grade diploid assemblies. This approach has already helped resolve errors seen in other genomes and even the gold-standard GRCh38 build of the human reference genome.
Human disease studies were the focus for two of our plenary speakers. Thomas Ray from Duke University spoke about the molecular foundation for wiring the nervous system and studies incorporating the Iso-Seq method to generate an accurate catalog of isoforms — including some that had never been seen before — that may explain how a given gene is able to control distinct neurodevelopmental functions. The project he shared centered on retinal development, covering dystrophies and other conditions that can cause blindness.
Nenad Svrzikapa from Wave Life Sciences spoke about the importance of being able to determine the haplotype of long reads and the application of SMRT Sequencing in Wave’s PRECISION-HD1 and PRECISION-HD2 clinical trials for the treatment of Huntington’s disease. Huntington’s disease (HD) is a devastating autosomal dominant disorder characterized by cognitive decline, psychiatric illness and chorea. HD patients carry both a disease-causing mutant allele (mHTT) of the huntingtin gene (HTT) with an aberrant CAG repeat expansion and a functional wild-type allele (wtHTT). Svrzikapa highlighted that Wave’s allele selective technology enables the specific targeting of the mHTT mRNA by targeting two single nucleotide polymorphisms (SNPs) distal to the CAG repeat region. Wave used SMRT technology to bridge this distance and developed an investigational assay for phasing the SNPs with the CAG repeat. Svrzikapa showed the results of Wave’s observational study, which is the first to prospectively identify the frequency of these SNPs in patients with HD, opening the possibility that these patients may be candidates for SNP-targeted therapies such as those being developed by Wave in the PRECISION-HD1 and PRECISION-HD2 clinical trials.
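To make the idea of phasing a SNP with the CAG repeat more concrete, here is a minimal Python sketch. It is not Wave's investigational assay. The reads are invented toy strings, the repeat threshold is an arbitrary illustrative value, and the function names are hypothetical. The point is only that when a single long read spans both the repeat and the SNP, the two can be tallied together without any statistical inference.

```python
import re
from collections import Counter

def longest_cag_run(read):
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", read)
    return max((len(run) // 3 for run in runs), default=0)

def phase_snp_with_repeat(reads, snp_offset_from_end, expanded_min_repeats):
    """Tally the SNP base seen on reads with an expanded vs. normal repeat."""
    tally = {"expanded": Counter(), "normal": Counter()}
    for read in reads:
        is_expanded = longest_cag_run(read) >= expanded_min_repeats
        label = "expanded" if is_expanded else "normal"
        tally[label][read[-snp_offset_from_end]] += 1
    return tally

def toy_read(cag_units, snp_base):
    """Build a hypothetical read: flank, CAG repeat, spacer, SNP base, flank."""
    return "TT" + "CAG" * cag_units + "GGATCC" + snp_base + "AA"

# Toy data: the long repeat travels with T, the short repeat with C.
reads = [toy_read(8, "T"), toy_read(8, "T"), toy_read(3, "C"), toy_read(3, "C")]

# Threshold of 6 units is purely illustrative for these toy reads.
tally = phase_snp_with_repeat(reads, snp_offset_from_end=3, expanded_min_repeats=6)
print(tally)
```

In this toy example every read with the longer repeat carries a T at the SNP position, so an allele-selective approach would target T to reach the expanded allele.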
Moving to the animal world, Tim Smith (@tplsmith) from the USDA’s Agricultural Research Service spoke about efforts to generate reference-grade genome assemblies for various bovine species and analyze them to understand factors such as how selective breeding has affected certain breeds. Genome assemblies he cited spanned cattle, water buffalo, and gaur. Smith showed data for each assembly, noting that as data production shifted to the Sequel System, long-read PacBio data became even better at producing highly contiguous assemblies. He shared that one of the most recent, of the Yaklander cattle interspecies, is now the best bovine assembly ever produced. Smith attributed some of the assembly quality to help from the NHGRI’s Sergey Koren (@sergekoren), who also spoke at the meeting. Koren shared his TrioBinning tool, which is useful for resolving haplotypes in virtually any species. It works well even at low coverage but does require a trio of genomes for analysis.
Insects were represented as well in talks from Evgeny Zakharov of the Canadian Centre for DNA Barcoding and Marcé Lorenzen at North Carolina State University. Zakharov focused on the analysis of vertebrate-feeding arthropods to shed light on a region’s broader biodiversity. These complex samples can be interpreted by sequencing biological barcodes, work for which he and his team implemented the Sequel System two years ago to replace Sanger sequencing. Result concordance between the two technologies was excellent, but SMRT Sequencing is much higher-throughput and lower-cost, Zakharov said. That’s why he plans to use this platform for the next phase of this study, which will involve looking at 1.5 million species. In a lightning talk, Lorenzen talked about finding upstream promoters in non-model insects, a project for which she’s using low-coverage SMRT Sequencing. This work is aimed at using molecular genetics to make these pests “less pesky,” she told attendees.
Microbial genomes were also on display at the user group meeting. Garth Ehrlich from Drexel University spoke about developing a microbiome assay that uses SMRT Sequencing to provide high-quality coverage of the 16S bacterial rRNA for species identification. The goal is a test that would enable de novo identification, no a priori knowledge required. Ehrlich showed how well his assay performs in a number of mock microbial samples, as well as in a case-control study of samples from lung cancer patients and healthy people.
Two lightning talks also fell into the microbial category. Jonathan Jacobs (@jmjacobs2) from The Ohio State University focused on plant pathogenic bacteria, which are characterized by difficult genomic repeats. Elucidating these transcription activator-like effectors can offer clues about virulence and DNA binding patterns, but long-read PacBio sequencing is needed to resolve the differences in these repeats. Microbial multiplexing makes the process quick and affordable. In the other lightning talk, Ben Auch (@sciberius) from the University of Minnesota, a PacBio certified service provider, also spoke about microbial multiplexing, which his lab performs to analyze both Gram-positive and Gram-negative species. His approach generates about 9 Gb per SMRT Cell and routinely produces single-contig assemblies at a total cost of just $250 for a microbe with a genome of 5 Mb.
Two other lightning talk speakers offered application-driven presentations on the Iso-Seq method and amplicon sequencing. Katy Munson (@zhaneel779) from the University of Washington offered pro tips for providing the Iso-Seq method as a service, walking through her recommendations for minimum RNA requirements (5 µg), RIN scores (7.5 or better), and other best practices. Her presentation also showed helpful examples of results from degraded samples. In a separate talk, Dave Corney from GENEWIZ, a PacBio certified service provider, focused on the use of PacBio short and long amplicon sequencing for diverse applications. With the quality of SMRT Sequencing, he predicted that it would be the ability to amplify an amplicon that would become the bottleneck, rather than the sequencing part of the workflow. He also said that high single-molecule fidelity would pave the way for novel applications and ultimately even clinical diagnostics.
The user group meeting concluded with a presentation from our CSO Jonas Korlach about future developments in SMRT Sequencing. He spoke about the upcoming Sequel System 6.0 release of new chemistry, SMRT Cells, and analysis tools that will improve phased de novo assemblies, overall accuracy, and variant detection. He also showed early data from a prototype of the SMRT Cell 8M, which has eight times the capacity of our current SMRT Cell 1M and is scheduled to be released in 2019.
We’d like to thank our hosts for the meeting, the McDonnell Genome Institute at Washington University School of Medicine, as well as our partners: Advanced Analytical Technologies, Diagenode, DNAnexus, Integrated DNA Technologies, PerkinElmer, Phase Genomics, and Sage Science.
When scientists want to investigate human-specific evolution, the best place to start is often with a comparison to our closest cousins, the great apes. Some recent high-quality PacBio genome assemblies have provided solid new foundations for these projects, but gene annotation has proven challenging, particularly for segmental duplications — sets of gene families duplicated in the human lineage relative to our last common ancestor with the chimpanzee. Could these photocopied gene families be involved in human-specific traits like the development of a larger frontal cortex?
Until now, technical limitations have stood in the way of answering that question. Two common methods to quantify mRNA abundance, the expression microarray and short-read RNA sequencing, are not very useful when comparing paralogs that diverged so recently. Many of the human-specific segmental duplications are more than 98% identical on the genomic level.
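A simple probability model helps show why read length matters here. The Python sketch below assumes, as a deliberate simplification, that differences between two paralogs are spread independently and uniformly along the sequence. It then estimates the chance that a read of a given length covers no distinguishing position at all. The read lengths (100 bp and 2,000 bp) and the 99.5% identity value are illustrative assumptions, not figures from the study.

```python
def p_indistinguishable(identity, read_length):
    """Chance that a read spans no position where the two paralogs differ,
    assuming differences occur independently at each base."""
    return identity ** read_length

# Illustrative scenarios: (label, fractional identity, read length in bp).
scenarios = [
    ("98% identity, 100 bp short read", 0.98, 100),
    ("99.5% identity, 100 bp short read", 0.995, 100),
    ("99.5% identity, 2,000 bp full-length read", 0.995, 2000),
]

for label, identity, length in scenarios:
    chance = p_indistinguishable(identity, length)
    print(f"{label}: {chance:.4%} chance of being ambiguous")
```

Under this toy model, a sizeable share of short reads cannot be assigned to one paralog, and the share grows as identity rises above 98%, while a read spanning a full-length transcript almost always covers at least one distinguishing site. Real paralog differences are clustered rather than uniform, so the true numbers will differ.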
Additionally, what may appear like an exact copy of a gene is often not so simple. In humans, segmental duplications can copy-paste in a genomic context to keep all of the regulatory information, effectively doubling up on that gene’s dose. But segments can also copy in a manner that loosens the selective pressure on one copy, allowing mutations to accumulate and even relegating one copy to the “lost function” or pseudogene category. Duplications can even place the new copy in a different regulatory landscape or adjacent to a neighboring gene, allowing natural gene fusion events to occur.
While a handful of human-specific duplicate genes have seen careful mRNA characterization to distinguish the expressed paralogs, the fate of many of these genes remains unknown. Since automated annotations cannot be relied upon in these highly identical regions, a recent study published in Genome Research by Dougherty and Underwood et al. took on the technical hurdles of characterizing mRNAs with isoform-level resolution for the human-specific duplicate genes.
Those hurdles were overcome largely with the PacBio Iso-Seq method, a long-read sequencing method that reads full-length isoforms. RNA from adult and developing human brain tissue was used as starting material for a modified Iso-Seq method that incorporates barcodes at both ends of the cDNA molecules. The brain cDNAs of interest were enriched using hybridization-capture techniques with probes designed against the exons of duplicate genes. This meant that isoform information for each locus could be effectively purified in cDNA form prior to sequencing.
Eight of the 19 gene families showed a nearly identical photocopy of the original gene, while the others showed patterns of gene truncation or fusion to a neighboring gene. Most of these latter cases represent new gene innovations that appear to be present only in humans.
One interesting case highlighted by the study is CD8B and its paralog, CD8B2. While the CD8B2 paralog used to be considered a pseudogene, the new isoform data indicate that the protein open reading frame is intact, with just a few amino acid changes relative to CD8B.
With better annotations in hand, the researchers went back and queried a large RNA-seq data set called GTEx to see which tissues might express these newly discovered duplicate gene isoforms. Surprisingly, most of the reads that were uniquely assignable to the CD8B2 paralog were found in brain tissue, not, like CD8B, in the blood. The scientists deduced that the segmental duplication event that created CD8B2 did not bring along the regulatory information from CD8B that drives its expression in the blood; instead, it landed in a spot with mild transcriptional activity in the brain, resulting in a complete ORF encoding mRNA for CD8B2 that is expressed in the cortex.
With this modified Iso-Seq method, scientists who know just a little about a gene can still find out a great deal about its expressed isoforms. Along with other recent capture methods, this should be broadly applicable to those interested in studying extremely close paralogs or haplotype-specific isoforms that are difficult to distinguish using short read sequences alone.
We’re pleased to announce the winner of the 2018 Microbial Genomics SMRT Grant. Mark Webber, Research Leader at Quadram Institute Bioscience in the UK, will get free SMRT Sequencing and analysis from our certified service provider, the Genomics Resource Center at the University of Maryland. His goal is to further a project designed to understand how bacteria on the skin of premature babies in neonatal intensive care units acquire resistance to the antiseptics used to prevent infections. We spoke with Mark to learn more about his work and how the SMRT Grant will make a difference.
Q: What’s your research focus?
A: We’re interested in how bacteria deal with stress: how do bugs become resistant to drugs? We’re particularly interested in Staphylococci and how they deal with the antiseptics that we use in hospitals. We’ve looked at patients in intensive care in the UK, examining isolates over time to see how susceptible they were to two antiseptics in a large teaching hospital that had changed its antiseptic use. What we could see was that as you used more and more chlorhexidine, the bugs were more and more tolerant. When they introduced another antiseptic, octenidine, a population quickly emerged that was less tolerant of octenidine.
Q: What inspired the proposal you submitted for this SMRT Grant?
A: We have been studying premature babies in neonatal intensive care. Every year about 450,000 premature babies are born in the US and 60,000 are born in the UK. About 15,000 of these will suffer from late-onset sepsis, mainly caused by Staphylococci infections. These babies are often very immune-suppressed and the risk of death from sepsis is quite high. The babies almost all have peripheral catheters to allow feeding and drugs to be administered, but these are a potential route for bugs to get into their blood. Therefore, to prevent infection from bugs living on the skin, there’s a lot of antiseptic use. We wanted to see how antiseptic tolerance might be developing amongst the bugs living on the skin of these premature babies.
Q: What have you learned so far about this issue?
A: At hospitals in the UK and with our collaborators in Germany, we collected isolates each week from hundreds of babies over a three-month period. We now have a total of 1,300 isolates of Staphylococci, which we have tested for their susceptibility to chlorhexidine, the antiseptic used in the UK, and octenidine, the antiseptic used in Germany. We have seen that babies in intensive care appear to pick up Staphylococci with high tolerance to antiseptics very quickly, and our UK population is particularly robust in dealing with chlorhexidine, which we use.
Q: How will you use the sequencing capacity from your SMRT Grant?
A: After studying all 1,300 isolates, we want to understand how related these bacteria are and take about 20 representatives of the major branches of the Staphylococcal family tree and get really high-quality, full genome sequences. With the PacBio long reads, we will be able to see whether there are common mobile elements that explain the acquisition of tolerance. We can also look for duplications, rearrangements, and methylation patterns which may be responsible for antiseptic adaptation. PacBio sequencing will give us a higher quality picture of what’s going on than other technologies, and this SMRT Grant will let us capture most of the major branches of the phylogeny that we’re hoping to see.
Q: What do you hope to learn when you’ve had a chance to analyze these new reference genomes?
A: Doing the bioinformatics analysis will not take us that long — it should be a matter of a few days to get a pretty decent picture of what differs between antiseptic tolerant and sensitive strains. We hope to understand the mechanisms of resistance that we can then compare to isolates collected in other parts of the world. That will help us determine whether antiseptic use is likely to fail, whether different antiseptics are more or less likely to select for tolerance and whether antiseptic tolerance is linked to antibiotic resistance. Together this information should help us understand whether we need to change the way we use antiseptics to keep babies safe.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for another chance to win SMRT Sequencing. Thank you to our co-sponsor, the University of Maryland’s Genomics Resource Center, for supporting the Microbial Genomics SMRT Grant Program!
It’s one of the most ambitious sequencing projects ever attempted: sequencing and assembling the genomes of all 1.5 million known species of animals, plants, protozoa and fungi on Earth. SMRT Sequencing will play a major part.
A greater understanding of Earth’s biodiversity and the responsible stewarding of its resources are among the most crucial scientific and social challenges of the new millennium, and overcoming these challenges requires new scientific knowledge of evolution and interactions among millions of the planet’s organisms, said Earth BioGenome Project Chair Professor Harris Lewin of the University of California, Davis.
“The Earth BioGenome Project can be viewed as the infrastructure for new biology,” Lewin added. “Having the roadmap, the blueprints for all living species of eukaryotes will be a tremendous resource for new discoveries, understanding the rules of life, how evolution works, new approaches for the conservation of rare and endangered species, and provide new resources for people in agricultural and medical fields.”
Headed by the Wellcome Sanger Institute, the Darwin Tree of Life Project will explore the genetic code of 66,000 species in the UK. The Natural History Museum in London, Royal Botanic Gardens, Kew, Earlham Institute, Edinburgh Genomics, University of Edinburgh, EMBL-EBI and others will collaborate in sample collection, DNA sequencing, assembling and annotating genomes, and storing the data.
SMRT Sequencing was previously used to decode the genomes of 25 UK species for the first time in a project to mark the 25th anniversary of the Wellcome Sanger Institute. The insights gained from the 25 Genomes Project form the basis for scaling up to sequence the genomes of 66,000 species.
“We are honored to be an integral part of the Darwin Tree of Life Project as it deploys the power of our sequencing technology on a much broader scale,” said our CSO, Jonas Korlach. “With the recent and ongoing improvements in our technology, we are well positioned to support the needs for scaling the sequencing and assembling of the genomes for the large number of species targeted by this project as well as the Earth BioGenome Project.”
Genome-wide association studies (GWAS) may be powerful tools for the identification of genes underlying complex traits, but what if you have an incredibly complex, uncharacterized genome, with no sequenced progenitor or related species?
A team of scientists from the Chinese Academy of Agricultural Sciences in Changsha, China came up with a solution: a transcriptome-referenced association study (TRAS), powered by our Iso-Seq method.
The approach, outlined in this DNA Research paper, utilized a transcriptome generated by SMRT Sequencing as a reference to score population variation at both transcript sequence and expression levels. The team, led by Touming Liu and first author Xiaojun Chen, used the approach to study the shape of garlic cloves.
Cultivated globally for more than 5,000 years as a vegetable, spice, and medicinal plant, garlic (Allium sativum L.) is a diploid species with a giant genome: ~15.9 Gb, 32 times larger than that of rice. The most widely consumed part of the plant, the bulb, consists of several cloves that are actually abnormal axillary buds rarely found among vascular plants. The shape of these cloves is an economically important quantitative trait, but its genetic mechanisms are poorly understood.
Plant quantitative traits are typically controlled by several major and minor effect genes that constitute complex regulatory networks, and characterization of these traits is time-consuming and labor-intensive when using traditional mapping methods that involve the identification and cloning of dozens of trait-control genes. Previous studies that conducted de novo assembly of the garlic transcriptome were able to produce more than 120,000 transcripts, but many were considered incomplete, with an average length of less than 600 bp, of which only 35–42% were functionally annotated.
So, Liu and colleagues collected bulb samples from 92 landraces in China and 10 from other countries, and selected one candidate from China for Iso-Seq long-read RNA transcript sequencing. From this, they created a high-quality reference transcriptome that consisted of 36,321 transcripts of lengths ranging from 120 to 4,803 bp, accounting for 54.48 million bases in total.
The Iso-Seq method “significantly improved the transcriptome quality—the mean length of the transcripts was 1,500 bp; more than 70% of the transcripts had a complete 3′ end; and only less than 1% of the transcripts remained functionally unannotated,” the authors wrote.
To characterize the genotypes of the rest of the 102 landraces in both sequence and expression, they sequenced the transcriptomes of developing bulbs in the population. The reads were aligned to the reference transcriptome, and the variation in both sequence (SNPs) and gene expression (GE) of transcripts was scored.
The team ultimately identified 22 candidate transcripts, most of which showed extensive interactions. Eight transcripts were long non-coding RNAs (lncRNAs), and the others encoded proteins involved mainly in carbohydrate metabolism and protein degradation. These findings can provide a basis for improving clove shape traits in garlic breeding, as well as validate the TRAS approach, the authors said.
“Our results demonstrate that TRAS is a useful approach for association studies, and its independence from a reference genome will extend the applicability of association studies to a broad range of species,” they wrote. TRAS also offered additional advantages in comparison with the GWAS approach, the team noted.
It can directly detect candidate transcripts for a trait by integrating sequence data with expression data, in contrast to GWAS, which identifies only a genome region in which markers are in linkage disequilibrium for the loci controlling the trait. Also, unlike GWAS, after identifying a genome region based on the sequence variation, TRAS uses the information on transcript expression in the identified region to determine whether or not the corresponding transcript is associated with a given trait. And TRAS can detect potential interaction of transcripts by eQTL analysis, and the potential relationship among the transcripts is helpful for further validation of these interactions.
It’s a murder mystery of massive proportion, albeit on a miniature scale: Male-killing among several species of insects, caused by selfish symbiotic bacteria.
Swiss researchers believe they have finally solved a question that has stumped scientists for decades, with potential implications for pest and infection control.
In a recent Nature publication, Toshiyuki Harumoto and Bruno Lemaitre of the Global Health Institute at the École Polytechnique Fédérale de Lausanne (EPFL) in Lausanne, Switzerland, have reported their findings regarding a toxin in Spiroplasma poulsonii, one of several types of symbiotic bacteria that manipulate host reproduction to spread in a population by distorting host sex ratios.
A notable feature of S. poulsonii is male killing in Drosophila, whereby the sons of infected female hosts are selectively killed during development. Male killing has also independently evolved in at least six bacterial taxa, including Wolbachia, which is being investigated by the Gates Foundation as a potential tool to control the transmission of mosquito-borne viruses such as dengue, chikungunya and Zika.
Although male killing in Drosophila caused by S. poulsonii has been studied since the 1950s, its underlying mechanism was not known. Previous studies attributed the selective killing of male progeny to an unknown substance called ‘androcidin’, assumed to be secreted by the bacterium. Further identification of the toxin was hampered by a lack of practical methods to characterize it, but SMRT Sequencing and a chance discovery enabled the Swiss team to pinpoint the protein responsible.
“Our study has uncovered a bacterial protein that affects host cellular machinery in a sex-specific way, which is likely to be the long-searched-for factor responsible for S. poulsonii-induced male killing,” the authors write.
While studying S. poulsonii, the researchers unexpectedly identified a mutant strain that showed reduced male-killing ability (MSRO-SE; the partial male-killing strain), where almost half of the male progeny survived. To identify the genetic basis of this reduced male killing, they sequenced the genome of MSRO-SE and compared it with that of an androcide competent strain, MSRO-H99.
They found a candidate gene that was altered in the compromised strain, encoding a 1,065-amino-acid protein with ankyrin repeats and an OTU (ovarian tumor deubiquitinase) domain, which they named Spaid (S. poulsonii androcidin).
“Overexpression of Spaid in D. melanogaster kills males but not females, and induces massive apoptosis and neural defects, recapitulating the pathology observed in S. poulsonii-infected male embryos,” the authors write. Their data suggest that Spaid targets the dosage compensation machinery on the male X chromosome to mediate its effects, the paper states.
The identification of Spaid in S. poulsonii could also boost the study of androcidins in other symbiont bacteria, including Wolbachia, which has been notoriously difficult to sequence. Wolbachia contains ankyrin-repeat proteins like Spaid, but in much higher frequency than S. poulsonii. Wolbachia genomes encode more than 20 ankyrin-repeat proteins, whereas Spaid is the sole ankyrin-repeat protein in the S. poulsonii genome. While fully investigating all the ankyrin-repeat proteins in Wolbachia would be an ambitious project, the findings from S. poulsonii suggest it might be a fruitful place to search for androcidin activity.
“A thorough understanding of the reproductive manipulations induced by symbionts would not only provide novel insights into fundamental aspects of development, sex determination, and their evolution in insects, but could also provide clues to control insect populations,” the authors conclude.
Today we’re pleased to announce the release of Sequel System 6.0, including new software, consumable reagents and a new SMRT Cell.
Combined, the enhancements in the release improve the performance and affordability of Single Molecule, Real-Time (SMRT) Sequencing by providing individual long reads with greater than 99% accuracy, increasing the throughput up to 50 Gb per SMRT Cell, and delivering average read lengths up to 100,000 base pairs, depending on insert size. These improvements are expected to greatly enhance the accuracy and cost effectiveness of applications such as whole genome sequencing, human structural variant detection, targeted sequencing and RNA transcript isoform sequencing (Iso-Seq method).
- For amplicon and RNA sequencing projects, customers can generate up to 500,000 single-molecule reads with high fidelity (>99% single-molecule accuracy); and
- For whole genome sequencing projects, users can achieve up to 20 Gb per SMRT Cell with average read lengths up to 30 kb and high consensus accuracy (>99.999%).
Since SMRT Sequencing technology was first commercialized in 2011, we have increased the throughput per SMRT Cell by 2,000-fold. These ongoing throughput increases provide significant cost savings for sequencing projects in the human, plant and animal markets, giving researchers the opportunity to increase the size and scope of their projects.
“These enhancements represent the most significant improvement in terms of read length, throughput and accuracy that we have ever achieved in a single product release,” said Chief Executive Officer Michael Hunkapiller, Ph.D. “Customers can now enjoy unprecedented capabilities with a new paradigm in long-read sequencing — highly accurate single-molecule reads. Further, many users no longer need to trade off between read length and accuracy, because it is now possible to achieve Sanger-quality reads as long as 15 kb.”
Jonas Korlach, Ph.D., Chief Scientific Officer, added: “Our latest Sequel System improvements open new opportunities for comprehensively mapping all human genetic variation — from SNVs to indels to SVs — in a single assay and pave the way for a new era of population-scale, high-quality human genome studies.”
We’re proud to announce the release of the most contiguous diploid human genome assembly of a single individual to date, representing the nearly complete DNA sequence from all 46 chromosomes inherited from both parents. The sample used was derived from a Puerto Rican female who has been included in population genetics studies such as the 1000 Genomes Project. The phased diploid assembly will give unprecedented views of population-specific variation through the long-range resolution of maternal and paternal haplotypes.
This work is part of a larger effort in the field of personalized medicine and human genomics to add ethnic diversity to the available human reference genomes. More than 40 global initiatives are currently underway to apply de novo assembly methods to individuals representing multiple ethnic populations. Notable among these initiatives is the McDonnell Genome Institute at Washington University, which has contributed 11 high-quality PacBio genomes for individuals representing populations from Africa, Asia, Europe, and the Americas.
Our approach to the Puerto Rican genome relied upon the current best practices for de novo assembly while also pushing read lengths ever longer and adding new methods and data types to better tackle the problem of diploid genome assembly.
The Puerto Rican sample was sequenced on the Sequel System with 2.1 chemistry and v5.1 software using a large insert library aggressively size selected to 35 kb. The resulting contig assembly totaling 2.89 Gb has the highest contiguity to date, with half of the genome contained in gapless contigs longer than 27 Mb. These results are even better than the consistently stellar assemblies MGI has been producing, which typically have contig N50s of 20-25 Mb.
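For readers less familiar with the contiguity metric quoted above, here is a minimal sketch of how a contig N50 can be computed from a list of contig lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths: the largest length L such
    that contigs of length >= L together contain at least half of the
    total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in Mb, for illustration only.
toy_contigs_mb = [40, 30, 20, 5, 5]
print(f"Toy assembly N50: {n50(toy_contigs_mb)} Mb")
```

In the toy example the two longest contigs already cover more than half of the 100 Mb total, so the N50 is the length of the second-longest contig.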
Like the MGI genomes, the new Puerto Rican genome was assembled using FALCON, but with a newer version of FALCON-Unzip that includes algorithmic improvements to phasing and accuracy. Nearly 85% of the genome was resolved as maternal and paternal haplotypes, with more than 600 Mb of sequence in haplotype blocks longer than 1 Mb. An analysis of variants within phase blocks indicates high accuracy with 95% of SNPs showing concordant inheritance from a single parent.
In addition to the improvements in PacBio’s FALCON-Unzip assembler, the Puerto Rican assembly includes the novel use of Hi-C data to extend phasing between haplotype blocks. In collaboration with Phase Genomics, PacBio developed a new method for enhanced phasing that does not rely on family trio data. The new method, called FALCON-Phase, maps ultra-long range Hi-C reads to the FALCON-Unzip contigs to extend phasing to the contig scale. The Hi-C data was also used to scaffold the phased contigs before performing another round of phasing on the scaffolds.
The resulting assembly consists of 46 chromosome-scale scaffolds, representing the maternal and paternal chromosome sets for the Puerto Rican individual. Each set of 23 scaffolds contains only 511 gaps and totals 2.83 Gb. The remainder of each haploid genome is contained in 260 scaffolds totaling 63 Mb.
Genome: https://www.ncbi.nlm.nih.gov/genome/?term=RBJD00000000 (currently not live)
King scallops are more genetically diverse than we are? The Roesel’s bush cricket’s genome is four times the size of ours? These are just some of the findings made by scientists at the Wellcome Sanger Institute after undertaking a project to sequence the DNA of 25 wildlife species important to the United Kingdom.
Although many of the species they selected are native to the British Isles, the implications of the research are expected to extend around the globe. The project’s first data release, the Golden Eagle genome, for instance, will impact the study of eagles in North America and elsewhere, according to Sanger Institute associate director Julia Wilson.
Wilson announced the release of the remaining 24 genomes at a 25th anniversary celebration at the Institute’s Cambridgeshire campus today.
“We have learned much through this project already and this new knowledge is flowing into many areas of our large-scale science,” Wilson said. “Now that the genomes have been read, the pieces of each species puzzle need to be put back together during genome assembly before they are made available.”
Among other questions scientists will explore with the new high-quality genomes are why some brown trout migrate to the open ocean while others don’t, and why red squirrels are vulnerable to the squirrel pox virus, yet grey squirrels can carry and spread the virus without becoming ill.
The genomes—selected by scientists to include representatives of flourishing, floundering, dangerous, iconic and cryptic species, as well as five picked by the public during a nationwide vote—were decoded using SMRT Sequencing. They will now be annotated and analyzed.
“Sequencing these species for the first time didn’t come without challenges, but our scientists and staff repeatedly came up with innovative solutions to overcome them,” Wilson said.
These challenges, documented in a blog series by 25 Genomes Project coordinator Dan Mead, included everything from acquiring specimens to “exploding flatworm goop.”
“We are already discovering the surprising secrets these species hold in their genomes,” Mead said. “Similar to when the Human Genome Project first began, we don’t know where these findings could take us.”
Whereas the first human genome took 13 years and billions of dollars to complete, the Sanger Institute was able to newly sequence 25 species’ genomes in less than one year, at a fraction of the cost. The high-quality genomes will be made freely available to scientists to use in their research.
We are honored to have been part of the effort, and extend Happy Birthday wishes to our colleagues across the pond. We look forward to another 25 years of collaboration!
Dec. 3, 2018
Congratulations to the Italian team on the publication of their European barn swallow genome! The paper is now available at GigaScience.
Oct. 3, 2018
With its bold blue plumage, russet throat and chipper chirps, the barn swallow is beloved by many avian enthusiasts. It’s also a favorite of scientists, becoming a flagship species for conservation biology. Numerous evolutionary and ecological studies have focused on its biology, life history, sexual selection, response to climate change, and the divergence between its eight subspecies in Europe, Asia and North America.
But the full potential of such studies has been limited by gaps in genomic data. A 2016 draft genome for the American subspecies (Hirundo rustica erythrogaster), for example, was assembled from short, paired-end reads derived from a male individual, leading to continuity gaps and a lack of information for the W chromosome, as females are the heterogametic (ZW) sex in birds.
To address such limitations, an international team of researchers from the University of Milan, University of Pavia, and California State Polytechnic University used SMRT Sequencing at the Functional Genomics Center in Zurich and Bionano optical mapping to produce a new high-quality genome assembly for the European barn swallow, Hirundo rustica rustica.
The combination led to a final 1.21 Gb assembly with a scaffold N50 value of over 25.95 Mb, representing a more than 650-fold improvement in N50 with respect to the 2016 draft genome, as reported in the pre-print “SMRT long-read sequencing and Direct Label and Stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica).”
The primary assembly’s contiguity metrics even meet the high standards of the Vertebrate Genome Project consortium “Platinum Genome” criteria (contig N50 in excess of 1 Mb and scaffold N50 above 10 Mb).
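The thresholds quoted above are simple to express as a check. The sketch below is an illustration only: the function names are made up for this example, the contig N50 value is a hypothetical placeholder (the figure is not given in this post), and only the contiguity cut-offs mentioned here are tested, not any other criteria the consortium may apply.

```python
MB = 1_000_000  # bases per megabase


def meets_contiguity_thresholds(contig_n50_bp, scaffold_n50_bp):
    """True if both N50 values exceed the cut-offs quoted in the text:
    contig N50 above 1 Mb and scaffold N50 above 10 Mb."""
    return contig_n50_bp > 1 * MB and scaffold_n50_bp > 10 * MB


def implied_baseline_n50(new_n50_bp, fold_improvement):
    """N50 of an older assembly implied by a stated fold improvement."""
    if fold_improvement <= 0:
        raise ValueError("fold_improvement must be positive")
    return new_n50_bp / fold_improvement


scaffold_n50 = 25.95 * MB  # scaffold N50 reported in the text
contig_n50 = 5 * MB        # hypothetical placeholder, NOT from the paper

print(meets_contiguity_thresholds(contig_n50, scaffold_n50))
# A 650-fold improvement implies the 2016 draft N50 was roughly this many bases:
print(round(implied_baseline_n50(scaffold_n50, 650)))
```

Read this way, a more than 650-fold gain over the 2016 draft implies a draft N50 on the order of 40 kb.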
“Given the inception of large scale sequencing initiatives aiming to produce genome assemblies for a wide range of organisms, it is critical to identify combinations of sequencing and scaffolding approaches that allow the cost effective generation of genuinely high-quality genome assemblies,” the authors write.
“We believe that the data presented here, as well as attesting to the effectiveness of SMRT sequencing combined with DLS optical mapping for the assembly of vertebrate genomes, will provide an invaluable asset for population genetics studies in the barn swallow and for comparative genomics in birds,” they conclude.
The authors also identified several potential future projects based on the improved assembly, including: the phasing of the assembly to generate extended haplotypes, a more thorough gene annotation using RNA/Iso-Seq sequencing data, detailed comparisons with genome data from the American barn swallow, re-evaluation of data from previous population genetics studies conducted in this species, as well as characterization of the epigenetic landscape.
We look forward to additional reports, and will be keeping a bird’s eye view of work done with the genome.
Seeking sequencing for your plant and animal project? Check out opportunities available via our SMRT Grant Program.
Xiaochang Zhang, an assistant professor at the University of Chicago, is poised to get a powerful new data set to help his team understand the role of alternative splicing in brain development. His project, entitled “Uncovering mRNA splicing diversity in cerebral cortex development,” was selected as the winner of the 2018 Iso-Seq SMRT Grant Program. Sequencing for this project will be carried out by our Certified Service Provider RTL Genomics. We caught up with Xiaochang to learn more about his research and how SMRT Sequencing data will make a difference.
Q: What’s your research focus?
A: We are interested in the impact of alternative RNA splicing in neocortex development and disorders, and we are excited about the opportunity to use long-read sequencing to further address this question. Enormous neuronal cell diversity has been described, and it is speculated that the secret of neuronal cell diversity is partly hidden in the heterogeneity of neural progenitor cells. Post-transcriptional mRNA metabolism such as alternative splicing presents another layer of gene regulation and dramatically increases protein diversity. Indeed, work from others and us showed that alternative pre-mRNA splicing is widespread in developing mouse and human brains, and tight regulation of cell type-specific RNA splicing is required for human brain development. Characterizing mRNA isoforms with long-read sequencing will give us a unique chance to understand how the brain is built – we’re really excited about this.
Q: How have you pursued this prior to long-read sequencing?
A: We did bulk RNA sequencing with mouse brain cells and found hundreds of alternatively spliced exons between neural progenitor cells and post-mitotic neurons. We further analyzed a single-cell data set of fetal human brain cells and identified consistent RNA splicing changes between cell types. However, it is hard to obtain a full picture of alternative RNA splicing with short-read sequencing for genes that have multiple alternatively spliced exons. Long-read sequencing will be superior to uncover complex splicing isoforms.
Q: What do you hope to learn with the SMRT Sequencing data?
A: Single Molecule, Real-Time (SMRT) Sequencing can sequence single molecules of the longest human messenger RNAs. We are excited to directly detect the actual full-length mRNA isoforms among different brain cell types with SMRT Sequencing. We will compare long-read sequencing results with our current datasets, and try to uncover complex splicing isoforms that were previously unobservable. With this SMRT Grant we hope to get a better view of alternative RNA splicing in brain development.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win SMRT Sequencing. Also, thank you to our co-sponsor RTL Genomics for supporting the Iso-Seq SMRT Grant Program!
In addition to the most common applications, like whole genome sequencing for de novo assembly, there are several other features you can utilize to advance your science or incorporate to offer your customers a broad range of the best PacBio services. Here’s a sampling of the most recent updates and releases.
Iso-Seq Analysis for Genome Annotation or Targeted Isoform Discovery
The isoform sequence (Iso-Seq) application generates full-length cDNA sequences – from the 5’ end of transcripts to the poly-A tail – eliminating the need for transcriptome reconstruction using isoform-inference algorithms. It’s even easier to help your customers annotate their genomes or perform isoform discovery with full-length transcripts now that diffusion loading is supported for Iso-Seq projects. (For more information on switching to diffusion loading for Iso-Seq analysis projects, please contact your local FAS.)
Multiplexing for Bacterial Whole Genome Assembly
A new solution for multiplexed bacterial whole genome sequencing on the Sequel System is now available, enabling pooling of as many as 16 samples that total up to 30 Mb of genomes. With two new barcoded adaptor kits, a run setup calculator, and data analysis workflow, it’s now fast and easy for your customers to generate multiple high-quality bacterial genomes in a single Sequel System experiment.
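As a rough illustration of the pooling limits described above, the following sketch checks whether a planned pool stays within 16 samples and 30 Mb of combined genome size. It is not the PacBio run setup calculator; the function name and the example genome sizes are assumptions made for this example.

```python
MAX_SAMPLES = 16            # pooling limit quoted in the text
MAX_TOTAL_GENOME_MB = 30.0  # combined genome size limit quoted in the text


def pool_fits(genome_sizes_mb):
    """Return True if a pool of bacterial genomes respects both limits.

    `genome_sizes_mb` is a list of expected genome sizes in megabases,
    one entry per sample in the pool.
    """
    if any(size <= 0 for size in genome_sizes_mb):
        raise ValueError("genome sizes must be positive")
    return (len(genome_sizes_mb) <= MAX_SAMPLES
            and sum(genome_sizes_mb) <= MAX_TOTAL_GENOME_MB)


# Hypothetical pools of ~2.8 Mb genomes, for illustration only.
print(pool_fits([2.8] * 10))  # ten samples, about 28 Mb in total
print(pool_fits([2.8] * 11))  # eleven samples, about 30.8 Mb in total
```

The point of the example is that the combined genome size, not just the sample count, can be the limiting factor when planning a pool.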
Structural Variant Detection with Low-Fold Coverage Sequencing
The PacBio SV application provides high-sensitivity detection of structural variants in human genomes with modest coverage and a low false discovery rate. These larger variant types are typically missed with short-read methods but are known to cause disease. A simple library prep, using a modest amount (~3 µg) of unamplified genomic DNA from a blood sample, is effective for gene discovery in rare and Mendelian disease as well as broader population-scale SV characterization.
When creating a global genomic ark of creatures great and small, scientists are turning to the comprehensive coverage and quality of PacBio sequencing.
The Vertebrate Genomes Project (VGP), an international consortium of more than 150 scientists from 50 academic, industry and government institutions in 12 countries, recently released the first 15 of an anticipated 66,000 high-quality reference genomes representing all vertebrate species on Earth.
The VGP consortium spent three years selecting technologies and workflows to produce higher quality, “platinum-level” genomes, and SMRT Sequencing was selected to generate the initial assemblies.
“Until recently, sequencing the complete genome of a single animal required millions of dollars and years of effort. New sequencing technologies have dramatically reduced the cost and made it possible to reconstruct near-perfect genomes for the first time,” said VGP member Adam Phillippy of the Genome Informatics Section at the National Human Genome Research Institute.
From the duck-billed platypus to the two-lined caecilian, a limbless serpentine amphibian, the first data release represents species from all five vertebrate classes – mammals, birds, reptiles, amphibians, and fishes.
The first phase of the project will continue with the sequencing of at least one species representing each of the 260 orders of living vertebrates. Subsequent sequencing will cover all 1,045 families, then 9,478 genera, and ultimately all of the approximately 66,000 species of vertebrates.
“The last 20 years have proven the value of openly available high-quality reference genome sequences to scientific research, but until now these have mostly been available just for humans and other key organisms,” said Richard Durbin, of the University of Cambridge and the Wellcome Sanger Institute. “We are entering an era in which we will obtain reference genome sequences for all species across the Tree of Life.”
VGP is one of many large-scale international projects to sequence the DNA of thousands of plant, animal, fungal and bacterial species that have chosen PacBio Single Molecule, Real-Time (SMRT) Sequencing to assemble some of the most complete genomes to date. These comprehensive catalogs of genetic code provide valuable resources to researchers in their quest to understand the biology, physiology, development and evolution of a multitude of living organisms, and will aid in their conservation.
Another is the Bat1K initiative, an effort by Sonja Vernes of the Max Planck Institute and others to catalog the genetic diversity in 1,300 types of bats.
“The long-read sequencing technology from PacBio is allowing us to produce bat genomes of unprecedented quality and resolution as part of the Bat1K project,” said Vernes. “This is going to be a big step forward for understanding how the genes and also the non-coding DNA in these genomes influence the weird and wonderful features of bats.”
Other projects include:
- The Bird 10,000 Genomes (B10K) Project, which is aiming to generate representative draft genome sequences from all extant bird species; many of its members became founders of the Genome 10,000 consortium (G10K), which evolved into the Vertebrate Genomes Project;
- Efforts to sequence nationally significant species, such as the Sanger 25 Project by the Wellcome Trust Sanger Institute and the Canada 150 Sequencing Initiative (CanSeq150) by Canada’s Genomics Enterprise;
- The NCTC 3000 initiative by the UK’s National Collection of Type Cultures to sequence the genomes of 3,000 strains of bacteria;
- Whole Genome Assembly of the Maize NAM Founders, a multi-institutional effort to create a 26-line pangenome maize reference collection, one of many initiatives to sequence important agricultural crops to discover and utilize novel genes, traits and/or genomic regions for crop improvement and basic research;
- The Pan-Genome Analysis of Sorghum project at the Donald Danforth Plant Science Center, which includes 15 sorghum lines covering the diversity of this important bioenergy, food, and feed crop. The project is supported through the Community Science Program (CSP) of the DOE Joint Genome Institute with PacBio sequencing at HudsonAlpha Institute for Biotechnology.
- The Open Green Genomes Initiative, also supported by DOE Joint Genome Institute, which will generate high-quality genome assemblies and annotations for 35 species representing all major evolutionary lineages in the land plant tree of life.
- The Functional Annotation of Animal Genomes Project (FAANG), which is aiming to produce comprehensive maps of functional elements in the genomes of domesticated animal species;
- Marine and aquaculture efforts such as The Aqua-Hundred Genome Project;
- Insect initiatives, including the i5k Project to sequence 5,000 arthropod genomes and The Global Ant Genomics Alliance (GAGA) to sequence 200 ant species.
If you’re interested in supporting this important effort, the group is soliciting donations for ongoing project support.
Many people who run a sequencing core lab would prefer to focus on science instead of business, but all core lab managers know that it’s imperative to keep a steady stream of clients and projects filling the pipeline. In a recent blog post we offered 5 ways to attract more customers to your sequencing services. Now let’s take a look at how you can incorporate new services and upgrades into your facility.
Keeping up with the latest and greatest advancements in sequencing technology isn’t just about the sequencing instruments. Companies like PacBio regularly release instrument improvements, new chemistries, software features, and new applications for their sequencing platforms. Making sure that you are running the latest chemistries and supporting the newest features will help your lab continue to generate the best results for your customers. Here are several ways you can keep up to date with all things PacBio.
- Keep in close contact with your local FAS
The local Field Application Scientist (FAS) who trained your team and whom you call with questions is the same person who can give you real-time information on the newest releases and applications. He or she can give you the in-depth training to get started offering a new service or upgrading your current software.
- Join the Certified Service Provider Program
As a PacBio Certified Service Provider (CSP), you can take advantage of benefits that other providers cannot. Benefits include preferential consideration for early access to, and sometimes even beta testing of, new features and applications. In addition, you’ll have quarterly check-ins with the PacBio team for the latest updates and information about the products we offer. Find out more about joining our CSP Program.
- Connect with us digitally
We try to deliver a steady, but not overwhelming, stream of the latest information about the uses for SMRT Sequencing across multiple channels. From our market area newsletters (Plant and Animal, Human Biomedical, Microbial) to the snappy one-liners on Twitter, there’s a mix of communication out there perfect for keeping you informed. Subscribe to our blog, follow us on Twitter, Medium, and LinkedIn, and sign up for updates to make sure you’re getting all the latest news delivered to you as it happens.
- Attend PacBio events
Throughout the year we host a series of User Group Meetings all over the world with the goal of bringing together our customers, end users of SMRT Sequencing, and anyone else interested in learning more. These multi-day events consist of updates from PacBio staff as well as cool biological stories from many different labs covering a variety of applications. Because of the smaller nature of these events compared to large industry conferences, a lot of individual information exchange occurs and collaborations are formed. Check out our upcoming events – we hope to see you at the next one!
In an exciting paper that made the cover of Genome Research, scientists from Cold Spring Harbor Laboratory and collaborating institutions report the genome sequence and transcriptome of a commonly used breast cancer cell line. They determined that the cell line harbors far more structural variants than previously thought, with results that call into question cancer genome analysis based solely on short-read sequencing data.
In “Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line,” lead author Maria Nattestad, senior author Michael Schatz, and collaborators describe an in-depth investigation of SK-BR-3, an important model for HER2-positive breast cancer. “SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported,” they write, explaining their choice of PacBio long-read sequencing to conduct a new genomic and transcriptomic analysis of the cell line.
Investigating genomic instability is essential to understanding cancer, but attempts to do so using short-read sequencing have seen limited success due to challenges in detecting structural variation. Even large-scale cancer projects “have performed somewhat limited analysis of structural variations, as both the false positive rate and the false negative rate for detecting structural variants from short reads are reported to be 50% or more,” Nattestad et al. report. “Furthermore, the variations that are detected are rarely close enough to determine whether they occur in phase on the same molecule, limiting the analysis of how the overall chromosome structure has been altered.”
With the goal of creating a comprehensive map of structural variations in cancer, scientists sequenced the SK-BR-3 genome using SMRT Sequencing. To enable comparison between sequencing technologies, they also used a short-read technology. The team found that PacBio data was more mappable: more than 90% of PacBio reads aligned with a mapping quality of 60, while just 69% of short reads did the same. “We also observed a smaller GC bias in the PacBio sequencing compared to the Illumina sequence data,” they note, “which enables more robust copy number analysis and generally better variant detection overall.”
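For readers curious how a mappability figure like this could be tallied, here is a minimal Python sketch that computes the fraction of primary, mapped alignments with a mapping quality of 60 from SAM-formatted text. It relies only on the standard SAM column layout (FLAG in column 2, MAPQ in column 5). The function name and the toy records are hypothetical, and this is not the authors' pipeline.

```python
def fraction_at_mapq(sam_lines, mapq=60):
    """Fraction of primary, mapped alignments whose MAPQ equals `mapq`."""
    total = 0
    hits = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100), and supplementary (0x800)
        # records so that each read is counted at most once.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if int(fields[4]) == mapq:
            hits += 1
    return hits / total if total else 0.0


# Toy records for illustration only (not real data).
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t16\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t0\tchr2\t300\t12\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    "r1\t2048\tchr3\t400\t60\t5M5H\t*\t0\t0\tACGTA\t*",
]
print(f"{fraction_at_mapq(toy_sam):.2f}")
```

Counting only primary alignments is a design choice made here; whether to include supplementary alignments depends on the question being asked.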
An analysis of variants showed that long-read sequencing detected more than 17,000 structural variants of at least 50 bp in length, while the short-read data yielded only about 4,100, a difference that could largely be attributed to the lack of insertions called in the latter data set. This closely mirrors the results of researchers working on population-specific reference genomes.
The scientists coupled their genomic variant discovery with the Iso-Seq method to capture full-length transcripts from SK-BR-3, noting that short-read data often cannot span or accurately reconstruct entire isoforms. “Long reads overcome such limitations by spanning multiple exon junctions and often covering complete transcripts,” they explain. Within the transcriptome analysis, the team closely examined several gene fusions. Some of the gene fusions were found to be the product of two or three rearrangement events occurring in sequence. For example, “CYTH1-EIF3H had been discovered previously with RNA-seq and been validated with RT-PCR, but it was not known to be a ‘2-hop’ gene fusion (taking place through a series of two variants) until now,” the scientists report. “This fusion was also captured in full by several individual SMRT-seq reads that contain both variants and have alignments in both genes.” The authors also report finding direct evidence that a gene fusion previously thought to be the result of a 2-hop path is actually a 3-hop fusion.
One detailed illustration of the careful analysis performed for this project involved the ERBB2 oncogene, which is also called HER2. “We discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression,” the team writes. They were able to “reconstruct the progression of rearrangements resulting in the amplification of the ERBB2 oncogene, including a previously unrecognized inverted duplication spanning a large portion of the region.”
“Long-read sequencing can expose complex variants with great certainty and context, suggesting that more multi-hop gene fusions, inverted duplications, and complex events may be found in other cancer genomes,” the scientists conclude. “There may be many other types of complex variations present in other cancer genomes that were not found in SK-BR-3, so it is essential to continue building a catalogue of these variant types using the best available technologies.”