This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
The recent Nature paper describing the first evidence of somatic gene recombination in the human brain has been getting so much attention that we went back to the lab’s PI to learn more. Jerold Chun is Professor in the Degenerative Diseases Program and Senior Vice President of Neuroscience Drug Discovery at Sanford Burnham Prebys Medical Discovery Institute in La Jolla, Calif. He spoke with us about this remarkable discovery in the APP gene in patients with sporadic Alzheimer’s disease, the decades-long hunt for somatic recombination in genes active in the brain, and how SMRT Sequencing made a difference.
Previous efforts to find somatic recombination in the human brain failed. Why did you continue the hunt?
This goes way, way, way back. Anyone who knew about V(D)J recombination that was originally reported in the ’70s and knew something about the nervous system has been intrigued by that possibility. It was the seed for trying to identify some type of similar recombination in the brain. But back then ideas were very vague; it was simply trying to take what we knew about the immune system and projecting what might occur in the nervous system. Nevertheless, the concept remained compelling and our studies on genomic mosaicism that occurred in the interim supported something interesting going on. As it turns out, the thought was good but the details were quite different from what we originally thought. We’re now at the point where we can talk about it not as a phantom but as reality.
After all those years of looking for this evidence, what was it like to finally find it?
You kind of scratch your head about the vagaries of science. This is a concept that was written off by almost any sane scientist years ago because so much effort had gone into chasing it and nothing emerged.
In the paper, you noted that short-read sequencing had been used for these efforts in the past but wasn’t successful. Why was that?
We had originally thought that if we could use single-cell technologies which rely on short-read sequencing, it would open this area up. The challenge is that the resolution of the sequencing technology is not sufficient even to interrogate the wild type locus. Even under the best circumstances we’re pretty much around 1 million base pairs. That’s not going to allow us to see 300 kilobases, which is where the APP locus is. That was a major limitation. Also, most short-read sequencing approaches require mapping to a reference genome. If there were inversions, insertions, or deletions, they may well be missed or be filtered out because they don’t map to what was expected in the reference. As soon as PacBio came onto the scene for our work, it just became absolutely clear that this was the way to pursue it so we could look at the complete sequence of what we now know are variants.
How did your team use SMRT Sequencing for this project?
There’s a really cool and special kind of sequencing with PacBio — circular consensus sequencing, or CCS. If you have a small enough piece — say, in the 3 kb to 5 kb range — the polymerase can go around and around and around the template. As a result, you can get many, many reads of the same template, so you can line those up and take the consensus read by looking at which of the residues show up most often. This is a way to get around the inherent polymerase error rates. In so doing, you get enormously high Phred scores as well as certainty levels. I think in this case we had a median Phred score of around 93 and a certainty of 99.999999%. It was actually approaching Sanger sequencing levels of certainty.
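To make the consensus idea concrete, here is a minimal sketch of the "line them up and take the residue that shows up most often" step Chun describes. It is a toy illustration, not PacBio's CCS algorithm: it assumes the passes are already aligned to the same length and contain only substitution errors, and the function name `majority_consensus` and the example reads are made up for this sketch.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the per-position majority base across pre-aligned passes.

    Assumption for this sketch: every pass has the same length and the
    passes are already aligned column by column (no insertions/deletions).
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to the same length")

    consensus = []
    for column in zip(*passes):
        # Pick the base seen most often in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same template; three carry one error each.
passes = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",  # substitution at index 2
    "ACGTTAGA",  # substitution at index 7
    "ACGTAAGC",  # substitution at index 4
]
consensus = majority_consensus(passes)
print(consensus)
```

Because each error lands in a different column, every column still has a clear majority, which is why more passes over the same molecule tend to give a more trustworthy consensus.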
In the publication you speculated that HIV antiretroviral therapies might be used for patients with sporadic Alzheimer’s disease. Do you see that as near-term or will it take a long time to assess the possibility?
I think this is now. What we need to do is convince the clinical community to embark on it. The epidemiological signals are some of the most compelling of any that one could hope for. The total number of individuals in the United States who have HIV, are being treated with these antiretrovirals, and are at risk of Alzheimer’s disease because they are 65 or older is more than 120,000. The projections for developing Alzheimer’s in that age group is about 3% to 10%. But in 2016 the first reported case of an HIV patient with Alzheimer’s appeared in the literature, and as of now that’s the only case. I think it would be to everyone’s benefit to look at whether we can recapitulate that signal in a controlled, prospective clinical trial. Importantly, these are FDA-approved agents, some of which have been in humans since the 1980s, and thus there is sufficient proven safety to use these agents over long periods of time. Based on the science, we now have an explanation for why this might work.
This discovery must open new doors for your lab. What’s next?
There’s a new universe that’s been accessed here. It should impact both other forms of Alzheimer’s disease as well as other brain diseases and perhaps even other diseases that involve cells with a long life span. I think we’re in a position to search for and test whether gene recombination producing genomic cDNAs is more prevalent and involves other genes, and PacBio is certainly going to be a big part of that analysis.
We’re excited to report on new SMRT Sequencing advances that will ultimately help users generate extremely accurate, single-source data for large-scale genome projects. We demonstrate this new approach in a preprint on bioRxiv, and intend to fully support the new data type in upcoming product releases for the broader SMRT Sequencing community.
The preprint describes a collaborative effort to comprehensively characterize a human genome. We chose the well-analyzed HG002/NA24385 sample, available as a benchmark from the Genome in a Bottle consortium. Lead authors Aaron Wenger and Paul Peluso, senior authors David Rank and Michael Hunkapiller, and co-authors at PacBio, Google, NIST, and a host of leading academic institutions and companies contributed to the publication.
The work stems from our ongoing commitment to keep increasing the quality and usability of data generated from SMRT Sequencing systems. “Today, human genomes are sequenced at population scales, but it remains necessary to combine sequencing technologies to cover all types of genetic variation, which increases cost and adds complexity to projects,” the paper’s authors explain. “A sequencing technology with long read length and high accuracy would enable a single experiment for comprehensive variant discovery.”
To that end, the team developed a new protocol based on the CCS method, which builds a consensus sequence based on many passes across the same template. “Recent gains in read length for SMRT Sequencing and optimized DNA template preparation suggested an opportunity to unify high accuracy with long read lengths using CCS,” the scientists report.
Using the human genome as a proving ground, the authors selected a library tightly distributed around 15 kb, generated CCS reads with an average of 10 passes, and sequenced the genome to 28-fold coverage. The average read accuracy is 99.8%, matching the accuracy of the typical short read. De novo assembly of the reads yielded “a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of Q48 (99.998%),” they add.
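For readers who want to see how a Q value such as Q48 relates to a percentage such as 99.998%, the short sketch below applies the standard Phred relationship, Q = -10 · log10(error probability). The function names are our own, chosen for illustration; only the formula itself is standard.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred-scaled quality value to an accuracy between 0 and 1."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert an accuracy (0 <= accuracy < 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Assembly concordance quoted in the preprint: Q48.
print(f"Q48 corresponds to {phred_to_accuracy(48):.3%} accuracy")

# Average read accuracy quoted in the preprint: 99.8%.
print(f"99.8% accuracy corresponds to about Q{accuracy_to_phred(0.998):.0f}")
```

On this scale, every additional 10 points means a tenfold drop in error rate, so the gap between a read-level figure near Q27 and an assembly-level Q48 is roughly two orders of magnitude.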
The team also interrogated a broad range of variants and performed phasing. “We analyze the CCS reads to call SNVs, indels, and structural variants; to phase variants into haplotype blocks; and to de novo assemble the HG002 genome,” the scientists report. “The CCS performance for SNV and indel calling rivals that of the commonly-used pairing of BWA and GATK on 30-fold short-read coverage.” Detection of variants was consistently strong for SNVs (99.91%), indels (95.98%), and structural variants (95.99%). As the authors note, “Nearly all (99.6%) variants are phased into haplotypes, which further improves variant detection.”
Beyond the remarkable quality results from this protocol, the scientists note a number of other advantages with this approach. These include easier sample prep, since there is no need for ultra-long genomic DNA, reduced computational time, and the ability to use familiar tools like GATK designed for accurate reads.
Future improvements to the method — such as faster generation of HiFi reads from subreads and increasing the number of reads produced in a run — should “facilitate rapid, population-scale analysis of full genomes to improve human health,” the authors write. The HiFi protocol also will have application outside of human genomics, with utility in metagenomics as well as plant and animal genome assembly.
According to many, PacBio is the new “gold standard” in microbial sequencing. Chief Scientific Officer Jonas Korlach notes that its ability to simultaneously provide long sequencing reads (genome contiguity), high consensus accuracy (genome accuracy), minimal sequence bias (genome completeness), and methylation detection (bacterial epigenome) has made it the technology of choice for users who need to reliably produce high quality genomes.
In a presentation for the virtual Microbiology & Immunology conference, Korlach highlighted PacBio’s strengths in the field, including multiplexed microbial sequencing on the Sequel System and full-length bacterial RNA sequencing.
Microbial de novo genome assembly
Multiple bacterial genomes can now be sequenced in one SMRT Cell. Not only are bacterial chromosomes revealed, but plasmids are also assembled. This is significant because these mobile genetic elements often carry the genes that drive virulence, drug resistance, and other traits that are important to understand microbial biology, Korlach said.
Getting the full picture of bacterial plasmids allows scientists to track the transmission and follow hospital-associated infection outbreaks, for instance. It also gives insights into the evolution of bacterial strains, both through single nucleotide changes and larger structural rearrangements.
“It is now possible for the first time to really understand the evolution and generation of some of these very dangerous superbugs that are resistant to all known antibiotics,” Korlach said, citing a study of 16 Klebsiella pneumoniae isolates collected in German hospitals.
Korlach also referenced a study that solved a decades-old mystery about Spiroplasma poulsonii, a type of symbiotic bacteria that manipulate host reproduction to spread in a population by selectively killing off the sons of infected female hosts during development. A team of Swiss researchers identified the toxin responsible, located on a plasmid, using SMRT Sequencing.
“This thorough understanding of the functional ramifications was really only provided by PacBio sequencing, and forms the foundation now for thinking about controlling insect populations.”
Other advantages of PacBio sequencing? Every bacterium has multiple copies of the 16S housekeeping gene, and PacBio is the only technology able to produce multiple distinct 16S sequences per bacterial genome, Korlach said. PacBio sequencing can also overcome long-standing challenges in assembling yeast, fungi and other eukaryotic genomes connected to infectious disease.
The malaria-causing agent Plasmodium falciparum, for instance, was nearly impossible to sequence on traditional platforms because its genome is highly repetitive and AT-rich. But PacBio long-read sequencing enabled complete telomere-to-telomere de novo assembly of the genome. Another recent publication highlighted the utility of PacBio sequencing for closing yeast genomes, revealing that differently named food processing and pathogenic strains of yeast are in fact the same species.
Characterizing the bacterial epigenome and transcriptome
One of the three pillars of understanding bacterial biology is the epigenome, which PacBio is uniquely suited to characterize without any additional sequencing beyond what is needed for genome assembly. Characterizing methylomes and methylation status can shed light on how some bacteria can switch pathogenicity from not-so-dangerous to fatal, or the changes that occur when free-living bacteria associate with a host and become symbiotic instead.
Researchers at The Forsyth Institute have created a method for transforming formerly resistant bacterial targets by leveraging the epigenetic fingerprinting information revealed by SMRT Sequencing.
Their SyngenicDNA stealth-based evasion of restriction-modification barriers during bacterial genetic engineering could unlock myriad applications in basic research, industrial biology, synthetic biology, and translational science, Korlach said.
Finally, Korlach referenced a collaboration with New England Biolabs to adapt the Iso-Seq full-length RNA Sequencing protocol to bacterial transcripts, providing the first detailed look at the dynamic structure and regulation of operon transcription.
“It was an unprecedented view of the complexity of the bacterial transcriptome and bacterial gene expression that was previously hidden,” Korlach said.
Korlach also delivered a warning: Relying on existing NCBI database entries to establish what you have in your lab may be risky.
“A number of the so-called complete genomes in the NCBI database that were done a few years ago contain, in some cases, quite dramatic errors,” he said.
A survey of 20 “reference strains” contained in GenBank found that 30% had significant structural alterations when re-sequenced and assembled from scratch using PacBio technology, either due to previous sequencing errors or bacterial changes while in the repository.
“If you really want to be sure about what you have in your tube or on your plate, I suggest doing the genome from scratch,” Korlach said.
Luckily, this has become easier and more accessible with microbial multiplexing. Korlach outlined some of the nuts and bolts of the new microbial multiplexing kit, released in June, that allows researchers to pool numerous bacteria (of around 30 Mb) in one sample and sequence it in one reaction. He also pointed people to a microbial multiplexing calculator to help ensure equimolar pooling and even sequencing representation across all pooled samples despite different genome sizes, shear sizes, and sample concentrations.
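To give a feel for what "equimolar pooling" involves, here is a rough sketch of the underlying arithmetic. This is not PacBio's multiplexing calculator. It assumes double-stranded DNA at roughly 660 g/mol per base pair (a common approximation), and the function names and sample values are hypothetical. It also ignores genome-size weighting, which a real calculator would need to handle to even out coverage.

```python
# Approximate molar mass of one base pair of double-stranded DNA (assumption).
AVG_BP_MASS = 660.0  # g/mol per bp


def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration and mean fragment size to nanomolar."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def equimolar_volumes(samples, target_fmol):
    """Return the volume (in microliters) of each sample holding target_fmol.

    samples maps a sample name to (concentration in ng/uL, mean shear size in bp).
    1 nM equals 1 fmol per microliter, so volume = target_fmol / molarity.
    """
    return {
        name: target_fmol / molarity_nM(conc, frag_bp)
        for name, (conc, frag_bp) in samples.items()
    }


# Hypothetical libraries with different concentrations and shear sizes.
samples = {
    "isolate_A": (66.0, 10_000),
    "isolate_B": (33.0, 10_000),
    "isolate_C": (66.0, 20_000),
}
volumes = equimolar_volumes(samples, target_fmol=10.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
```

The point of the exercise is that a more dilute library, or one sheared to longer fragments, contains fewer molecules per microliter, so it needs a proportionally larger volume in the pool.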
And those interested in attending and presenting at ASM Microbe in San Francisco, June 20-24, should note that the deadline for abstract submissions has been extended to Jan. 25, 2019.
Scientists in Japan report using the unique properties of SMRT Sequencing to detect a structural variant (SV) responsible for a hereditary form of epilepsy. The 4.6 kb intronic repeat insertion was found from low-coverage whole genome sequence data, leading the team to suggest that this approach could be useful for determining the genetic mechanisms behind many unexplained diseases.
“Detecting a long insertion variant in SAMD12 by SMRT sequencing: implications of long-read whole-genome sequencing for repeat expansion diseases” comes from lead author Takeshi Mizuguchi, senior author Satoko Miyatake, and collaborators at Yokohama City University and the University of Occupational and Environmental Health School of Medicine. The Journal of Human Genetics paper describes how the scientists turned to long-read SMRT Sequencing after finding that short-read platforms were not well-suited to detecting large, challenging SVs. “Many patients with conditions for which the genetic cause is unknown are still encountered, suggesting that certain types of pathogenic variation evade detection by the currently available short-read technology,” the authors note. Because long-read sequencing can now routinely produce reads of 10 kb or more, they add, this technology “may pave the way for the detection of unprecedented SVs as well as repeat expansions.”
For this project, scientists worked with a Japanese family affected by benign adult familial myoclonus epilepsy, or BAFME, which generally manifests in adulthood. Previous studies had used linkage analysis to identify four loci in the family associated with the disease. The team used whole genome SMRT Sequencing to analyze one affected family member as well as three healthy controls.
Using pbsv, they identified 9,138 insertions and 6,498 deletions in the affected individual, of which 2,420 insertions and 1,086 deletions were not seen in the unaffected family members. That included six SVs in the linked SAMD12 region of interest, with a 4,661 bp insertion identified as most likely pathogenic. “The insertion was a novel sequence, rather than a tandem duplication,” the scientists report. “A total of 95.41% was found to be a low-complexity sequence.”
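The filtering step described here, keeping only the calls in the affected individual that are absent from the comparison samples, can be sketched in a few lines. This is an illustration of the general idea only, not the authors' pipeline and not how pbsv works internally: the `SV` record, the matching tolerances, and all coordinates below are invented for the example.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int
    svtype: str  # "INS" or "DEL"
    length: int


def same_event(a, b, max_dist=100, min_len_ratio=0.8):
    """Treat two calls as the same event if they agree in chromosome and type,
    sit close together, and have similar lengths (tolerances are assumptions)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    shorter, longer = sorted((a.length, b.length))
    return longer > 0 and shorter / longer >= min_len_ratio


def private_calls(affected, controls):
    """Return calls from the affected sample with no match in any control."""
    return [sv for sv in affected if not any(same_event(sv, c) for c in controls)]


# Made-up calls on placeholder chromosomes.
affected = [
    SV("chrA", 1000, "INS", 4661),
    SV("chrB", 5000, "DEL", 300),
    SV("chrC", 700, "INS", 120),
]
controls = [
    SV("chrB", 5020, "DEL", 310),  # matches the chrB deletion above
    SV("chrC", 9000, "INS", 120),  # same size, but too far away to match
]
private = private_calls(affected, controls)
for sv in private:
    print(sv)
```

Matching with a tolerance rather than exact equality matters because breakpoints and lengths for the same event rarely agree to the base pair across samples.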
The team suggests that this approach could provide an unbiased means of detecting pathogenic SVs. “These results indicate that long-read WGS is potentially useful for evaluating all of the known SVs in a genome and identifying new disease-causing SVs in combination with other genetic methods to resolve the genetic causes of currently unexplained diseases,” they report.
What has four legs, lots of fat and fur, and will possibly help uncover novel mechanisms to combat diabetes?
If humans were to undergo regular, extended cycles of weight gain and inactivity, they’d likely end up with obesity, muscle atrophy, or type 2 diabetes. But grizzly bears experience no ill effects from their annual fat gain and sedentary hibernation. Somehow they are able to switch their insulin resistance between seasons, and researchers at Washington State University are hoping to figure out how, with possible therapeutic value for humans.
We’re proud to support this outstanding research, by awarding graduate student Shawn Trojahn and Associate Professor Joanna Kelley the 2018 Plant and Animal SMRT Grant. We recently caught up with them to learn more about their research project and the bears that make it possible.
Why grizzly bears?
Well, we have access to an incredibly unique resource, the WSU Bear Research, Education, and Conservation Center, the only dedicated research populations of grizzly bears. When our lab first came to WSU five years ago, we became interested in the studies being done there, in fields from nutrition to physiology.
Scientists are really starting to appreciate hibernators. They do some pretty unusual things and they do them well. We could learn a lot from them.
A lot of the phenotypes that we see in hibernating bears could give us insight into genetic mechanisms that might be relevant for human diseases too. They gain and lose a lot of fat. They have insulin insensitivity during hibernation, which is what happens in type 2 diabetes, but they reverse the insensitivity in the active season.
We hypothesize that these reversible states are achieved in grizzly bears through differential expression of transcript isoforms, possibly with human homologs. Preliminary evidence from proteomic work on hibernating and non-hibernating bear serum supports this hypothesis, as there is no change in the identity of proteins present, but peptides differ between seasons.
What does the project entail?
We plan to compare full-length isoforms between hibernating and active bears in three metabolically active tissues: skeletal muscle, liver, and adipose.
We will also collect blood samples at different stages along the annual cycle, from six bears who have been trained from birth to take part in approved research.
The bears are pretty amazing. They can respond to cues, and present their paws for inspection, making it easy to take blood draws without the need for sedation.
How will PacBio Sequencing support this project?
SMRT Sequencing with Iso-Seq analysis is perfect for this work, as it will allow us to identify the full-length isoforms that are differentially expressed between seasons.
We’ve used SMRT Sequencing for genome assembly of organisms in extreme environments, such as polar fishes, and we’ve been following the development of the Iso-Seq method since its introduction to the field. We’re extremely excited to finally have the opportunity to try it.
What do you expect to find?
No one has done this before, so we have no idea what we will find. But we’re going to extract as much information as possible about alternative splicing, binding sites, regulators and protein translation. And we’re really excited to see what other questions might arise as a result. This will open up so many opportunities—the sky’s the limit.
We are so excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs including the 2019 Plant and Animal Science SMRT Grant, opening February 1, with video proposals. So start practicing your best YouTube-worthy pitch and stay tuned for more information.
Thank you to our co-sponsor, the University of Delaware Sequencing & Genotyping Center, for supporting the 2018 Plant and Animal Science SMRT Grant Program.
Scientists were certainly sequencing with confidence in 2018, as evidenced by the number of significant and wide-ranging advancements made using SMRT Sequencing technology, several of which made the cover of high-impact journals. As the year draws to a close, we have taken this opportunity to reflect on the many achievements made by members of our community, from newly sequenced plant and animal species to human disease breakthroughs that even captivated the popular press.
“It’s been a phenomenal year for science. We are proud of our partners and honored that our technology is helping to drive such discovery across all fields of the life sciences.”
Jonas Korlach, Chief Scientific Officer
Human Biomedical Research
Our understanding of human health and disease increased with new population-specific genomes, breast cancer cell line variants, on-target mutagenesis of CRISPR-Cas9 editing and insights into genomic cDNAs and their potential role in Alzheimer’s disease.
- Work from A. Ameur et al. from Uppsala University, “De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data,” featured in Genes and on our blog
- “Complex Rearrangements and Oncogene Amplifications Revealed by Long-read DNA and RNA Sequencing of a Breast Cancer Cell Line” by M. Nattestad et al. garnered lots of reads in Genome Research and on our blog
- A study by scientists at the Sanger Institute, “Repair of Double-Strand Breaks Induced by CRISPR-Cas9 Leads to Large Deletions and Complex Rearrangements” in Nature Biotechnology (and our blog) added to the gene editing debate
- Nature featured a “remarkable phenomenon” observed by Lee et al. in “Somatic APP Gene Recombination in Alzheimer’s Disease and Normal Neurons.”
Plant & Animal Sciences
Research in plant and animal genomes uncovered exciting biology, including a tiny animal with a huge genome that gave us a glimpse into tissue regeneration and great apes that are helping us better understand human evolution. In addition to big achievements from international consortiums such as the Vertebrate Genome Project, the Earth Biogenome Project, and the Sanger 25, we also shared in the sweet success of the sugarcane genome and explored architectural differences between maize and sorghum.
- Nature’s report on “The Axolotl Genome and the Evolution of Key Tissue Formation Regulators” by S. Nowoshilow and colleagues around the world captured the attention of the popular press, and our blog
- Kronenberg’s “High-Resolution Comparative Analysis of Great Ape Genomes” graced the cover of Science, and our blog
- “Allele-defined Genome of The Autopolyploid Sugarcane Saccharum spontaneum L” by J. Zhang et al. was also a cover star, in Nature Genetics and our blog
- Our RNA sequencing capabilities featuring the Iso-Seq method were nicely showcased in “A Comparative Transcriptional Landscape of Maize and Sorghum Obtained by Single-molecule Sequencing” by B. Wang et al. in Genome Research
Microbiology & Infectious Disease
From selfish symbiotic bacteria to HIV variants in the brain, we were enthralled by new views into the microbial world. In addition to the release of 3,000 bacterial genomes by the UK’s National Collection of Type Cultures (NCTC), scientists also contributed new methods to distinguish between microbial genomes.
- RL Brese et al. made important inroads in understanding why patients with HIV develop neurological disorders in their Journal of Neurovirology paper, “Ultradeep Single-molecule Real-time Sequencing of HIV Envelope Reveals Complete Compartmentalization of Highly Macrophage-Tropic R5 Proviral Variants in Brain and CXCR4-Using Variants in Immune and Peripheral Tissues,” also featured on our blog
- AP Douglass et al. made a concerning discovery in their PLoS Pathogens paper “Population Genomics Shows No Distinction Between Pathogenic Candida krusei and Environmental Pichia kudriavzevii: One Species, Four Names,” also featured on our blog
- Nature featured a discovery by Swiss researchers, with potential implications for insect control in other species, “Male-Killing Toxin in a Bacterial Symbiont of Drosophila,” also featured on our blog
- Nature Biotechnology described a new method of sequence binning by researchers at Icahn School of Medicine at Mount Sinai, “Metagenomic Binning and Association of Plasmids with Bacterial Host Genomes Using DNA Methylation,” also featured on our blog
Did we miss one of your favorite publications of 2018? Tweet us @PacBio, using #PoweredbyPacBio. And check out our searchable publications database for more than 1500 examples of outstanding SMRT Science from 2018.
In the rapidly evolving world of DNA sequencing, the community is often focused on what’s new and what’s next. There’s not much opportunity for retrospection. But two recent articles offer an insightful look at the history of SMRT Sequencing technology, from the time it was just a gleam in the eye of some Cornell University scientists to how it works and some exciting new applications.
At Technology Networks, reporter Ruairi MacKenzie writes about the scientific beginnings of SMRT Sequencing with memories from PacBio CSO Jonas Korlach, one of the inventors of the technology.
“Korlach concluded that if only you could see DNA polymerase doing its incredible, evolution-assisted work, then you could simply let the enzyme do the heavy lifting and take notes on its performance to create a top-quality sequencing technique,” MacKenzie reports. He describes the powerful collaboration of Korlach, Watt Webb, Steven Turner, and Harold Craighead in the project that would ultimately begin the path to PacBio.
“This started off a series of experiments that aimed to create a microscope that was, in effect, a thousand times more powerful than any currently available,” the article continues. “The eventual product of a number of attempts was the zero-mode waveguide, the foundation of PacBio’s sequencing technology.”
The article also covers the various optimization experiments that ensued, such as figuring out how to create fluorescent tags that wouldn’t decrease sequencing efficiency. If you ever wondered where SMRT Sequencing got its start, this piece provides the answer.
A Wall Street Journal article covers the past 15 years of DNA sequencing, from the public/private competition to sequence the first human genome all the way to some of the most recent and compelling scientific projects being powered by SMRT Sequencing.
The article reports on potential clinical applications for long-read sequencing, such as helping to diagnose rare diseases. “Mr. Hunkapiller says PacBio’s machines can help by detecting what are called ‘structural variants,’ changes to DNA that may involve hundreds or even thousands of base pairs, making them difficult to pick up with earlier technology,” Kyle Peterson writes. “Last year a group at Stanford was able to diagnose a young man whose heart had repeatedly grown benign tumors. One of his genes on Chromosome 17 was missing 2,200 base pairs.”
The article also describes other interesting recent applications of SMRT Sequencing platforms, including the largest known genome from the tiny Mexican salamander, the 100 ants project, and bat longevity.
It’s a great time of year to reflect on the history of the technology and how scientists are applying it today, and we encourage you to check out both articles on your next coffee break.
Advances in personalized medicine — whether it’s the discovery of a new pathogenic variant or a success story about a patient treated with a tailored therapy — seem to be almost a daily occurrence. That’s why we’re particularly excited to attend the Precision Medicine World Conference (PMWC), co-hosted by Stanford, UCSF, Duke, Johns Hopkins & U. of Michigan, taking place January 20-23 at the Santa Clara Convention Center. The meeting brings together thought-leaders of business, government, healthcare-delivery, research and technology to share the latest developments, challenges, and triumphs in the field.
PMWC is well known for giving out prestigious awards, and this year’s slate of honorees is as impressive as ever. This year’s Luminary Awards, given for recent contributions to accelerate personalized medicine, will go to Carl June at the University of Pennsylvania for his CAR-T work, Genetic Alliance’s Sharon Terry for her efforts to empower individuals with their own health data, and Feng Zhang at the Broad Institute for his development of optogenetics and CRISPR.
The Pioneer Award, which honors “rare individuals who presaged the advent of personalized medicine when less evolved technology and encouragement from peers existed,” will be given to George Yancopoulos at Regeneron Pharmaceuticals.
If you’ll be attending the meeting, don’t miss our own Lori Aro, senior director of clinical genomics, who will give a talk entitled “Sequencing with confidence: Highly accurate single-molecule long reads” on January 22nd at 8:30 am. We’re also looking forward to talks from Andrew Carroll at Google AI and Randy Scott from Invitae, among many others.
There’s still time to register for PMWC 2019 and they’re offering our blog readers a 10% discount until December 31. We hope to see you at the meeting!
You may be more likely to get five gold rings or three French hens than two Turtle doves this Christmas. The subject of the famous holiday carol is in precipitous decline across Europe, with 94 percent of Turtle doves lost since 1995, and fewer than 5,000 breeding pairs left in the UK.
In an attempt to save the species, geneticists at the Wellcome Sanger Institute identified it as a priority species to be sequenced as part of a year-long 25th anniversary project.
Collaborators at the University of Lincoln sent samples (collected from live birds during routine health checks) to the Sanger Institute. The sequencing teams extracted DNA from the samples and used SMRT Sequencing technology to generate the first reference genome for Turtle doves (Streptopelia turtur).
The results, announced today and set for release in early 2019, will provide a genetic reference for determining effective population sizes and establishing breeding programs in efforts to help conserve the threatened bird species, which has been listed as vulnerable on the International Union for Conservation of Nature (IUCN) Red List.
Jenny Dunn, from the University of Lincoln, said: “To give Turtle doves the best chance of survival in the future, we need to first understand the pressures that are affecting their population decline. The Turtle dove genome will give insights into how diseases and limited food resources impact on their health and will aid practical conservation efforts to maximize the genetic diversity of introduced populations.”
On the course to discovery
Scientists also hope to solve another mystery: how some migrating birds “see” the Earth’s magnetic fields for navigation.
To do this, Sanger and collaborators created a high-quality genome of the European robin (Erithacus rubecula), completed to the “platinum standard” set by the Vertebrate Genomes Project (contig N50 in excess of 1 Mb and scaffold N50 above 10 Mb).
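For readers unfamiliar with the N50 metric behind those thresholds, here is a minimal sketch of how it is commonly computed: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly. The function names and the toy lengths are illustrative assumptions, not values from the robin assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def meets_thresholds(contig_lengths, scaffold_lengths,
                     contig_min=1_000_000, scaffold_min=10_000_000):
    """Check the two thresholds quoted above (contig N50 > 1 Mb, scaffold N50 > 10 Mb)."""
    return n50(contig_lengths) > contig_min and n50(scaffold_lengths) > scaffold_min


# Toy, made-up lengths purely for illustration.
toy_contigs = [3_000_000, 2_000_000, 1_500_000, 500_000]
toy_scaffolds = [40_000_000, 25_000_000, 12_000_000]
print(n50(toy_contigs), n50(toy_scaffolds))
print(meets_thresholds(toy_contigs, toy_scaffolds))
```

Because N50 is weighted by length, a handful of very long contigs can carry the metric even when many short fragments remain, which is why it is usually reported alongside completeness measures.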
European robins live throughout Europe, Russia and western Siberia. While most British robins reside in the UK over winter, some birds will migrate to southern Europe to overwinter in warmer climates. Simultaneously in winter, migrant robins from Scandinavia, continental Europe and Russia head to the UK to avoid the harsh weather back home.
“Birds can use the Earth’s magnetic field as a reference for orientation during the migratory journeys, and the magnetic compass in birds was first described in a robin,” said Miriam Liedvogel from the Max Planck Institute for Evolutionary Biology in Plön, Germany. “The European robin genome will allow us to identify what’s driving migration in birds, and understand the variability of migration in other bird species as well.”
The two birds join the Golden Eagle as the first of 25 UK species to have their genetic code sequenced and assembled as part of the Sanger Institute’s 25 Genomes Project, which also includes species such as grey and red squirrels, blackberry and brown trout.
January 18, 2019
This paper is now available at Genes.
December 19, 2018
High-quality reference and de novo genomes have been celebrated by geneticists, population biologists and conservationists alike, but it’s been a dream deferred for entomologists and others grappling with limited DNA samples, due to previous relatively high DNA input requirements (~5 μg for standard library protocol).
A new low-input protocol now makes it possible to create high-quality de novo genome assemblies from just 100 ng of starting genomic DNA, without the need for time-consuming inbreeding or pooling strategies. The targeted release date for the protocol is February 2019.
The protocol, developed as a collaboration by scientists at the Wellcome Sanger Institute and PacBio, was used to assemble the genome of an Anopheles coluzzii mosquito with unamplified DNA from a single individual female insect.
As described in a bioRxiv pre-print, Sarah B. Kingan, Haynes Heaton, et al. used a modified SMRTbell library construction protocol without DNA shearing and size selection to facilitate the use of lower input amounts, as shearing and cleanup steps typically lead to loss of DNA material.
“This new low-input approach puts PacBio-based assemblies in reach for small and highly heterozygous organisms that comprise much of the diversity of life,” said co-corresponding author Jonas Korlach, our chief scientific officer.
The sample was run on the Sequel System with the latest v6.0 software, followed by de novo genome assembly with FALCON-Unzip, resulting in a highly continuous (contig N50 3.5 Mb) and complete (more than 98% of conserved genes were present and full-length) genome assembly.
About a third of the new de novo genome is haplotype-resolved and represented as two separate sequences for the two alleles, providing additional information about the extent and structure of heterozygosity that was not available in previous assemblies, all of which were constructed from many pooled individuals.
“The ability to generate high-quality genomes from single individuals greatly simplifies the assembly process and interpretation, and will allow far clearer lineage and evolutionary conclusions from the sequencing of members of different populations and species,” the authors state.
The first Anopheles gambiae genome, published in 2002, was created using BACs and Sanger sequencing. Further work over the years to order and orient contigs improved this reference and to date, AgamP4 remains the highest quality Anopheles genome among the 21 that have now been sequenced. However, AgamP4 still has 6,302 gaps of Ns in the primary chromosome scaffolds and a large bin of unplaced contigs known as the “UNKN” (unknown) chromosome.
The Sanger/PacBio single-insect assembly was able to place 667 (>90%) of the genes on the UNKN contigs into their appropriate chromosomal contexts.
The assembly’s “gap-less mega-base scale contiguity” will also provide insights into promoters, enhancers, repeat elements, large-scale structural variation relative to other species, and many other aspects relative to functional and comparative genomics questions, the authors state.
The protocol’s potential could also extend to other areas with typically low DNA input regimes, such as metagenomic community characterizations of small biofilms, DNA isolated from needle biopsy samples, and minimization of amplification cycles for targeted or single-cell sequencing applications, the authors add.
Scientists in California recently released exciting results that could offer an entirely new approach to treating the most common form of Alzheimer’s disease. The project, which was reported in a Nature publication, made extensive use of SMRT Sequencing data, drawing on both targeted sequencing and some previously released full-length RNA sequencing data.
“Somatic APP gene recombination in Alzheimer’s disease and normal neurons” comes from lead author Ming-Hsiang Lee, senior author Jerold Chun, and collaborators at the Sanford Burnham Prebys Medical Discovery Institute and the University of California, San Diego. The team aimed to determine whether somatic gene recombination, which is used throughout the genome to boost molecular diversity but has never been found in the brain, could be linked to Alzheimer’s disease.
Using an impressive array of novel and cutting-edge technologies, the scientists found evidence of significant recombination in the APP gene, which encodes amyloid precursor protein in neurons and has been associated with Alzheimer’s. They focused on APP because it has previously been shown to harbor mosaic copy number variants, with higher numbers in patients with sporadic Alzheimer’s disease (SAD). They found that the APP gene harbored thousands of variant genomic cDNAs (gencDNAs) that occurred mosaically in human neurons. The gencDNAs lacked introns and ranged from full-length cDNA copies of expressed, brain-specific RNA splice variants to myriad smaller forms that contained intra-exonic junctions, insertions, deletions, and/or single nucleotide variations.
But past attempts to find gene recombination in APP had failed. “Interrogation of APP genomic loci (about 0.3 Mb) using low-depth, short-read single-cell sequencing capable of detecting CNVs produced negative results that were complicated by resolution limitations,” the authors report. “We therefore developed an alternative strategy focused on APP in small cell populations, using nine distinct methodologies.”
Among those approaches was the use of SMRT Sequencing of PCR amplicons to assess the diversity of gencDNA sequences. The authors used small neural populations from five individuals with SAD (149 reactions from 96,434 nuclei) and five healthy brains (244 reactions from 162,248 nuclei). The authors generated CCS data and used a cut-off that provided them with ultra-high accuracy reads (99.999999% accuracy), and report that these SMRT Sequencing results were “comparable in fidelity to Sanger sequencing.”
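To put that accuracy figure in more familiar terms, read accuracy is often expressed as a Phred quality score, Q = -10 · log10(error rate). The sketch below applies that standard conversion; the function names are our own, and it is an assumption that this is how one would restate the cut-off, since the post does not say how the authors expressed it.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(quality):
    """Convert a Phred quality score back to a per-base accuracy."""
    return 1 - 10 ** (-quality / 10)


# 99.999999% accuracy corresponds to roughly one error in 100 million bases.
print(round(accuracy_to_phred(0.99999999), 2))
print(round(accuracy_to_phred(0.99), 2))
```

Under this convention, the 99.999999% figure works out to approximately Q80, compared with Q20 for 99% accuracy.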
They identified 6,299 unique sequences — including 45 different intra-exonic junctions — in neural nuclei from the brains of individuals with SAD, and 1,084 unique sequences — including 20 intra-exonic junctions — in neuronal nuclei from the non-diseased brains. “Critically, both qualitative and quantitative differences in the sequences of gencDNA variants distinguished the brains of individuals with SAD from healthy brains,” the authors note. “Distinctions included gencDNAs with novel intra-exonic junctions and SNVs, which were far more prevalent in the brains of individuals with SAD.”
Because of the need for reverse transcriptase in genomic cDNAs, the scientists also speculate that existing anti-retroviral therapies used for patients with HIV might inhibit the progression of SAD. They note that HIV patients who take such therapies and are older than 65 appear less likely to develop Alzheimer’s disease. “If confirmed, this observation would suggest the immediate use of FDA-approved [combined anti-retroviral therapy]” for patients with this form of Alzheimer’s, they write.
The team concludes with the idea that the recombination findings are unlikely to be specific to the one gene they chose to study. Additional investigation should be considered for other genes active in the brain using the types of technologies that made such a difference in this project.
It took nearly 20 years until the technology was right, and five years of hard graft by more than 100 scientists from 16 institutions, but the result was worth it, according to University of Illinois plant biology professor Ray Ming.
Ming is one of several authors of a paper, published and featured on the cover of Nature Genetics, that reports the assembly of a 3.13 Gb reference genome of the incredibly complex autopolyploid sugarcane Saccharum spontaneum L. He said he dreamed about having a reference genome for sugarcane while working on sugarcane genome mapping in the late 1990s.
But sequencing technology was not ready to handle large autopolyploid genomes until 2015, when the throughput, read length, and cost of long-read SMRT Sequencing by PacBio became competitive enough, he said.
The Saccharum spontaneum AP85-441 contig-level assembly incorporated sequencing data from a mixture of sequencing technologies, including BAC pools sequenced with short reads and whole-genome shotgun SMRT Sequencing.
Additional mapping using Hi-C allowed the team to dissect genetic information from all four haplotypes that make up the sugarcane hybrid currently used in the field, which combines Saccharum officinarum (desirable for its high sugar content) and Saccharum spontaneum (desirable for its hardiness and disease resistance).
Evolutionary changes which resulted in double duplication of the hybrid genome created additional technical challenges, quadrupling the size of the genome and introducing repeat elements that were difficult to parse out among the four haplotypes.
“By combining long sequence reads and the Hi-C physical map, we assembled an autotetraploid genome into 32 chromosomes and realized our goal of allele-specific annotation among homologous chromosomes,” Ming said.
The new assembly has already provided insights into the hybrid’s evolution as well as characteristics of its parent lines, information that could help breeders to mine effective alleles for disease resistance and other desirable traits for use in future molecular breeding efforts.
“This reference genome offers substantial new knowledge and unprecedented genomic resources for sugarcane breeders and researchers to mine disease resistance and other alleles in rearranged chromosomes from historic hybrid cultivars, and to track them in breeding populations to shorten the 13-year breeding cycle,” the authors wrote.
Cotton crops the world over have benefited from the pest-killing protein from Bacillus thuringiensis (Bt), first used in sprays and then, in 1996, transgenic crops, resulting in reduced insecticide use, enhanced biological control, and increased farmer profits. But the precious plants are under threat once again by a tiny but mighty pest: pink bollworm (Pectinophora gossypiella). In India, where more than 7 million farmers have planted 10.8 million hectares of transgenic Bt cotton, the lepidopteran pest has developed resistance to two different forms of the toxin that made the transgenic crops so effective, creating catastrophic economic losses.
Scientists have been studying the genetics of this insecticide resistance in the lab to address the issue, but do their subjects truly represent the realities of rapid resistance evolution in the field?
An international team of scientists from Arizona, India and Australia set out to find the answer, using targeted SMRT Sequencing to compare Bt-resistant lab samples with those collected from cotton fields.
As described in Scientific Reports, they used SMRT Sequencing to analyze barcoded cDNA of 22 larvae from Arizona laboratory-selected strains and Indian field-selected populations of pink bollworms. The PacBio sequencing, conducted by our certified service provider the Arizona Genomics Institute, expanded upon knowledge gleaned from previous allele-specific PCR genotyping and revealed five previously unidentified transcript variants.
The research team, led in the United States by Jeffrey A. Fabrick and Lolita G. Mathew of the U.S. Department of Agriculture’s Arid Land Agricultural Research Center in Maricopa, AZ, were specifically looking for clues into resistance to the Bt toxin Cry2Ab.
Initial Bt cotton crops produced a single toxin from the Cry1 family, Cry1Ac, but at least eight major lepidopteran pests evolved resistance to the toxin in the field, so most Bt crops grown now also produce Cry2Ab. So far, only Indian pink bollworm and Helicoverpa zea in the United States have evolved resistance to Cry2Ab, and data on the genetic basis of Cry2Ab resistance are relatively scarce and limited to laboratory-selected strains.
The researchers found both similarities and differences in the lab- and field-selected strains. They discovered that mutations disrupting the ABC transporter gene PgABCA2 are associated with resistance to Cry2Ab in both. But only one specific mutation — mis-splicing that omits exon 6 and introduces a stop codon at amino acid 373 — was shared between the Arizona and India strains. The other mutations were mostly from splice-site mutations that lead to mis-splicing of PgABCA2, and were more diverse in India.
“The differences in PgABCA2 resistance mutations between India and Arizona could reflect the difference in geographic origin, laboratory versus field selection, or both,” wrote the authors. “The results suggest that focusing on ABCA2 may help to accelerate progress in monitoring and managing field-evolved resistance to Cry2Ab.”
The researchers also noted the benefits of PacBio sequencing for monitoring variants associated with resistance in the field.
Traditional PCR, cloning, and Sanger sequencing methods are laborious and not practical for monitoring the diverse mutations in the field, they wrote, but “this method allowed us to multiplex cDNA samples from 22 individuals and obtain sequencing information from essentially single molecules of full-length PgABCA2 cDNA without post-sequencing assembly.”
“Given that PgABCA2-mediated resistance to Cry2Ab occurred in pink bollworm populations from Arizona and India, long-read sequencing focusing on this gene could provide a valuable alternative to the F1 screen for monitoring resistance to Cry2Ab in this cosmopolitan pest,” they added.
Its reliable return to the same spot year after year has made the barn swallow a beloved symbol of Spring and safe passage, for mariners and landlubbers alike. But our changing climate is altering the birds’ migratory behavior, and Italian ecologists are turning to genetics to figure out how.
As reported previously in this blog, scientists at the University of Milan joined forces with researchers from the University of Pavia and California State Polytechnic University to create the first high-quality reference genome for the European barn swallow (Hirundo rustica rustica), using SMRT Sequencing and newly available Bionano Genomics optical mapping at the Functional Genomics Center in Zurich.
To mark the publication of the work in the journal GigaScience, we spoke with Giulio Formenti, co-first author with Matteo Chiara, about their pioneering use of long-read sequencing in Italy and the technology’s potential to change the field of behavioral ecology.
Why the barn swallow?
The barn swallow is a very important species from a scientific and ecological perspective. It has been a model species in behavioral ecology, with around 1,600 studies published since 1985 about the birds’ migration, reproductive behavior and variability, but there have been very few genetic studies.
In our lab (PI Prof. Nicola Saino), we have been studying this species for a very long time, but we were hindered by a lack of a reference genome. We had been relying on genomes such as that of the flycatcher to design probes, primers and other experiments. The flycatcher is a relatively closely related species, but it has often turned out not to be close enough. It was preventing us from designing probes correctly or, even worse, led us to design probes that seemed OK, but ended up not working, ruining experiments and causing people to waste a lot of time. This is a serious issue, and the reason why it is so important to have top quality reference genomes to work with.
Why did you choose PacBio sequencing?
One of my previous projects was concerning Huntington’s Disease, which is caused by expanded C-A-G repeats. It was very hard to study these repeats due to limitations of short-read technology, but a few years ago it became possible to do it with long-read technology. So I had already explored the realm of long-read technology and proposed it for this project because I knew it was much better in terms of results. I was also inspired by the preliminary results of genomes generated by the Vertebrate Genomes Project.
When we started drafting the proposal in November 2017, I knew very little about the technical aspects of long reads. But I went to the PacBio UGM (User Group Meeting) in Barcelona, met people from the Functional Genomics Center in Zurich, and started collaborating with them. By January, they had started the sequencing, which took a few months, then we spent a few more weeks assembling the reads. By July we had completed the assembly, aligned, annotated and analyzed the results, and submitted our paper. Now, almost exactly one year later, we have a publication; for this kind of project, I think it’s quite impressive in terms of speed.
“Long reads, and in particular PacBio reads, appear to be the key to 21st century genomics.”
What challenges did you face?
Long-read technologies are not so well-known in Italy. In fact, the very first PacBio Sequel machine in Italy was purchased just a few months ago by the Department of Biology at University of Florence. So we had to send our samples to Zurich, where we also had the opportunity to use the latest optical mapping technology, which had just been released by Bionano in February. This added a whole new element to the study, and resulted in very high level, near chromosome-level scaffolding.
What are the potential impacts of this research?
Barn swallows are famous for settling into a particular spot and returning there after migration for several years. Here in Italy, people wait for the barn swallow to appear as a sign that Spring has started. But this is something that is changing as the climate changes. They are not coming back to the same place. They are now dispersing — partly to avoid inbreeding, but also because they seem to be sensing changes in the ecological conditions at their destination.
We had remarkable findings in a paper we recently published in Scientific Reports, which showed that the birds were able to predict the climate conditions of their destinations several weeks in advance, and that they were changing the timing or location of spring migration accordingly. It’s important to understand what’s going on here, and the availability of a high quality genome sequence will accelerate our efforts to learn how these adaptive processes could help populations respond to changing environmental conditions.
I hope that now that the paper is out, there will be many more people joining such efforts. We desperately need more people in the field who know how to use this technology. Most ecologists working with non-model species are still not incorporating genomics. They tend to look at phenotypes, and are not so familiar with changes at the genomic level. Studies of the effects of radiation from the Fukushima nuclear accident on butterflies, for example, have been based on phenotypic abnormalities, but no genetic evidence. This is one of the fastest developing fields in science, and we need to be able to understand genetic and bioinformatic data in order to deal with new challenges, and those that are already here. This is how modern biology is going to be.
As a scientific community, we have started to realize since the Human Genome Project that we need many genomes in addition to reference genomes, to understand variation across populations. With this project we wanted to generate a resource that we and others could then build upon.
As soon as we got the sequence, we started new projects to study the genetic impacts of environmental changes in swallows from different parts of the world, such as radiation fall-out locations. We are also re-visiting our work on migration phenotypes to study their associations with variations in genes, such as those dealing with circadian rhythm.
Their bodies are big, bony and… warm?
Unique among bony fish, Atlantic, Pacific and Southern bluefin tuna have a rare endothermic physiology that has garnered great interest among scientists. Like birds, mammals and some sharks, these kings of the sea are capable of conserving internally generated metabolic heat produced from their swimming muscles and viscera, and maintaining tissue temperatures above that of the environment.
The fish are also renowned among sushi enthusiasts for their delectable, fat-laden muscle, and prized by fishermen because of the high prices they command.
So the preservation of these species is paramount to many, and researchers are keen to monitor and manage their populations, which have declined precipitously and are now at the lowest spawning biomass levels in recorded history. But progress is being hindered by a lack of knowledge about the evolutionary and genomic processes that have driven the physiological and ecological diversification of the bluefin tunas.
Conservation genomics using SMRT Sequencing could help.
In a recent webinar hosted by Nature, Barbara Block, the Charles and Elizabeth Prothro Professor in Marine Sciences at Stanford University, joined PacBio scientist Paul Peluso to describe a project to protect Pacific and Atlantic bluefin tuna by assembling their genomes and transcriptomes.
At the Monterey Tuna Research and Conservation Center, one of the world’s only captive bluefin centers, Block and colleagues are studying the physiology, energetics, hydrodynamics and transcriptomics of the fish. But tracking the activity of the fish in their natural habitat is also vital.
Among the questions they want to answer: How do these animals adapt to their ocean realms, and what is it about the bluefins that makes them uniquely different than all other tunas in their clade? What limits their performance in a warming world? How will they adapt to hypoxia, increased CO2 and ocean acidity?
“We’re interested in monitoring their genes and transcriptomes to help us understand the health of these tunas in an ocean, but that’s not easy,” Block said. “It’s not easy because the ocean is not transparent. When tunas slip beneath the surface, it becomes hard to follow them and to monitor their populations, their transcriptomics, their genomics, and where it is they go.”
Block said her lab uses a “fish and chips” approach. “We put computers on these animals that record their journeys beneath the sea along with the environmental conditions surrounding them,” she said. “By mapping the tunas on the globe, we are able to show visually, and spatially, how these animals use our planet.”
They’ve discovered that the fish travel far, able to go from Iceland to the Gulf of Mexico, or cross from North America to the Mediterranean, in just a few months. It is not so easy to tell populations apart, but genetics has helped.
As Peluso explained, the team generated approximately 118 Gb of sequence from just under 7 million reads for the Atlantic tuna (Thunnus thynnus), and 15 million reads, yielding just over 208 Gb of sequencing for the Pacific bluefin (Thunnus orientalis). Using FALCON-Unzip, they resolved haplotypes and identified structural variants along diploid assemblies of 1.6 Gb and 1.24 Gb, respectively.
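As a rough back-of-the-envelope check on those figures, the sketch below derives the implied mean read length and the sequencing yield per assembled base. The inputs are the rounded numbers quoted above (the Atlantic read count is “just under” 7 million, so its mean read length is a slight underestimate), and the helper names are our own. The yield ratio is not a haploid coverage estimate, because the assembly sizes quoted are for diploid assemblies.

```python
def mean_read_length(total_bases, read_count):
    """Average read length in bases, given total yield and number of reads."""
    return total_bases / read_count


def yield_per_assembled_base(total_bases, assembly_size):
    """Sequenced bases divided by assembly size (a crude depth proxy)."""
    return total_bases / assembly_size


# Rounded figures quoted in the text: (yield in bp, reads, diploid assembly size in bp).
datasets = {
    "Atlantic bluefin (T. thynnus)": (118e9, 7e6, 1.6e9),
    "Pacific bluefin (T. orientalis)": (208e9, 15e6, 1.24e9),
}

for name, (bases, reads, assembly) in datasets.items():
    print(name,
          f"mean read length ~{mean_read_length(bases, reads) / 1000:.1f} kb,",
          f"yield/assembly ~{yield_per_assembled_base(bases, assembly):.0f}x")
```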
Compared with an existing Pacific tuna (T. orientalis) genome from Japan assembled with short-read technology, the new PacBio assemblies contained far fewer fragments: around 2,000 contigs, compared with 16,802 for the Japanese assembly.
“It helped us to identify some genomic differences between these two species, as well as to develop a set of probes, or markers, that could be used to profile these species in a population scale across the globe,” Peluso said.
Further study could involve deeper dives into the assemblies to compare structural variants with gene models, such as correlations between the presence or absence of genes, as well as downstream implications of enhancer or promoter regions on gene expression.
“Having highly contiguous assemblies will help address these questions,” Peluso said.
Last month’s annual meeting of the American Society of Human Genetics in San Diego was a terrific reminder of how much progress is being made in this field — both in our basic understanding of human biology and in our ability to rapidly translate discoveries into clinical utility.
The PacBio team had the privilege of hosting an educational workshop about the value of long-read SMRT Sequencing for human genetic applications. Customers from Mount Sinai and Stanford University offered their perspectives, while PacBio scientists presented data and the technology roadmap. Here, we recap the highlights and provide recordings for anyone who could not attend.
From the Icahn School of Medicine at Mount Sinai, Assistant Professor Stuart Scott gave a talk about using the PacBio system for amplicon sequencing in pharmacogenomics and clinical genomics workflows. Accurate, phased amplicons for the CYP2D6 gene, for example, have allowed his team to reclassify up to 20% of samples, providing data that’s critical for drug metabolism and dosing. In clinical genomics, Scott presented several case studies illustrating the utility of highly accurate, long-read sequencing for assessing copy number variants and for confirming a suspected medical diagnosis in rare disease patients. He noted that the latest Sequel System chemistry improved throughput and read length, while also improving the error profile and increasing the capacity for multiplexing.
Watch Stuart Scott’s presentation
In a separate talk, Janet Song from Stanford School of Medicine spoke about resolving a tandem repeat array implicated in bipolar disorder and schizophrenia. These psychiatric diseases share a number of associated genomic regions, she noted, but scientists continue to search for a specific causal risk variant in the CACNA1C gene suggested by previous genome-wide association studies. SMRT Sequencing of this region in 16 individuals identified a series of 30-mer repeats, containing a total of about 50 variants. Analysis showed that 10 variants were linked to protective or risk haplotypes. Song said she hopes to study the function of these variants in mouse models or human brain organoid models in the future.
Watch Janet Song’s presentation
Our Principal Scientist Elizabeth Tseng (@Magdoll) showed how the Iso-Seq method can be used to discover disease-associated alternative splicing. This approach to isoform sequencing yields accurate, full-length transcripts requiring no assembly, and is therefore ideal for disease studies that need a more comprehensive picture of alternative splicing activity. Tseng offered several published examples of how the Iso-Seq method has been used for everything from single-gene studies to whole-transcriptome studies, and also detailed how the latest Sequel System chemistry recovers more genes and produces more usable reads.
Watch Elizabeth Tseng’s presentation
Finally, our CSO Jonas Korlach walked attendees through recent product updates and the coming technology roadmap. The Sequel System 6.0 release offered major improvements to accuracy, throughput, structural variant calling, and large-insert libraries, he said, showing examples of 35 kb libraries. Looking ahead, Korlach said that the V2 express library preparation product should be available early in 2019, with the new 8M SMRT Cell being introduced sometime later.
Watch Jonas Korlach’s presentation
In addition to the workshop, we also presented several posters during the event:
- A Simple Segue from Sanger to High-throughput SMRT Sequencing with an M13 Barcoding System – Lori Aro, PacBio
- FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733) – Sarah Kingan, PacBio
- No-amp Targeted SMRT Sequencing using a CRISPR-Cas9 Enrichment Method – Jenny Ekholm, PacBio
- Joint Calling and PacBio SMRT Sequencing for Indel and Structural Variant Detection in Populations – Aaron Wenger, PacBio
We’d like to thank our speakers and all the ASHG scientists who took time out of a busy conference to attend our workshop and stop by our posters. Stay tuned for further coverage of the event.
The new reference genome for Aedes aegypti, just published in Nature, famously got its start through a crowdsourced effort on social media, beginning with a tweet from Rockefeller University scientist Leslie Vosshall pleading for a better mosquito resource. The insect expert has been studying mosquitoes since 2008 but for most of that time did not have access to a high-quality, highly contiguous assembly.
We chatted with her to learn more about mosquitoes, what’s possible with the new reference genome, and how this new assembly has changed the landscape for understanding mosquito biology and its implications in viral transmission.
What made the mosquito genome so challenging to sequence?
It was sequenced 10 years ago but the technology then made it impossible to piece together. It’s extremely repetitive. I like to think of it as a series of blah, blah, blah — many copies of blah, blah, blah and you cannot figure out where it fits in the overall sequence of the genome. What was available in the decade-old genome was thousands and thousands of little pieces. That made it impossible to make any progress in studying the mosquito.
How does SMRT Sequencing fit into the story?
The only way we were able to piece this together is because PacBio [sequencing] allowed us to get really long reads that would bridge all the blah, blah, blah and be able to link the whole thing together.
Now that you have this genome, what’s possible?
Now we know how many genes there are in this deadly insect, and we know where they are on chromosomes. That enables everything that comes after. Until you know where the genes are and how many there are, you can’t figure out where insecticide resistance lies. Now with this new genome we can go in with great precision and find the genes. Then you can try to understand how those animals are becoming resistant and develop insecticides that overcome that resistance.
Another example is that viruses like dengue can replicate in some mosquito strains but not others. There are some strains that are resistant to dengue, and that’s a really cool thing to try to figure out. Again, people had a vague idea that there were resistance genes somewhere and now we can really understand what makes some mosquitoes susceptible to dengue and what makes others resistant.
What spurred you to find a way to develop a new genome assembly?
I came from the field of Drosophila. The fly genome was sequenced in 2000 and it’s an incredible work of art. When I started working on the mosquito I thought, are you kidding me? I thought surely someone was going to do something about this genome. Eventually Ben Matthews and I realized nobody is dealing with this, so we pulled together this huge group and took care of it. None of us had reserves of money to put together a new genome project, so it was amazing that PacBio and other corporate sponsors and academics pulled together to get this done.
Did you ever imagine this kind of project could be launched by a tweet?
I still can’t believe we did it. It was so unlikely — I actually know nothing about genome sequencing. The genome has been out there for the last year and a half, and people have already gotten enormous use out of it. It’s really gratifying.
What’s next for your team, and how do you hope to use this new reference genome?
The genome is powering every single project in my lab. We study how the mosquito hunts people. The genome is seeping into everything — it’s helping us identify genes that allow mosquitoes to smell people, knock the genes out, and develop genetic tools. It’s so inspiring to be able to do all these things we couldn’t do before the genome came online.
Many thanks to all the PacBio users who attended our annual user group meeting, hosted in St. Louis for the first time. It was great to see so many people sharing best practices and project ideas. If you couldn’t attend, this recap will give you a sense of the highlights from the two-day meeting on Wash U’s campus and the exciting networking event at the City Museum. You can also download several of the presentations and view video recordings.
Tina Graves-Lindsay from the McDonnell Genome Institute and the Genome Reference Consortium spoke about the importance of phasing human reference genomes. Her team is now working on its fifteenth human genome assembly — part of a major effort to improve genomic representation of ethnic diversity — with a pipeline that generates 60-fold PacBio coverage for a de novo assembly, followed by scaffolding with 10x Genomics or Bionano Genomics technology. They are also using FALCON-Unzip to separate haplotypes, leading to reference-grade diploid assemblies. This approach has already helped resolve errors seen in other genomes and even the gold-standard GRCh38 build of the human reference genome.
Human disease studies were the focus for two of our plenary speakers. Thomas Ray from Duke University spoke about the molecular foundation for wiring the nervous system and studies incorporating the Iso-Seq method to generate an accurate catalog of isoforms — including some that had never been seen before — that may explain how a given gene is able to control distinct neurodevelopmental functions. The project he shared centered on retinal development, covering dystrophies and other conditions that can cause blindness.
Nenad Svrzikapa from Wave Life Sciences spoke about the importance of being able to determine the haplotype of long reads and the application of SMRT Sequencing in Wave’s PRECISION-HD1 and PRECISION-HD2 clinical trials for the treatment of Huntington’s disease. Huntington’s disease (HD) is a devastating autosomal dominant disorder characterized by cognitive decline, psychiatric illness and chorea. HD patients carry both a disease-causing mutant allele (mHTT) of the huntingtin gene (HTT) with an aberrant CAG repeat expansion and a functional wild-type allele (wtHTT). Svrzikapa highlighted that Wave’s allele selective technology enables the specific targeting of the mHTT mRNA by targeting two single nucleotide polymorphisms (SNPs) distal to the CAG repeat region. Wave used SMRT technology to bridge this distance and developed an investigational assay for phasing the SNPs with the CAG repeat. Svrzikapa showed the results of Wave’s observational study, which is the first to prospectively identify the frequency of these SNPs in patients with HD, opening the possibility that these patients may be candidates for SNP-targeted therapies such as those being developed by Wave in the PRECISION-HD1 and PRECISION-HD2 clinical trials.
Moving to the animal world, Tim Smith (@tplsmith) from the USDA’s Agricultural Research Service spoke about efforts to generate reference-grade genome assemblies for various bovine species and analyze them to understand factors such as how selective breeding has affected certain breeds. Genome assemblies he cited spanned cattle, water buffalo, and gaur. Smith showed data for each assembly, noting that as data production shifted to the Sequel System, long-read PacBio data became even better at producing highly contiguous assemblies. He shared that one of the most recent, of the Yaklander cattle interspecies, is now the best bovine assembly ever produced. Smith attributed some of the assembly quality to help from the NHGRI’s Sergey Koren (@sergekoren), who also spoke at the meeting. Koren shared his TrioBinning tool, which is useful for resolving haplotypes in virtually any species. It works well even at low coverage but does require a trio of genomes for analysis.
Insects were represented as well in talks from Evgeny Zakharov of the Canadian Centre for DNA Barcoding and Marcé Lorenzen at North Carolina State University. Zakharov focused on the analysis of vertebrate-feeding arthropods to shed light on a region’s broader biodiversity. These complex samples can be interpreted by sequencing biological barcodes, work for which he and his team implemented the Sequel System two years ago to replace Sanger sequencing. Result concordance between the two technologies was excellent, but SMRT Sequencing is much higher-throughput and lower-cost, Zakharov said. That’s why he plans to use this platform for the next phase of this study, which will involve looking at 1.5 million species. In a lightning talk, Lorenzen talked about finding upstream promoters in non-model insects, a project for which she’s using low-coverage SMRT Sequencing. This work is aimed at using molecular genetics to make these pests “less pesky,” she told attendees.
Microbial genomes were also on display at the user group meeting. Garth Ehrlich from Drexel University spoke about developing a microbiome assay that uses SMRT Sequencing to provide high-quality coverage of the 16S bacterial rRNA for species identification. The goal is a test that would enable de novo identification, no a priori knowledge required. Ehrlich showed how well his assay performs in a number of mock microbial samples, as well as in a case-control study of samples from lung cancer patients and healthy people.
Two lightning talks also fell into the microbial category. Jonathan Jacobs (@jmjacobs2) from The Ohio State University focused on plant pathogenic bacteria, which are characterized by difficult genomic repeats. Elucidating these transcription activator-like effectors can offer clues about virulence and DNA binding patterns, but long-read PacBio sequencing is needed to resolve the differences in these repeats. Microbial multiplexing makes the process quick and affordable. In the other lightning talk, Ben Auch (@sciberius) from the University of Minnesota, a PacBio certified service provider, also spoke about microbial multiplexing, which his lab performs to analyze both Gram-positive and Gram-negative species. His approach generates about 9 Gb per SMRT Cell and routinely produces single-contig assemblies at a total cost of just $250 for a microbe with a genome of 5 Mb.
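For readers who want a feel for the arithmetic behind microbial multiplexing, here is a minimal back-of-the-envelope sketch in Python. The yield per SMRT Cell (about 9 Gb) and the genome size (5 Mb) are the figures Auch quoted; the target coverage per genome is our own assumption for illustration, not a number from his talk, and the function name is hypothetical.

```python
def genomes_per_cell(cell_yield_bp, genome_size_bp, target_coverage):
    """Return how many genomes of one size fit on a single SMRT Cell
    when each genome should receive the given fold coverage."""
    bases_needed_per_genome = genome_size_bp * target_coverage
    return cell_yield_bp // bases_needed_per_genome


cell_yield_bp = 9_000_000_000  # about 9 Gb per SMRT Cell (figure from the talk)
genome_size_bp = 5_000_000     # 5 Mb microbial genome (figure from the talk)
target_coverage = 100          # ASSUMPTION: illustrative per-genome coverage

# Fold coverage if a single 5 Mb genome used the whole cell.
total_fold = cell_yield_bp // genome_size_bp

# Number of 5 Mb genomes that could share the cell at the assumed coverage.
plex = genomes_per_cell(cell_yield_bp, genome_size_bp, target_coverage)

print(f"Whole-cell coverage for one genome: {total_fold}x")
print(f"Genomes per cell at {target_coverage}x each: {plex}")
```

Under these assumptions one cell carries far more data than a single small genome needs, which is why pooling several barcoded microbes on one SMRT Cell can make sense. Real yields vary between runs, and the right coverage depends on the organism and the assembly goal.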
Two other lightning talk speakers offered application-driven presentations on the Iso-Seq method and amplicon sequencing. Katy Munson (@zhaneel779) from the University of Washington offered pro tips for providing the Iso-Seq method as a service, walking through her recommendations for minimum RNA requirements (5 µg), RIN scores (7.5 or better), and other best practices. Her presentation also showed helpful examples of results from degraded samples. In a separate talk, Dave Corney from GENEWIZ, a PacBio certified service provider, focused on the use of PacBio short and long amplicon sequencing for diverse applications. With the quality of SMRT Sequencing, he predicted that it would be the ability to amplify an amplicon that would become the bottleneck, rather than the sequencing part of the workflow. He also said that high single-molecule fidelity would pave the way for novel applications and ultimately even clinical diagnostics.
The user group meeting concluded with a presentation from our CSO Jonas Korlach about future developments in SMRT Sequencing. He spoke about the upcoming Sequel System 6.0 release of new chemistry, SMRT Cells, and analysis tools that will improve phased de novo assemblies, overall accuracy, and variant detection. He also showed early data from a prototype of the SMRT Cell 8M, which has eight times the capacity of our current SMRT Cell 1M and is scheduled to be released in 2019.
We’d like to thank our hosts for the meeting, the McDonnell Genome Institute at Washington University School of Medicine, as well as our partners: Advanced Analytical Technologies, Diagenode, DNAnexus, Integrated DNA Technologies, PerkinElmer, Phase Genomics, and Sage Science.
When scientists want to investigate human-specific evolution, the best place to start is often with a comparison to our closest cousins, the great apes. Some recent high-quality PacBio genome assemblies have provided solid new foundations for these projects, but gene annotation has proven challenging, particularly for segmental duplications — sets of gene families duplicated in the human lineage relative to our last common ancestor with the chimpanzee. Could these photocopied gene families be involved in human-specific traits like the development of a larger frontal cortex?
Until now, technical limitations have stood in the way of answering that question. Two common methods to quantify mRNA abundance, the expression microarray and short-read RNA sequencing, are not very useful when comparing paralogs that diverged so recently. Many of the human-specific segmental duplications are more than 98% identical on the genomic level.
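To see why sequence identity above 98% is such a problem for short reads, consider a rough sketch of the numbers. The 98% figure comes from the text above; the read lengths (100 bp for a short read, 2,000 bp for a full-length transcript read) and the assumption that differences are spread evenly and independently along the sequence are ours, chosen only for illustration.

```python
def expected_differences(read_length_bp, identity):
    """Expected number of positions that differ between two paralogs
    within one read, assuming uniformly distributed differences."""
    return read_length_bp * (1.0 - identity)


def prob_no_difference(read_length_bp, identity):
    """Probability that a read covers no distinguishing position at all,
    assuming each base differs independently (a simplification)."""
    return identity ** read_length_bp


identity = 0.98         # lower bound quoted in the text
short_read_bp = 100     # ASSUMPTION: typical short-read length
long_read_bp = 2000     # ASSUMPTION: illustrative full-length transcript

for label, length in [("short read", short_read_bp), ("long read", long_read_bp)]:
    diffs = expected_differences(length, identity)
    p_blind = prob_no_difference(length, identity)
    print(f"{label} ({length} bp): ~{diffs:.0f} expected differences, "
          f"P(no difference) = {p_blind:.3g}")
```

Because 98% is a lower bound, real paralog pairs often differ at even fewer positions than this sketch suggests, so a meaningful share of short reads carry nothing that assigns them to one copy, while a read spanning the whole transcript almost always does.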
Additionally, what may appear like an exact copy of a gene is often not so simple. In humans, segmental duplications can copy-paste in a genomic context to keep all of the regulatory information, effectively doubling up on that gene’s dose. But segments can also copy in a manner that loosens the selective pressure on one copy, allowing mutations to accumulate and even relegating one copy to the “lost function” or pseudogene category. Duplications can even place the new copy in a different regulatory landscape or adjacent to a neighboring gene, allowing natural gene fusion events to occur.
While a handful of human-specific duplicate genes have seen careful mRNA characterization to distinguish the expressed paralogs, the fate of many of these genes remains unknown. Since automated annotations cannot be relied upon in these highly identical regions, a recent study published in Genome Research by Dougherty and Underwood et al. took on the technical hurdles of characterizing mRNAs with isoform-level resolution for the human-specific duplicate genes.
Those hurdles were overcome largely with the PacBio Iso-Seq method, a long-read sequencing method that reads full-length isoforms. RNA from adult and developing human brain tissue was used as starting material for a modified Iso-Seq method that incorporates barcodes at both ends of the cDNA molecules. The brain cDNAs of interest were enriched using hybridization-capture techniques with probes designed against the exons of duplicate genes. This meant that isoform information for each locus could be effectively purified in cDNA form prior to sequencing.
Eight of the 19 gene families showed a nearly identical photocopy of the original gene, while the others showed patterns of gene truncation or fusion to a neighboring gene. Most of these latter cases represent new gene innovations that appear to be present only in humans.
One interesting case highlighted by the study is CD8B and its paralog, CD8B2. While the CD8B2 paralog used to be considered a pseudogene, the new isoform data indicate that the protein open reading frame is intact, with just a few amino acid changes relative to CD8B.
With better annotations in hand, the researchers went back and queried a large RNA-seq data set called GTEx to see which tissues might express these newly discovered duplicate gene isoforms. Surprisingly, most of the reads that were uniquely assignable to the CD8B2 paralog were found in brain tissue, not, like CD8B, in the blood. The scientists deduced that the segmental duplication event that created CD8B2 did not bring along the regulatory information from CD8B that drives its expression in the blood; instead, it landed in a spot with mild transcriptional activity in the brain, resulting in a complete ORF encoding mRNA for CD8B2 that is expressed in the cortex.
With this modified Iso-Seq method, scientists who know just a little about a gene can still find out a great deal about its expressed isoforms. Along with other recent capture methods, this should be broadly applicable to those interested in studying extremely close paralogs or haplotype-specific isoforms that are difficult to distinguish using short-read sequences alone.
We’re pleased to announce the winner of the 2018 Microbial Genomics SMRT Grant. Mark Webber, Research Leader at Quadram Institute Bioscience in the UK, will get free SMRT Sequencing and analysis from our certified service provider, the Genomics Resource Center at the University of Maryland. His goal is to further a project designed to understand how bacteria on the skin of premature babies in neonatal intensive care units acquire resistance to the antiseptics used to prevent infections. We spoke with Mark to learn more about his work and how the SMRT Grant will make a difference.
Q: What’s your research focus?
A: We’re interested in how bacteria deal with stress: how do bugs become resistant to drugs? We’re particularly interested in Staphylococci and how they deal with the antiseptics that we use in hospitals. We’ve looked at patients in intensive care in the UK, examining isolates over time to see how susceptible they were to two antiseptics in a large teaching hospital that had changed its antiseptic use. What we could see was that as you used more and more chlorhexidine, the bugs were more and more tolerant. When they introduced another antiseptic, octenidine, a population quickly emerged that was less tolerant of octenidine.
Q: What inspired the proposal you submitted for this SMRT Grant?
A: We have been studying premature babies in neonatal intensive care. Every year about 450,000 premature babies are born in the US and 60,000 are born in the UK. About 15,000 of these will suffer from late-onset sepsis, mainly caused by Staphylococci infections. These babies are often very immune-suppressed and the risk of death from sepsis is quite high. The babies almost all have peripheral catheters to allow feeding and drugs to be administered, but these are a potential route for bugs to get into their blood. Therefore, to prevent infection from bugs living on the skin, there’s a lot of antiseptic use. We wanted to see how antiseptic tolerance might be developing amongst the bugs living on the skin of these premature babies.
Q: What have you learned so far about this issue?
A: At hospitals in the UK and with our collaborators in Germany, we collected isolates each week from hundreds of babies over a three-month period. We now have a total of 1,300 isolates of Staphylococci which we have tested for their susceptibility to chlorhexidine, the antiseptic used in the UK, and octenidine, the antiseptic used in Germany. We have seen that babies in intensive care appear to pick up Staphylococci with high tolerance to antiseptics very quickly and our UK population is particularly robust in dealing with chlorhexidine which we use.
Q: How will you use the sequencing capacity from your SMRT Grant?
A: After studying all 1,300 isolates, we want to understand how related these bacteria are and take about 20 representatives of the major branches of the Staphylococcal family tree and get really high-quality, full genome sequences. With the PacBio long reads, we will be able to see whether there are common mobile elements that explain the acquisition of tolerance. We can also look for duplications, rearrangements, and methylation patterns which may be responsible for antiseptic adaptation. PacBio sequencing will give us a higher quality picture of what’s going on than other technologies, and this SMRT Grant will let us capture most of the major branches of the phylogeny that we’re hoping to see.
Q: What do you hope to learn when you’ve had a chance to analyze these new reference genomes?
A: Doing the bioinformatics analysis will not take us that long — it should be a matter of a few days to get a pretty decent picture of what differs between antiseptic tolerant and sensitive strains. We hope to understand the mechanisms of resistance that we can then compare to isolates collected in other parts of the world. That will help us determine whether antiseptic use is likely to fail, whether different antiseptics are more or less likely to select for tolerance and whether antiseptic tolerance is linked to antibiotic resistance. Together this information should help us understand whether we need to change the way we use antiseptics to keep babies safe.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for another chance to win SMRT Sequencing. Thank you to our co-sponsor, the University of Maryland’s Genomics Resource Center, for supporting the Microbial Genomics SMRT Grant Program!