This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
For the thousands of scientists who attended The Plant and Animal Genome Conference in San Diego this January, the sentiment seemed to be “ask not if PacBio is for you, but how PacBio can work best for you.”
The answer that emerged during PacBio’s PAG workshop and subsequent SMRT Informatics Developers Conference was a complex one.
Recent developments, such as new chemistry, new SMRT Cells, the SMRTbell Express Template Prep Kit, and SMRT Link 6.0 software have already led to faster and easier library prep, longer reads with more data and reliability, better transcript characterization (Iso-Seq) and phasing (FALCON-Unzip) capabilities (discussed by PacBio principal scientist Liz Tseng), and deeper insight with less waiting.
As senior product manager Justin Blethrow laid out in his talk, “Sequence with Confidence – How SMRT Sequencing is Accelerating Plant and Animal Genomics,” the upcoming release of a new Express Template Prep Kit will make the faster library prep available for more applications, and the Sequel II System will provide eight times the amount of data while also integrating circular consensus sequencing. Also on the product roadmap for 2019 is a low DNA input protocol for limited samples and the smallest of organisms.
Big data from tiny samples
A talk at the PacBio PAG workshop by Andrew Clark of Cornell University gave attendees a sneak peek at one of these exciting opportunities. In a collaboration with Manyuan Long of the University of Chicago and Rod Wing of the University of Arizona, the evolutionary ecologist was able to use PacBio sequencing to create new genome assemblies of 10 Drosophila species, including de novo assemblies of two individual flies, using as little as 26 ng of gDNA. Clark was most curious about why D. virilis diverges so dramatically from other species in terms of heterochromatic regions with long sections of simple satellite repeats, which make up as much as 40% of the genome.
“These regions of the genomes are really a bear to work with,” he said. But with PacBio long-read sequencing, the assemblies “flew together nicely,” with almost whole chromosome arms assembling into single contigs. And patterns were readily apparent after taking a look at the raw reads.
“This method to develop whole genome sequencing from single individuals is terrifically exciting in terms of the kinds of new questions that we can generate and answer with those data,” Clark added.
The pangenome era
Other speakers at the PAG workshop heralded the dawn of the pangenome era and delved into detail about their work to create multiple references for plant and animal species.
Max Planck researcher Sonja Vernes, director of the Bat1K consortium, discussed her group’s ambitious efforts to sequence the genome of every living bat species.
Initial data shows clear improvement in the quality of assemblies generated with long reads, she said. The PacBio assembly of the greater horseshoe bat (Rhinolophus ferrumequinum), for instance, contained just 679 contigs, to a standard of 19.9 Mbp NG50, compared to its previously posted assembly made up of 290,000 contigs at a standard of 0.01 Mbp NG50.
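For readers less familiar with the NG50 statistic quoted above, here is a minimal Python sketch of how it is usually computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the estimated genome size. The function names and the toy contig lengths are illustrative assumptions, not data from the Bat1K assemblies.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50: the length L such that contigs of length >= L
    together cover at least half of the estimated genome size."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0  # the assembly covers less than half of the genome


def n50(contig_lengths):
    """N50 is the same statistic measured against the assembly's own total length."""
    return ng50(contig_lengths, sum(contig_lengths))


# Toy contig lengths in arbitrary units (hypothetical, for illustration only).
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)          # half of 150 is reached at the 40-unit contig
toy_ng50 = ng50(toy_contigs, 200)   # half of 200 is reached at the 30-unit contig
```

Because NG50 is measured against the genome size rather than the assembly size, it allows a fairer comparison between assemblies of different total length, such as the two bat assemblies described here.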
Isoform sequence (Iso-Seq) analysis has also provided a wealth of information about transcripts that differ between different sites throughout the body, enabling comprehensive genome annotation.
“We’re really missing out on this information if we don’t go in and collect the functional data to understand the gene structure,” Vernes said.
Kevin Fengler, of Corteva Agriscience, described his work with maize. As he pointed out, genome assemblies must be very accurate and robust to be research-ready, which is why he favors a combination of the latest PacBio technology and old-fashioned manual curation to elevate scaffolds to platinum-grade assemblies.
“Base pair error is not sequence diversity, and mis-assembly is not structural diversity,” Fengler said.
He described his workflow and scaffolding assembly across maize pangenome lines, then moved on to the bigger question: what now?
“Here’s really where the fun begins,” Fengler said as he went on to present some pangenome visualization tools, including TagDots, “rapid dot blots for the pangenome era,” and PANDA (PANgenome Diversity Alignments).
A magical world: The Sequel’s sequel
At SMRT Informatics, Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology and the DOE Joint Genome Institute charted the evolution of plant sequencing over the last 10 years and gave a glimpse at its future using the Sequel II System.
A 2008 iteration of the soybean genome done with 15M Sanger reads cost 200 times the amount needed to create a more complete genome using 13 Sequel SMRT Cells in 2018, Schmutz said.
He provided more fun facts. Just how many plant genomes can one sequence with three Sequels and four full-time equivalent staff? Forty-three, Schmutz said — 6 outbred trees, 12 sorghums, 4 maizes, 4 mosses and 8 complex grasses — for a total of 48 Gb of completed genomes.
And his preview of the latest advances included a project to catalog somatic genetic and epigenetic mutations across 200-year-old poplar trees. Circular consensus sequencing (CCS) to generate HiFi reads on the new Sequel II allowed scientists to capture and correlate data from several sites along the trees, such as individual branches, as well as call SNPs, detect structural variants, and phase haplotypes.
“What can we do with the new Sequel IIs?” he asked the crowd. “Tackle outrageous sized plants, create high-quality pangenomes of species, use CCS for metagenomes, precise SNP and structural variation detection, phase haplotypes for alignable regions, and develop new hybrid, outbred, polyploid strategies.”
Other speakers at the half-day event also heralded the new HiFi paradigm, and the final session turned into friendly debates — about the ideal default parameter set for Iso-Seq to get the best bang for your buck, between PacBio scientist Liz Tseng and Roslin Institute bioinformatician Richard Kuo; and the speed of the CCS algorithm (used to generate HiFi data) and its feasibility in large-scale studies, between PacBio algorithm expert Jim Drake, HudsonAlpha’s Jeremy Schmutz and Sergey Koren of the National Human Genome Research Institute.
The next SMRT Scientific Symposium and Informatics Developers Meeting will take place May 7 – 9 in Leiden, Netherlands. Registration is now open.
PacBio posters from PAG can be viewed here:
- “Library Prep and Bioinformatics Improvements for Full-Length Transcript Sequencing on the PacBio Sequel System” – Michelle Vierra, et al.
- “A Low DNA Input Protocol for High-quality PacBio De Novo Genome Assemblies from Single Invertebrate Individuals” – Sarah B. Kingan, et al.
- “Haplotyping Using Full-Length Transcript Sequencing Reveals Allele-Specific Expression” – Elizabeth Tseng, et al.
- “Single Molecule High-Fidelity (HiFi) Sequencing with >10 kb Libraries” – Paul Peluso, et al.
The PacBio team was honored to have the opportunity to give several talks at this year’s Advances in Genome Biology & Technology conference. If you weren’t able to be there, we’ve got you covered with videos and highlights.
In a plenary session, Marty Badgett, senior director of product management, gave attendees a look at the latest results using the HiFi reads with the circular consensus sequencing (CCS) mode as well as a sneak peek at data from our soon-to-be-released Sequel II System. As he demonstrated, HiFi reads cover the same molecule many times, delivering high consensus accuracy (Q30 or 99.9%) at long read lengths.
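The "Q30 or 99.9%" equivalence follows from the standard Phred scale, where Q = -10 · log10(error probability). A minimal sketch of the conversion in Python follows; the function names are our own and are not part of any PacBio tool.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality value to accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert an accuracy between 0 and 1 (exclusive of 1) to a Phred quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


q30_accuracy = phred_to_accuracy(30)       # error probability of 1 in 1,000
q_for_999 = accuracy_to_phred(0.999)       # approximately 30
```

Each additional 10 points on the scale corresponds to a tenfold drop in the expected error rate, so Q20 is 99% and Q40 is 99.99%.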
This mode now works with fragments as long as 20 kb, as we showed in a recent preprint. Badgett offered several examples where this is useful, such as pharmacogenomic gene analysis and resolving metagenomic communities. He also updated attendees on our Iso-Seq method, which can now segregate transcripts into haplotype-specific alleles using a new tool called Iso-Phase.
Of course, the big highlight of the talk was a look at early data from the Sequel II System, which delivers approximately eight times the data of the Sequel System. Badgett showed that read length distributions and many other factors are essentially the same as the current system, but that the new model has improved raw read accuracy, taking eight passes around a molecule instead of ten to get to Q30 accuracy in CCS mode. He also presented results from Iso-Seq analysis, plant genome assembly, and continuous long-read mode. The Sequel II System is in five early access labs and will be commercially released in the second quarter of this year.
Later, CEO Mike Hunkapiller and Principal Scientist Jason Underwood gave talks in the much-anticipated technology session. Hunkapiller focused on the use of HiFi reads for comprehensive genomic analysis, offering examples such as the sequencing of a Genome in a Bottle reference sample, which concluded with Q48 accuracy, 18 Mb contigs, and clearly phased haplotypes.
That work also entailed variant analysis, which performed best using DeepVariant from Google to model the data; Hunkapiller noted that SMRT Sequencing delivered good recall and precision for deletions and insertions. The results showed that several seemingly high-confidence variant calls from previous analyses of the same sample were incorrect and added a significant number of new variants to the catalog.
Underwood spoke about single-cell isoform sequencing (scIso-Seq), focusing on a collaborative project with the labs of Evan Eichler and Alex Pollen. For this effort, scientists used Drop-seq sample prep and then loaded cDNA products onto the Sequel System. Results from a barnyard experiment using mouse and human cells as well as from cerebral organoids showed that this approach could deliver cell type-specific gene expression data. Underwood also presented data from the Sequel II System comparing chimp and human organoids, resulting in information for about 14,000 unique genes with important insights for post-transcriptional gene regulation, transcription start sites, and more.
Finally, Primo Baybayan, our Director of Applications, presented a poster entitled ‘A high-quality de novo genome assembly from a single mosquito using PacBio sequencing.’ In the poster, a modified SMRTbell library construction protocol was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System, generating, on average, 25 Gb of sequence per SMRT Cell with 20-hour movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes are present and full-length). This new low-input approach now puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
Many thanks to the AGBT organizers for inviting our team to present this exciting science!
You may have missed last week’s Advances in Genome Biology & Technology conference in sunny Marco Island, Fla., but you definitely shouldn’t miss the two posters presented there by Justin Zook and Justin Wagner from NIST’s Genome in a Bottle (GIAB) consortium.
The GIAB team has made critical progress in generating high-quality human genome reference materials and benchmarks that have helped to improve the accuracy and reproducibility of variant calling across laboratories. The latest results advance that work with an expansion of the benchmark set to include additional small (single-nucleotide variant and indel) variants and — for the first time — large (structural) variants.
The poster on structural variants (“A new benchmark for human germline structural variant calls”) describes a benchmark set of 11,869 insertions and deletions ≥50 bp and corresponding benchmark regions that span 2.69 Gb (89%) of the human genome. The set is derived from multiple technologies and is validated by manual curation and consistent inheritance in a mother-father-son trio. With tools like Truvari, the benchmark set provides a direct measure of false positives and false negatives in individual variant callsets. This will enable the improvement of structural variant calling software, just as the small variant benchmark did for single-nucleotide variants and indels.
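To make "a direct measure of false positives and false negatives" concrete: once a comparison tool has sorted a callset into true positives, false positives, and false negatives against the benchmark, the usual summary statistics follow directly from those three counts. The Python sketch below assumes the counts are already available; the function name and the example numbers are hypothetical and are not taken from the poster or from Truvari itself.

```python
def benchmark_metrics(tp, fp, fn):
    """Summarize a callset comparison against a benchmark.

    tp: calls that match a benchmark variant
    fp: calls inside benchmark regions with no matching benchmark variant
    fn: benchmark variants with no matching call
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = benchmark_metrics(tp=90, fp=10, fn=30)
```

With these made-up counts, precision is 0.90 and recall is 0.75. Restricting the comparison to the benchmark regions matters, because calls outside those regions cannot be judged as right or wrong.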
In a second poster (“Expanding the Genome in a Bottle benchmark callsets with high-confidence small variant calls from long and linked read sequencing technologies”), the GIAB team discusses expanding and improving their small variant benchmark set by integrating linked- and long-read technologies, including PacBio circular consensus sequencing (CCS) reads. CCS reads — described in a recent preprint — have similar base accuracy to typical NGS reads but are much longer, and thus map unambiguously to repetitive or low-complexity regions of the genome that are not accessible with short NGS reads.
Integrating linked reads and PacBio CCS reads expands the region over which GIAB can confidently call variants by more than 84 Mb (>3%), and detects an additional 156,000 variants (>4%), “mostly in regions difficult to map with short reads,” the authors report. In a list of medically relevant genes, this new benchmark adds 418 more variants, an increase of about 5%.
For more information about GIAB, check out the public workshop being held March 28-29 at Stanford University. As indicated on both posters, the GIAB team welcomes new collaborators interested in the accurate and complete characterization of human genomes.
A new publication in the Journal of Human Genetics describes an impressive effort to identify the pathogenic variant causing progressive myoclonic epilepsy in two siblings. The scientific team used SMRT Sequencing to discover a 12.4 kb structural variant in a repetitive, GC-rich region after several other methods — including whole exome sequencing — failed to find the answer.
The paper comes from lead author Takeshi Mizuguchi, senior author Naomichi Matsumoto, and collaborators at Yokohama City University, Aichi Prefectural Colony Central Hospital, and other institutions in Japan. As the authors note, whole exome sequencing has delivered strong results for many cases that would otherwise have gone undiagnosed; for progressive myoclonic epilepsy in particular, the diagnostic yield is 31%. “However, the remaining 69% of cases present a genetic challenge,” the scientists report. “These findings suggest that certain types of pathogenic variation evade detection by the currently available genetic analysis.”
In this project, researchers were stumped by two siblings — a 20-year-old female and a 13-year-old male — who both showed signs of a severe neurodegenerative condition. While a genetic cause was highly suspected, trio-based whole exome sequencing and a subsequent search for causative single nucleotide variants turned up no leads. The scientists then deployed SMRT Sequencing, focusing on structural variants ranging in size from 50 bp to 50 kb, especially in regions that are challenging for short-read platforms to sequence. They used the Sequel System to generate low-coverage whole genome sequencing of an affected sibling and three unrelated controls.
Analysis of the 6-fold coverage of the case sample with PacBio’s pbsv software identified more than 17,000 structural variants — including more than 7,200 deletions and nearly 10,000 insertions. The scientists filtered out structural variants seen in the control samples to quickly narrow the list of potentially causal candidates, and whittled the list further by selecting candidates that impact a coding gene. Fifty variants remained, five of which affected genes associated with an autosomal recessive phenotype. “Surprisingly, a 12.4-kb deletion call spanning the first coding exon of CLN6 was found,” the team writes. Biallelic mutations in CLN6 cause neuronal ceroid lipofuscinosis, a disease with clinical features that match those of the two siblings. Additional Southern blot and RT-PCR analysis validated the deletion and demonstrated that it was pathogenic.
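The filtering cascade described above can be sketched in a few lines of Python. This is only an illustration of the logic under simplifying assumptions, not the authors' pipeline: the `SV` record, the `prioritize` function, the variant identifiers, and the placeholder gene names `GENE_A` and `GENE_B` are all hypothetical, and a real analysis would match variants between samples by approximate position and size rather than by a shared identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    sv_id: str            # hypothetical identifier shared across samples
    sv_type: str          # "DEL" or "INS"
    coding_genes: tuple   # coding genes the variant overlaps (may be empty)


def prioritize(case_svs, control_ids, recessive_genes):
    """Apply the three filters in the order described in the text."""
    # Step 1: drop variants that were also seen in a control sample.
    novel = [sv for sv in case_svs if sv.sv_id not in control_ids]
    # Step 2: keep variants that affect at least one coding gene.
    coding = [sv for sv in novel if sv.coding_genes]
    # Step 3: keep variants in genes tied to an autosomal recessive phenotype.
    return [
        sv for sv in coding
        if any(gene in recessive_genes for gene in sv.coding_genes)
    ]


# Toy input (illustrative only).
case_svs = [
    SV("sv1", "DEL", ("CLN6",)),
    SV("sv2", "INS", ()),            # no coding gene affected
    SV("sv3", "DEL", ("GENE_A",)),   # also present in a control
    SV("sv4", "INS", ("GENE_B",)),   # gene not on the recessive list
]
candidates = prioritize(case_svs, control_ids={"sv3"}, recessive_genes={"CLN6", "GENE_A"})
```

In this toy example only `sv1` survives all three filters, mirroring how the study narrowed more than 17,000 calls to a handful of candidates.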
With this finding in hand, the team went back to try to understand why the deletion had proven so elusive earlier. Two exome analysis methods “completely missed the homozygous CLN6 deletion … probably due to the scanty read coverage against CLN6 exon 1 with high GC content (77.6%) even in controls,” the scientists report. “By contrast, PacBio long reads showed uniform coverage … which improved the variant detection in GC-rich regions containing multiple repetitive elements.” Even with only three SMRT Sequencing reads of the CLN6 region, “the long sequences of the reads conferred excellent mappability and ensured the robust detection of [structural variants],” the team adds.
The authors encourage other scientists to consider using long-read sequencing for similar cases where exome analysis reveals no pathogenic variants. They also call for the development of a robust structural variation database, along the lines of what gnomAD does for small variants. “For the purpose of reducing the number of candidate of diseases-causing mutations, it would be extremely beneficial if a public database for [structural variants] were available,” they note.
Please join us in congratulating Kristen Sund from Cincinnati Children’s Hospital Medical Center for winning our 2018 Structural Variation SMRT Grant Program!
Her proposal to use SMRT Sequencing to pinpoint the genetic mechanism responsible for neurological disease in patients with complex structural rearrangements definitely captured our attention. We caught up with Kristen to learn more about her background, her research, and how she hopes to use the data generated through this grant.
How did you get into this field?
I have always had a very strong interest in research and patient care, so I decided to get training as a genetic counselor and to get my PhD in molecular and developmental biology. I guess it makes sense from there that I am constantly looking for ways to use the latest technologies to find the genetic cause for disorders that were previously undiagnosable.
What does your day-to-day work look like?
Right now, I’m a laboratory fellow in a combined program for cytogenetics and molecular genetics at the ABMGG Laboratory for Genetics and Genomics. My activities focus on learning everything from the wet lab to analysis to quality control to interpretation for clinical genetic testing. What I really love about the combined approach to molecular genetics and cytogenetics is that it allows us to fully integrate what we’re doing for a particular case and focus on finding an answer. It feels more holistic.
What’s the background behind your SMRT Grant proposal?
When I was a genetic counselor in the lab, I was involved with research projects that focused on using the latest genetic technologies. At the time we were not offering clinical whole exome sequencing and there was a strong interest in using the technology on a research basis for some families that hadn’t been diagnosed. I wound up developing an analysis algorithm which I’m sure is very primitive by today’s standards, but at the time it got the job done. We actually solved a number of those cases. I loved that work — getting to know the families and being able to find them an answer in some cases. In my lab now, we do offer whole exome sequencing, but I began wondering what else we could do with other technologies that wasn’t possible with exome sequencing. How could we use long-read sequencing to search for answers for cases that are undetectable with other technologies?
What is it about these cases that makes them challenging to solve with other approaches?
Here’s one example of a case that we’re planning to submit for long-read sequencing. This patient has a neurologic phenotype and a known chromosome abnormality that is a little bit unusual because it involves two chromosomes and four chromosome breaks from an insertion and a translocation. The patient has had extensive follow-up testing including a SNP microarray and a couple of NGS panels, all of which came back normal. I’m convinced that one of these breakpoints holds the answer. I’ve been able to estimate the location of the breakpoint and some genes that might be in the region, but all we can do is guess until we can get a higher resolution look at the breakpoints and hopefully find a gene of interest.
What does it mean to long-undiagnosed patients to finally get an answer?
Families use the information in different ways. One family that comes to mind started a support group through Facebook. This child was a teenager, so this family had been dealing with this her whole life, but they didn’t know what to expect for her prognosis or how to explain it to other people. For them, it was huge to get an answer. There are no real treatment options, but it meant so much to the family to find out what to expect.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor, the University of Minnesota Genomics Center, for supporting the 2018 Structural Variation SMRT Grant Program!
For Research Use Only. Not for use in diagnostic procedures.
With their large brains, sophisticated sense organs and complex nervous systems, cephalopods could teach us a thing or two about learning, memory, and adaptability. But despite their evolutionary, biological, and economic significance, their genome information is still limited to a few species.
To bridge this gap, a team of Korean scientists has assembled the genome of the common long-arm octopus (Octopus minor) using PacBio technology to sequence both the DNA and RNA of the emerging model species.
Found in Northeast Asia, particularly in coastal mudflats of South Korea, China, and Japan, O. minor has become a major commercial fishery product with a high annual yield. It is also a promising organism for studying the molecular basis of plasticity and adaptation to the harsh environmental conditions in mudflats, which are subject to temperature changes, steep salinity and pH gradients, varying oxygen availability, wave action, and tides.
With this in mind, Hye Suck An from the National Marine Biodiversity Institute of Korea and colleagues from Korea Polar Research Institute, Chungbuk National University, and the University of Science & Technology of Yuseong-gu, Daejeon, created a 5.09 Gb assembly of the challenging genome, nearly half of which was composed of repeat elements, and identified 30,010 genes.
Additionally, they annotated the genome using the Iso-Seq method on pooled RNA from thirteen organs. This enabled them to identify characteristics like intron length, protein-coding genes and transposable elements.
Together, these data enabled them to elucidate the molecular mechanisms underlying O. minor’s adaptations.
As described in a paper in GigaScience, the team also compared their results with the published genome and multiple transcriptomes of the California two-spot octopus (Octopus bimaculoides) to study their evolution.
“We discovered that they evolved recently and independently from the octopus lineage during the successful transition from an aquatic habitat to mudflats,” the authors stated. “We also found evidence suggesting that speciation in the genus Octopus is closely related to the gene family expansion associated with environmental adaptation.”
Curiosity of the Catfish
A reference genome for another important aquaculture species, the yellow catfish (Pelteobagrus fulvidraco), has also been created by scientists in China.
Popular in fisheries of Southern China, the species has suffered from germplasm degeneration and poor disease resistance. Similar to other aquaculture species, like the previously featured tilapia, sex is an important commercial trait, with adult males growing two- to three-fold bigger than females.
In order to decipher the economic traits and sex determination of the species, the multi-institutional team constructed “the first high-quality chromosome-level genome assembly” of the yellow catfish, using PacBio sequencing as the base technology to build long contigs.
As reported in GigaScience, they annotated the assembly, identifying 24,552 protein-coding genes, and explored the phylogenetic relationships of the yellow catfish with other teleosts. They found almost 2,000 gene families that were expanded in the yellow catfish, mainly enriched in immune system, signal transduction, glycosphingolipid biosynthesis and fatty acid biosynthesis, providing a rich path for future studies.
“We believe that the high-quality reference genome generated in this work will … accelerate the development of more efficient sex control techniques and improve the artificial breeding industry for this economically important fish species,” the authors wrote.
Getting the word out about your services is a surefire way to get more interest — and ultimately more projects — into your pipeline. The good news is, it doesn’t take a marketing specialist or tens of thousands of dollars to get started. You’d be surprised at some of the big gains you can get with a little outlay.
There are many ways to boost your name in the genomics service provider world at any budget level. Here we highlight our top three. Start with one and work up from there!
No matter your personal feelings about social media, it’s been shown again and again to be a great way to engage potential customers. Platforms like Twitter and LinkedIn give you an opportunity to broadcast promotional pricing, exciting results, or new services you’re excited about for exactly zero dollars and just a little bit of your time. Our advice to you is to get an account on both platforms, put some effort into a good description of your products and services, and spend 10-20 minutes a day interacting with others on the sites. Not only will you be able to reach more people yourself, but you’ll start to notice the upcoming movers and shakers, be able to answer questions directly, and see trends in the markets you’re interested in penetrating.
Some best practices for social media include:
- Be concise – Cut to the point right off the bat. After all, you only have 280 characters.
- Tag people – Did you get a great result with a collaborator or think a particular scientist would be interested in the result? Tag them! People are more likely to engage with posts in which they are tagged.
- #hashtags – Whether witty or targeted, including a simple tag that is specific to your core facility encourages people to promote your services for you. Just take a look at #PoweredByPacBio to see what we mean.
- Include links – References and links directly to the information you’re looking to spread encourage people to click on the content.
Get an email marketing service and keep in touch with your customers and prospects. Every inquiry, project, and person who comes by your booth at conferences is a lead – and leads need to be nurtured. Using a simple email marketing service like MailChimp or Campaign Monitor can make keeping in touch with them feel like less of a burden, and gives you an opportunity to stretch your creativity muscles. You can do individualized follow-up emails to prospects or start a monthly newsletter to share your current services and capabilities. And if you take the time to develop some content (a case study, a recorded webinar, or a brochure) you can create an automated drip campaign to send out emails at specific intervals, delivering your relevant content directly to prospects’ inboxes.
Some best practices for email marketing include:
- Catch their eye – A well thought-out subject line can draw someone in who may otherwise skip your email. Be sure to entice them with an offer!
- Keep it simple – With our unlimited access to information via the internet, it’s easy to get overwhelmed by content. Be sure to hone the message you’re trying to send and only include content in your emails that strengthens that message.
- Repetition is key – It takes anywhere from 3-5 interactions with a lead to turn them into a customer, so don’t feel bad about sending multiple emails. Just make sure to spread them over several weeks.
Webinars, or video seminars, give you unique opportunities to speak with hundreds of people from all over the world in one place without the cost of an event sponsorship or plane ticket. You can ask one or several of your customers to present exciting research that was enabled by your services, and then give a quick overview of the products and services you have available. There are many options when it comes to hosting webinars. Publishers, such as Nature, give you the benefit of access to their marketing team with promotion, logistics, and lead capture, at a price of about $10,000. More reasonably priced webinar hosting services with monthly fees include GoToWebinar and Zoom. And if you’re really on a budget and don’t expect more than 100 attendees, there are free services like ezTalks.
Some best practices for webinars include:
- Be prepared – Make sure you have your speakers and content ready long beforehand, so you can promote the event and capture leads via a registration page.
- Promote the event – A webinar is only as successful as the number of people who attend (or sign up). You will want to put a little bit of effort into promoting and reminding your target audience about the webinar via email and social media.
- Follow up – So you had 75 attendees, and everything went great? Awesome! Now it’s time to follow up with the folks in attendance (and registrants) to get them into the projects pipeline.
We hope this list was helpful and inspires your marketing strategy for 2019! Remember to start small with measurable results and, most importantly, have fun with it.
In an effort to produce a comprehensive list of structural variants in the human genome, scientists from the University of Washington, the University of Chicago, Washington University, and Ohio State University sequenced 15 human genomes and have now released the results of their in-depth analysis.
The Cell publication, “Characterizing the Major Structural Variant Alleles of the Human Genome,” comes from lead authors Peter Audano and Arvis Sulovari, senior author Evan Eichler, and collaborators. The data generated by this work “provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity,” the authors report.
The analysis represents remarkable genomic diversity. The team used the PacBio RS II and the Sequel System to produce high-coverage, long-read sequence data for 11 diploid genomes, primarily sourced from HapMap samples and spanning Yoruban, Gambian, Luhya, Han Chinese, Vietnamese, Puerto Rican, Colombian, Peruvian, Telugu, Northwestern European, and Finnish ethnicities. They also used existing PacBio genome assemblies for two hydatidiform moles (CHM1 and CHM13) as well as the recently published Korean (AK1) and Chinese (HX1) genomes.
From this wealth of long-read data, the scientists then resolved and annotated nearly 100,000 common structural variants (defined as insertions, deletions, or inversions at least 50 bp long). Of those, more than 2,200 variants were shared by all genomes analyzed and another 13,000 were detected in most genomes — “indicating minor alleles or errors in the reference,” the team notes. Most of the variants were not reported in previous studies that relied on short-read technology. “Importantly, the breakpoints and content of these major alleles are now resolved at the single-base-pair level,” the scientists add, “providing the requisite sequence specificity on a GRCh38 coordinate system to begin to develop not only alternate haplotypes, but also to develop a more comprehensive graph-based assembly representation of the human genome.” The authors also noted that there are more structural variant alleles to discover, estimating that adding 35 more genomes (50 total) would increase the number of alleles by 39%.
The scientists also make clear that this kind of study would not have been possible even a few years ago. “Recent advances in sequencing technology have now allowed us to systematically whole-genome shotgun (WGS) sequence large stretches (>10 kbp) of native DNA without the need to propagate clone inserts in E. coli,” they explain. “This is particularly advantageous for structural variation since the long reads provide the necessary context to anchor and sequence resolve most structural variants (SVs) irrespective of sequence composition.” In addition, the analysis determined that variants were more likely to be found in GC-rich or GC-poor sequences, which means they “were likely problematic to clone, sequence, and assemble using large-insert BAC clones” during the Human Genome Project, the scientists add.
The results of this impressive work now comprise the first database of structural variants in control individuals sequenced with long reads, making it a valuable resource for researchers seeking to discover pathogenic structural variants associated with particular diseases. “The sequences we now add to the human genome provide the necessary substrate to discover new disease associations, especially as they relate to repeat instability,” the authors conclude.
There are more structural variants waiting to be found in human genomes. If you’re interested in related research, use our project calculator to estimate the time and materials needed and to get suggested study designs.
The recent Nature paper describing the first evidence of somatic gene recombination in the human brain has been getting so much attention that we went back to the lab’s PI to learn more. Jerold Chun is Professor in the Degenerative Diseases Program and Senior Vice President of Neuroscience Drug Discovery at Sanford Burnham Prebys Medical Discovery Institute in La Jolla, Calif. He spoke with us about this remarkable discovery in the APP gene in patients with sporadic Alzheimer’s disease, the decades-long hunt for somatic recombination in genes active in the brain, and how SMRT Sequencing made a difference.
Previous efforts to find somatic recombination in the human brain failed. Why did you continue the hunt?
This goes way, way, way back. Anyone who knew about V(D)J recombination that was originally reported in the ’70s and knew something about the nervous system has been intrigued by that possibility. It was the seed for trying to identify some type of similar recombination in the brain. But back then ideas were very vague; it was simply trying to take what we knew about the immune system and projecting what might occur in the nervous system. Nevertheless, the concept remained compelling and our studies on genomic mosaicism that occurred in the interim supported something interesting going on. As it turns out, the thought was good but the details were quite different from what we originally thought. We’re now at the point where we can talk about it not as a phantom but as reality.
After all those years of looking for this evidence, what was it like to finally find it?
You kind of scratch your head about the vagaries of science. This is a concept that was written off by almost any sane scientist years ago because so much effort had gone into chasing it and nothing emerged.
In the paper, you noted that short-read sequencing had been used for these efforts in the past but wasn’t successful. Why was that?
We had originally thought that if we could use single-cell technologies which rely on short-read sequencing, it would open this area up. The challenge is that the resolution of the sequencing technology is not sufficient even to interrogate the wild type locus. Even under the best circumstances we’re pretty much around 1 million base pairs. That’s not going to allow us to see 300 kilobases, which is where the APP locus is. That was a major limitation. Also, most short-read sequencing approaches require mapping to a reference genome. If there were inversions, insertions, or deletions, they may well be missed or be filtered out because they don’t map to what was expected in the reference. As soon as PacBio came onto the scene for our work, it just became absolutely clear that this was the way to pursue it so we could look at the complete sequence of what we now know are variants.
How did your team use SMRT Sequencing for this project?
There’s a really cool and special kind of sequencing with PacBio — circular consensus sequencing, or CCS. If you have a small enough piece — say, in the 3 kb to 5 kb range — the polymerase can go around and around and around the template. As a result, you can get many, many reads of the same template, so you can line those up and take the consensus read by looking at which of the residues show up most often. This is a way to get around the inherent polymerase error rates. In so doing, you get enormously high Phred scores as well as certainty levels. I think in this case we had a median Phred score of around 93 and a certainty of 99.999999%. It was actually approaching Sanger sequencing levels of certainty.
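The "line those up and take the consensus" idea can be illustrated with a toy majority vote. This is only a sketch: it assumes the passes are already aligned to the same length and contain substitution errors only, and the function name and example reads are made up for illustration. Production CCS software uses alignment and a statistical model rather than a simple vote.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the most frequent base at each position across pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to the same length")

    consensus = []
    # zip(*passes) walks the reads column by column, one position at a time.
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same template; three carry one random error each.
passes = [
    "ACGTTAGC",
    "ACGTAAGC",
    "ACCTTAGC",
    "ACGTTAGC",
    "ACGTTAGG",
]
print(majority_consensus(passes))
```

Because the errors fall at different positions in different passes, each one is outvoted by the other four reads, which is why more passes around the template give higher consensus accuracy.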
In the publication you speculated that HIV antiretroviral therapies might be used for patients with sporadic Alzheimer’s disease. Do you see that as near-term or will it take a long time to assess the possibility?
I think this is now. What we need to do is convince the clinical community to embark on it. The epidemiological signals are some of the most compelling of any that one could hope for. The total number of individuals in the United States who have HIV, are being treated with these antiretrovirals, and are at risk of Alzheimer’s disease because they are 65 or older is more than 120,000. The projections for developing Alzheimer’s in that age group is about 3% to 10%. But in 2016 the first reported case of an HIV patient with Alzheimer’s appeared in the literature, and as of now that’s the only case. I think it would be to everyone’s benefit to look at whether we can recapitulate that signal in a controlled, prospective clinical trial. Importantly, these are FDA-approved agents, some of which have been in humans since the 1980s, and thus there is sufficient proven safety to use these agents over long periods of time. Based on the science, we now have an explanation for why this might work.
This discovery must open new doors for your lab. What’s next?
There’s a new universe that’s been accessed here. It should impact both other forms of Alzheimer’s disease as well as other brain diseases and perhaps even other diseases that involve cells with a long life span. I think we’re in a position to search for and test whether gene recombination producing genomic cDNAs are more prevalent and involving other genes, and PacBio is certainly going to be a big part of that analysis.
We’re excited to report on new SMRT Sequencing advances that will ultimately help users generate extremely accurate, single-source data for large-scale genome projects. We demonstrate this new approach in a preprint on bioRxiv, and intend to fully support the new data type in upcoming product releases for the broader SMRT Sequencing community.
The preprint describes a collaborative effort to comprehensively characterize a human genome. We chose the well-analyzed HG002/NA24385 sample, available as a benchmark from the Genome in a Bottle consortium. Lead authors Aaron Wenger and Paul Peluso, senior authors David Rank and Michael Hunkapiller, and co-authors at PacBio, Google, NIST, and a host of leading academic institutions and companies contributed to the publication.
The work stems from our ongoing commitment to keep increasing the quality and usability of data generated from SMRT Sequencing systems. “Today, human genomes are sequenced at population scales, but it remains necessary to combine sequencing technologies to cover all types of genetic variation, which increases cost and adds complexity to projects,” the paper’s authors explain. “A sequencing technology with long read length and high accuracy would enable a single experiment for comprehensive variant discovery.”
To that end, the team developed a new protocol based on the CCS method, which builds a consensus sequence based on many passes across the same template. “Recent gains in read length for SMRT Sequencing and optimized DNA template preparation suggested an opportunity to unify high accuracy with long read lengths using CCS,” the scientists report.
Using the human genome as a proving ground, the authors selected a library tightly-distributed at 15 kb, generated CCS reads with an average of 10 passes, and sequenced the genome to 28-fold coverage. The average read accuracy is 99.8%, matching the accuracy of the typical short read. De novo assembly of the reads yielded “a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of Q48 (99.998%),” they add.
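For readers less used to Q values, the relationship between a Phred-scaled quality and a percentage accuracy is the standard one: error probability = 10^(-Q/10). The short sketch below, with helper names of our own choosing, converts in both directions so the figures quoted above can be checked.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to accuracy as a fraction between 0 and 1."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert accuracy (a fraction below 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Assembly concordance of Q48, and per-read accuracy of 99.8%, as quoted in the text.
print(f"Q48 is {phred_to_accuracy(48) * 100:.4f}% accurate")
print(f"99.8% accuracy is about Q{accuracy_to_phred(0.998):.1f}")
```

Every 10 points on the Phred scale is a tenfold drop in error rate, so Q48 corresponds to fewer than two errors per 100,000 bases.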
The team also interrogated a broad range of variants and performed phasing. “We analyze the CCS reads to call SNVs, indels, and structural variants; to phase variants into haplotype blocks; and to de novo assemble the HG002 genome,” the scientists report. “The CCS performance for SNV and indel calling rivals that of the commonly-used pairing of BWA and GATK on 30-fold short-read coverage.” Detection of variants was consistently strong for SNVs (99.91%), indels (95.98%), and structural variants (95.99%). As the authors note, “Nearly all (99.6%) variants are phased into haplotypes, which further improves variant detection.”
Beyond the remarkable quality results from this protocol, the scientists note a number of other advantages with this approach. These include easier sample prep, since there is no need for ultra-long genomic DNA, reduced computational time, and the ability to use familiar tools like GATK designed for accurate reads.
Future improvements to the method — such as faster generation of HiFi reads from subreads and increasing the number of reads produced in a run — should “facilitate rapid, population-scale analysis of full genomes to improve human health,” the authors write. The HiFi protocol also will have application outside of human genomics, with utility in metagenomics as well as plant and animal genome assembly.
According to many, PacBio is the new “gold standard” in microbial sequencing. Chief Scientific Officer Jonas Korlach notes that its ability to simultaneously provide long sequencing reads (genome contiguity), high consensus accuracy (genome accuracy), minimal sequence bias (genome completeness), and methylation detection (bacterial epigenome) has made it the technology of choice for users who need to reliably produce high quality genomes.
In a presentation for the virtual Microbiology & Immunology conference, Korlach highlighted PacBio’s strengths in the field, including multiplexed microbial sequencing on the Sequel System and full-length bacterial RNA sequencing.
Microbial de novo genome assembly
Multiple bacterial genomes can now be sequenced in one SMRT Cell. Not only are bacterial chromosomes revealed, but plasmids are also assembled. This is significant because these mobile genetic elements often carry the genes that drive virulence, drug resistance, and other traits that are important to understand microbial biology, Korlach said.
Getting the full picture of bacterial plasmids allows scientists to track the transmission and follow hospital-associated infection outbreaks, for instance. It also gives insights into the evolution of bacterial strains, both through single nucleotide changes and larger structural rearrangements.
“It is now possible for the first time to really understand the evolution and generation of some of these very dangerous superbugs that are resistant to all known antibiotics,” Korlach said, citing a study of 16 Klebsiella pneumoniae isolates collected in German hospitals.
Korlach also referenced a study that solved a decades-old mystery about Spiroplasma poulsonii, a type of symbiotic bacteria that manipulate host reproduction to spread in a population by selectively killing off the sons of infected female hosts during development. A team of Swiss researchers identified the toxin responsible, located on a plasmid, using SMRT Sequencing.
“This thorough understanding of the functional ramifications was really only provided by PacBio sequencing, and forms the foundation now for thinking about controlling insect populations.”
Other advantages of PacBio sequencing? Every bacterium has multiple copies of the 16S housekeeping gene, and PacBio is the only technology able to produce multiple distinct 16S sequences per bacterial genome, Korlach said. PacBio sequencing can also overcome long-standing challenges in assembling yeast, fungi and other eukaryotic genomes connected to infectious disease.
The malaria-causing agent Plasmodium falciparum, for instance, was nearly impossible to sequence on traditional platforms because its genome is highly repetitive and AT-rich. But PacBio long-read sequencing enabled complete telomere-to-telomere de novo assembly of the genome. Another recent publication highlighted the utility of PacBio sequencing for closing yeast genomes, revealing that differently named food processing and pathogenic strains of yeast are in fact the same species.
Characterizing the bacterial epigenome and transcriptome
One of the three pillars of understanding bacterial biology is the epigenome, which PacBio is uniquely suited to characterize without any additional sequencing beyond what is needed for genome assembly. Characterizing methylomes and methylation status can shed light on how some bacteria can switch pathogenicity from not-so-dangerous to fatal, or the changes that occur when free-living bacteria associate with a host and become symbiotic instead.
Researchers at The Forsyth Institute have created a method for transforming formerly resistant bacterial targets by leveraging the epigenetic fingerprinting information revealed by SMRT Sequencing.
Their SyngenicDNA stealth-based evasion of restriction-modification barriers during bacterial genetic engineering could unlock myriad applications in basic research, industrial biology, synthetic biology, and translational science, Korlach said.
Finally, Korlach referenced a collaboration with New England Biolabs to adapt the Iso-Seq full-length RNA Sequencing protocol to bacterial transcripts, providing the first detailed look at the dynamic structure and regulation of operon transcription.
“It was an unprecedented view of the complexity of the bacterial transcriptome and bacterial gene expression that was previously hidden,” Korlach said.
Korlach also delivered a warning: Relying on existing NCBI database entries to establish what you have in your lab may be risky.
“A number of the so-called complete genomes in the NCBI database that were done a few years ago contain, in some cases, quite dramatic errors,” he said.
A survey of 20 “reference strains” in GenBank found that 30% had significant structural alterations when re-sequenced and assembled from scratch using PacBio technology, either due to previous sequencing errors or bacterial changes while in the repository.
“If you really want to be sure about what you have in your tube or on your plate, I suggest doing the genome from scratch,” Korlach said.
Luckily, this has become easier and more accessible with microbial multiplexing. Korlach outlined some of the nuts and bolts of the new microbial multiplexing kit, released in June, that allows researchers to pool numerous bacteria (of around 30 Mb) in one sample and sequence it in one reaction. He also pointed people to a microbial multiplexing calculator to help ensure equimolar pooling and even sequencing representation across all pooled samples despite different genome sizes, shear sizes, and sample concentrations.
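To give a feel for the arithmetic behind equimolar pooling, here is a minimal sketch. It is not the PacBio calculator: the class and function names, the sample values, and the 660 g/mol average mass per base pair of double-stranded DNA are our own assumptions. It covers only the equimolar step, using each library's concentration and mean fragment (shear) size.

```python
from dataclasses import dataclass

# Assumed average molar mass of one double-stranded DNA base pair (g/mol).
AVG_BP_MASS_G_PER_MOL = 660.0


@dataclass
class Library:
    name: str
    conc_ng_per_ul: float    # measured concentration
    mean_fragment_bp: float  # mean sheared fragment length


def molarity_nm(lib):
    """Molar concentration in nM from mass concentration and fragment length."""
    return lib.conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * lib.mean_fragment_bp)


def equimolar_volumes(libs, fmol_per_sample):
    """Volume (in microliters) of each library that contains the same number of molecules."""
    # 1 nM is the same as 1 fmol per microliter, so volume = fmol / nM.
    return {lib.name: fmol_per_sample / molarity_nm(lib) for lib in libs}


# Hypothetical libraries with different concentrations and shear sizes.
libs = [
    Library("isolate_A", conc_ng_per_ul=66.0, mean_fragment_bp=10_000),
    Library("isolate_B", conc_ng_per_ul=33.0, mean_fragment_bp=12_000),
]
for name, volume in equimolar_volumes(libs, fmol_per_sample=20.0).items():
    print(f"{name}: {volume:.2f} uL")
```

Balancing coverage across different genome sizes would need a further weighting step, which this sketch deliberately leaves out.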
And those interested in attending and presenting at ASM Microbe in San Francisco, June 20-24, should note that the deadline for abstract submissions has been extended to Jan. 25, 2019.
Scientists in Japan report using the unique properties of SMRT Sequencing to detect a structural variant (SV) responsible for a hereditary form of epilepsy. The 4.6 kb intronic repeat insertion was found from low-coverage whole genome sequence data, leading the team to suggest that this approach could be useful for determining the genetic mechanisms behind many unexplained diseases.
“Detecting a long insertion variant in SAMD12 by SMRT sequencing: implications of long-read whole-genome sequencing for repeat expansion diseases” comes from lead author Takeshi Mizuguchi, senior author Satoko Miyatake, and collaborators at Yokohama City University and the University of Occupational and Environmental Health School of Medicine. The Journal of Human Genetics paper describes how the scientists turned to long-read SMRT Sequencing after finding that short-read platforms were not well-suited to detecting large, challenging SVs. “Many patients with conditions for which the genetic cause is unknown are still encountered, suggesting that certain types of pathogenic variation evade detection by the currently available short-read technology,” the authors note. Because long-read sequencing can now routinely produce reads of 10 kb or more, they add, this technology “may pave the way for the detection of unprecedented SVs as well as repeat expansions.”
For this project, scientists worked with a Japanese family affected by benign adult familial myoclonus epilepsy, or BAFME, which generally manifests in adulthood. Previous studies had used linkage analysis to identify four loci in the family associated with the disease. The team used whole genome SMRT Sequencing to analyze one affected family member as well as three healthy controls.
Using pbsv, they identified 9,138 insertions and 6,498 deletions in the affected individual, of which 2,420 insertions and 1,086 deletions were not seen in the unaffected family members. That included six SVs in the linked SAMD12 region of interest, with a 4,661 bp insertion identified as most likely pathogenic. “The insertion was a novel sequence, rather than a tandem duplication,” the scientists report. “A total of 95.41% was found to be a low-complexity sequence.”
The team suggests that this approach could provide an unbiased means of detecting pathogenic SVs. “These results indicate that long-read WGS is potentially useful for evaluating all of the known SVs in a genome and identifying new disease-causing SVs in combination with other genetic methods to resolve the genetic causes of currently unexplained diseases,” they report.
What has four legs, lots of fat and fur, and will possibly help uncover novel mechanisms to combat diabetes?
If humans were to undergo regular, extended cycles of weight gain and inactivity, they’d likely end up with obesity, muscle atrophy, or type 2 diabetes. But grizzly bears experience no ill effects from their annual fat gain and sedentary hibernation. Somehow they are able to switch their insulin resistance between seasons, and researchers at Washington State University are hoping to figure out how, with possible therapeutic value for humans.
We’re proud to support this outstanding research by awarding graduate student Shawn Trojahn and Associate Professor Joanna Kelley the 2018 Plant and Animal SMRT Grant. We recently caught up with them to learn more about their research project and the bears that make it possible.
Why grizzly bears?
Well, we have access to an incredibly unique resource, the WSU Bear Research, Education, and Conservation Center, the only dedicated research populations of grizzly bears. When our lab first came to WSU five years ago, we became interested in the studies being done there, in fields from nutrition to physiology.
Scientists are really starting to appreciate hibernators.
They do some pretty unusual things and they do them well.
We could learn a lot from them.
A lot of the phenotypes that we see in hibernating bears could give us insight into genetic mechanisms that might be relevant for human diseases too. They gain and lose a lot of fat. They have insulin insensitivity during hibernation, which is what happens in type 2 diabetes, but they reverse the insensitivity in the active season.
We hypothesize that these reversible states are achieved in grizzly bears through differential expression of transcript isoforms, possibly with human homologs. Preliminary evidence from proteomic work on hibernating and non-hibernating bear serum supports this hypothesis, as there is no change in the identity of proteins present, but peptides differ between seasons.
What does the project entail?
We plan to compare full-length isoforms between hibernating and active bears in three metabolically active tissues: skeletal muscle, liver, and adipose.
We will also collect blood samples at different stages along the annual cycle, from six bears who have been trained from birth to take part in approved research.
The bears are pretty amazing. They can respond to cues, and present their paws for inspection, making it easy to take blood draws without the need for sedation.
How will PacBio Sequencing support this project?
SMRT Sequencing with Iso-Seq analysis is perfect for this work, as it will allow us to identify the full-length isoforms that are differentially expressed between seasons.
We’ve used SMRT Sequencing for genome assembly of organisms in extreme environments, such as polar fishes, and we’ve been following the development of the Iso-Seq method since its introduction to the field. We’re extremely excited to finally have the opportunity to try it.
What do you expect to find?
No one has done this before, so we have no idea what we will find. But we’re going to extract as much information as possible about alternative splicing, binding sites, regulators and protein translation. And we’re really excited to see what other questions might arise as a result. This will open up so many opportunities—the sky’s the limit.
We are so excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs, including the 2019 Plant and Animal Science SMRT Grant, opening February 1, with video proposals. So start practicing your best YouTube-worthy pitch and stay tuned for more information.
Thank you to our co-sponsor, the University of Delaware Sequencing & Genotyping Center, for supporting the 2018 Plant and Animal Science SMRT Grant Program.
Scientists were certainly sequencing with confidence in 2018, as evidenced by the number of significant and wide-ranging advancements made using SMRT Sequencing technology, several of which made the cover of high-impact journals. As the year draws to a close, we have taken this opportunity to reflect on the many achievements made by members of our community, from newly sequenced plant and animal species to human disease breakthroughs that even captivated the popular press.
“It’s been a phenomenal year for science. We are proud of our partners and honored that our technology is helping to drive such discovery across all fields of the life sciences.”
Jonas Korlach, Chief Scientific Officer
Human Biomedical Research
Our understanding of human health and disease increased with new population-specific genomes, breast cancer cell line variants, on-target mutagenesis of CRISPR-Cas9 editing, and insights into genomic cDNAs and their potential role in Alzheimer’s disease.
- Work from A. Ameur et al. from Uppsala University, “De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data,” featured in Genes and on our blog
- “Complex Rearrangements and Oncogene Amplifications Revealed by Long-read DNA and RNA Sequencing of a Breast Cancer Cell Line” by M. Nattestad et al. garnered lots of reads in Genome Research and on our blog
- A study by scientists at the Sanger Institute, “Repair of Double-Strand Breaks Induced by CRISPR-Cas9 Leads to Large Deletions and Complex Rearrangements” in Nature Biotechnology (and our blog) added to the gene editing debate
- Nature featured a “remarkable phenomenon” observed by Lee et al. in “Somatic APP Gene Recombination in Alzheimer’s Disease and Normal Neurons.”
Plant & Animal Sciences
Research in plant and animal genomes uncovered exciting biology, including a tiny animal with a huge genome that gave us a glimpse into tissue regeneration and great apes that are helping us better understand human evolution. In addition to big achievements from international consortiums such as the Vertebrate Genomes Project, the Earth BioGenome Project, and the Sanger 25, we also shared in the sweet success of the sugarcane genome and explored architectural differences between maize and sorghum.
- Nature’s report on “The Axolotl Genome and the Evolution of Key Tissue Formation Regulators” by S. Nowoshilow and colleagues around the world captured the attention of the popular press, and our blog
- Kronenberg’s “High-Resolution Comparative Analysis of Great Ape Genomes” graced the cover of Science, and our blog
- “Allele-defined Genome of The Autopolyploid Sugarcane Saccharum spontaneum L” by J. Zhang et al. was also a cover star, in Nature Genetics and our blog
- Our RNA sequencing capabilities featuring the Iso-Seq method were nicely showcased in “A Comparative Transcriptional Landscape of Maize and Sorghum Obtained by Single-molecule Sequencing” by B. Wang et al. in Genome Research
Microbiology & Infectious Disease
From selfish symbiotic bacteria to HIV variants in the brain, we were enthralled by new views into the microbial world. In addition to the release of 3,000 bacterial genomes by the UK’s National Collection of Type Cultures (NCTC), scientists also contributed new methods to distinguish between microbial genomes.
- RL Brese et al. made important inroads in understanding why patients with HIV develop neurological disorders in their Journal of Neurovirology paper, “Ultradeep Single-molecule Real-time Sequencing of HIV Envelope Reveals Complete Compartmentalization of Highly Macrophage-Tropic R5 Proviral Variants in Brain and CXCR4-Using Variants in Immune and Peripheral Tissues,” also featured on our blog
- AP Douglass et al. made a concerning discovery in their PLoS Pathogens paper “Population Genomics Shows No Distinction Between Pathogenic Candida krusei and Environmental Pichia kudriavzevii: One Species, Four Names,” also featured on our blog
- Nature featured a discovery by Swiss researchers, with potential implications for insect control in other species, “Male-Killing Toxin in a Bacterial Symbiont of Drosophila,” also featured on our blog
- Nature Biotechnology described a new method of sequence binning by researchers at Icahn School of Medicine at Mount Sinai, “Metagenomic Binning and Association of Plasmids with Bacterial Host Genomes Using DNA Methylation,” also featured on our blog
Did we miss one of your favorite publications of 2018? Tweet us @PacBio, using #PoweredbyPacBio. And check out our searchable publications database for more than 1500 examples of outstanding SMRT Science from 2018.
In the rapidly evolving world of DNA sequencing, the community is often focused on what’s new and what’s next. There’s not much opportunity for retrospection. But two recent articles offer an insightful look at the history of SMRT Sequencing technology, from the time it was just a gleam in the eye of some Cornell University scientists to how it works and some exciting new applications.
At Technology Networks, reporter Ruairi MacKenzie writes about the scientific beginnings of SMRT Sequencing with memories from PacBio CSO Jonas Korlach, one of the inventors of the technology.
“Korlach concluded that if only you could see DNA polymerase doing its incredible, evolution-assisted work, then you could simply let the enzyme do the heavy lifting and take notes on its performance to create a top-quality sequencing technique,” MacKenzie reports. He describes the powerful collaboration of Korlach, Watt Webb, Steven Turner, and Harold Craighead in the project that would ultimately begin the path to PacBio.
“This started off a series of experiments that aimed to create a microscope that was, in effect, a thousand times more powerful than any currently available,” the article continues. “The eventual product of a number of attempts was the zero-mode waveguide, the foundation of PacBio’s sequencing technology.”
The article also covers the various optimization experiments that ensued, such as figuring out how to create fluorescent tags that wouldn’t decrease sequencing efficiency. If you ever wondered where SMRT Sequencing got its start, this piece provides the answer.
A Wall Street Journal article covers the past 15 years of DNA sequencing, from the public/private competition to sequence the first human genome all the way to some of the most recent and compelling scientific projects being powered by SMRT Sequencing.
The article reports on potential clinical applications for long-read sequencing, such as helping to diagnose rare diseases. “Mr. Hunkapiller says PacBio’s machines can help by detecting what are called ‘structural variants,’ changes to DNA that may involve hundreds or even thousands of base pairs, making them difficult to pick up with earlier technology,” Kyle Peterson writes. “Last year a group at Stanford was able to diagnose a young man whose heart had repeatedly grown benign tumors. One of his genes on Chromosome 17 was missing 2,200 base pairs.”
The article also describes other interesting recent applications of SMRT Sequencing platforms, including the largest known genome from the tiny Mexican salamander, the 100 ants project, and bat longevity.
It’s a great time of year to reflect on the history of the technology and how scientists are applying it today, and we encourage you to check out both articles on your next coffee break.
Advances in personalized medicine — whether it’s the discovery of a new pathogenic variant or a success story about a patient treated with a tailored therapy — seem to be almost a daily occurrence. That’s why we’re particularly excited to attend the Precision Medicine World Conference (PMWC), co-hosted by Stanford, UCSF, Duke, Johns Hopkins & U. of Michigan, taking place January 20-23 at the Santa Clara Convention Center. The meeting brings together thought-leaders of business, government, healthcare-delivery, research and technology to share the latest developments, challenges, and triumphs in the field.
PMWC is well known for giving out prestigious awards, and this year’s slate of honorees is as impressive as ever. This year’s Luminary Awards, given for recent contributions to accelerate personalized medicine, will go to Carl June at the University of Pennsylvania for his CAR-T work, Genetic Alliance’s Sharon Terry for her efforts to empower individuals with their own health data, and Feng Zhang at the Broad Institute for his development of optogenetics and CRISPR.
The Pioneer Award, which honors “rare individuals who presaged the advent of personalized medicine when less evolved technology and encouragement from peers existed,” will be given to George Yancopoulos at Regeneron Pharmaceuticals.
If you’ll be attending the meeting, don’t miss our own Lori Aro, senior director of clinical genomics, who will give a talk entitled “Sequencing with confidence: Highly accurate single-molecule long reads” on January 22nd at 8:30 am. We’re also looking forward to talks from Andrew Carroll at Google AI and Randy Scott from Invitae, among many others.
There’s still time to register for PMWC 2019 and they’re offering our blog readers a 10% discount until December 31. We hope to see you at the meeting!
You may be more likely to get five gold rings or three French hens than two Turtle doves this Christmas. The subject of the famous holiday carol is in precipitous decline across Europe, with 94 percent of Turtle doves lost since 1995, and fewer than 5,000 breeding pairs left in the UK.
In an attempt to save the species, geneticists at the Wellcome Sanger Institute identified it as a priority species to be sequenced as part of a year-long 25th anniversary project.
Collaborators at the University of Lincoln sent samples (collected from live birds during routine health checks) to the Sanger Institute. The sequencing teams extracted DNA from the samples and used SMRT Sequencing technology to generate the first reference genome for Turtle doves (Streptopelia turtur).
The results, announced today and set for release in early 2019, will provide a genetic reference for determining effective population sizes and establishing breeding programs in efforts to help conserve the threatened bird species, which has been listed as vulnerable on the International Union for Conservation of Nature (IUCN) Red List.
Jenny Dunn, from the University of Lincoln, said: “To give Turtle doves the best chance of survival in the future, we need to first understand the pressures that are affecting their population decline. The Turtle dove genome will give insights into how diseases and limited food resources impact on their health and will aid practical conservation efforts to maximize the genetic diversity of introduced populations.”
On the course to discovery
Scientists also hope to solve another mystery: how some migrating birds “see” the Earth’s magnetic fields for navigation.
To do this, Sanger and collaborators created a high-quality genome of the European robin (Erithacus rubecula), completed to the “platinum standard” set by the Vertebrate Genomes Project (contig N50 in excess of 1 Mb and scaffold N50 above 10 Mb).
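For readers less familiar with the metric, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below computes it and checks the two thresholds quoted above. The function names and the toy lengths are illustrative assumptions, not part of any Sanger or PacBio pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum
        # to at least half of the total assembly length.
        if 2 * running >= total:
            return length


def meets_thresholds(contig_lengths, scaffold_lengths,
                     min_contig_n50=1_000_000,
                     min_scaffold_n50=10_000_000):
    """Check the thresholds quoted in the text: contig N50 above 1 Mb
    and scaffold N50 above 10 Mb. Hypothetical helper for illustration."""
    return (n50(contig_lengths) > min_contig_n50
            and n50(scaffold_lengths) > min_scaffold_n50)


# Toy numbers, invented purely to exercise the functions.
toy_contigs = [3_000_000, 2_000_000, 500_000]
toy_scaffolds = [40_000_000, 20_000_000]
print(n50(toy_contigs), meets_thresholds(toy_contigs, toy_scaffolds))
```

Because N50 is weighted by length, a handful of very long contigs can lift it even when many short contigs remain, which is why it is usually reported alongside completeness measures.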
European robins live throughout Europe, Russia and western Siberia. While most British robins stay in the UK over winter, some migrate to southern Europe for the warmer climate. At the same time, migrant robins from Scandinavia, continental Europe and Russia head to the UK to avoid the harsh weather back home.
“Birds can use the Earth’s magnetic field as a reference for orientation during the migratory journeys, and the magnetic compass in birds was first described in a robin,” said Miriam Liedvogel from the Max Planck Institute for Evolutionary Biology in Plön, Germany. “The European robin genome will allow us to identify what’s driving migration in birds, and understand the variability of migration in other bird species as well.”
The two birds join the Golden Eagle as the first of 25 UK species to have their genetic code sequenced and assembled as part of the Sanger Institute’s 25 Genomes Project, which also includes species such as grey and red squirrels, blackberry and brown trout.
January 18, 2019
This paper is now available at Genes.
December 19, 2018
High-quality reference and de novo genomes have been celebrated by geneticists, population biologists and conservationists alike, but they have been a dream deferred for entomologists and others grappling with limited DNA samples, because library preparation previously required relatively large amounts of DNA (~5 μg for the standard library protocol).
A new low-input protocol now makes it possible to create high-quality de novo genome assemblies from just 100 ng of starting genomic DNA, without the need for time-consuming inbreeding or pooling strategies. The targeted release date for the protocol is February 2019.
The protocol, developed as a collaboration by scientists at the Wellcome Sanger Institute and PacBio, was used to assemble the genome of an Anopheles coluzzii mosquito with unamplified DNA from a single individual female insect.
As described in a bioRxiv preprint, Sarah B. Kingan, Haynes Heaton, et al. used a modified SMRTbell library construction protocol that omits DNA shearing and size selection. Skipping these steps allows lower input amounts, because shearing and clean-up steps typically lead to loss of DNA material.
“This new low-input approach puts PacBio-based assemblies in reach for small and highly heterozygous organisms that comprise much of the diversity of life,” said co-corresponding author Jonas Korlach, our chief scientific officer.
The sample was run on the Sequel System with the latest v6.0 software, followed by de novo genome assembly with FALCON-Unzip, resulting in a highly continuous (contig N50 3.5 Mb) and complete (more than 98% of conserved genes were present and full-length) genome assembly.
About a third of the new de novo genome is haplotype-resolved and represented as two separate sequences for the two alleles, providing additional information about the extent and structure of heterozygosity that was not available in previous assemblies, all of which were constructed from many pooled individuals.
“The ability to generate high-quality genomes from single individuals greatly simplifies the assembly process and interpretation, and will allow far clearer lineage and evolutionary conclusions from the sequencing of members of different populations and species,” the authors state.
The first Anopheles gambiae genome, published in 2002, was created using BACs and Sanger sequencing. Further work over the years to order and orient contigs improved this reference and to date, AgamP4 remains the highest quality Anopheles genome among the 21 that have now been sequenced. However, AgamP4 still has 6,302 gaps of Ns in the primary chromosome scaffolds and a large bin of unplaced contigs known as the “UNKN” (unknown) chromosome.
The Sanger/PacBio single-insect assembly was able to place 667 (>90%) of the genes on the UNKN contigs into their appropriate chromosomal contexts.
The assembly’s “gap-less mega-base scale contiguity” will also provide insights into promoters, enhancers, repeat elements, large-scale structural variation relative to other species, and many other aspects relevant to functional and comparative genomics questions, the authors state.
The protocol’s potential could also extend to other areas with typically low DNA input regimes, such as metagenomic community characterizations of small biofilms, DNA isolated from needle biopsy samples, and minimization of amplification cycles for targeted or single-cell sequencing applications, the authors add.
Scientists in California recently released exciting results that could offer an entirely new approach to treating the most common form of Alzheimer’s disease. The project, which was reported in a Nature publication, made extensive use of SMRT Sequencing data, drawing on both targeted sequencing and some previously released full-length RNA sequencing data.
“Somatic APP gene recombination in Alzheimer’s disease and normal neurons” comes from lead author Ming-Hsiang Lee, senior author Jerold Chun, and collaborators at the Sanford Burnham Prebys Medical Discovery Institute and the University of California, San Diego. The team aimed to determine whether somatic gene recombination, which is used throughout the genome to boost molecular diversity but has never been found in the brain, could be linked to Alzheimer’s disease.
Using an impressive array of novel and cutting-edge technologies, the scientists found evidence of significant recombination in the APP gene, which encodes amyloid precursor protein in neurons and has been associated with Alzheimer’s. They focused on APP because it has previously been shown to harbor mosaic copy number variants, with higher numbers in patients with sporadic Alzheimer’s disease (SAD). They found that the APP gene harbored thousands of variant genomic cDNAs (gencDNAs) that occurred mosaically in human neurons. The gencDNAs lacked introns and ranged from full-length cDNA copies of expressed, brain-specific RNA splice variants to myriad smaller forms that contained intra-exonic junctions, insertions, deletions, and/or single nucleotide variations.
But past attempts to find gene recombination in APP had failed. “Interrogation of APP genomic loci (about 0.3 Mb) using low-depth, short-read single-cell sequencing capable of detecting CNVs produced negative results that were complicated by resolution limitations,” the authors report. “We therefore developed an alternative strategy focused on APP in small cell populations, using nine distinct methodologies.”
Among those approaches was the use of SMRT Sequencing of PCR amplicons to assess the diversity of gencDNA sequences. The authors used small neural populations from five individuals with SAD (149 reactions from 96,434 nuclei) and five healthy brains (244 reactions from 162,248 nuclei). The authors generated CCS data and used a cut-off that provided them with ultra-high accuracy reads (99.999999% accuracy), and report that these SMRT Sequencing results were “comparable in fidelity to Sanger sequencing.”
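To put that accuracy figure in more familiar terms, read accuracy is often expressed on the Phred scale, where Q = -10 · log10(error probability). The sketch below applies that standard conversion. Expressing the cut-off as a Phred value is our illustration, not wording taken from the paper, and the function names are hypothetical.

```python
import math


def phred_from_error(error_probability):
    """Convert a per-base error probability to a Phred quality value."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_from_phred(quality):
    """Convert a Phred quality value back to an error probability."""
    return 10 ** (-quality / 10)


# 99.999999% accuracy corresponds to one expected error in 10^8 bases.
# The error rate is passed in directly to avoid floating-point loss
# from computing 1 - 0.99999999.
print(round(phred_from_error(1e-8)))
```

Under this convention, 99.999999% accuracy works out to roughly Q80, compared with Q30 for 99.9% accuracy.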
They identified 6,299 unique sequences — including 45 different intra-exonic junctions — in neural nuclei from the brains of individuals with SAD, and 1,084 unique sequences — including 20 intra-exonic junctions — in neuronal nuclei from the non-diseased brains. “Critically, both qualitative and quantitative differences in the sequences of gencDNA variants distinguished the brains of individuals with SAD from healthy brains,” the authors note. “Distinctions included gencDNAs with novel intra-exonic junctions and SNVs, which were far more prevalent in the brains of individuals with SAD.”
Because generating genomic cDNAs requires reverse transcriptase, the scientists also speculate that existing anti-retroviral therapies used for patients with HIV might inhibit the progression of SAD. They note that HIV patients who take such therapies and are older than 65 appear less likely to develop Alzheimer’s disease. “If confirmed, this observation would suggest the immediate use of FDA-approved [combined anti-retroviral therapy]” for patients with this form of Alzheimer’s, they write.
The team concludes with the idea that the recombination findings are unlikely to be specific to the one gene they chose to study. Additional investigation should be considered for other genes active in the brain using the types of technologies that made such a difference in this project.
It took nearly 20 years until the technology was right, and five years of hard graft by more than 100 scientists from 16 institutions, but the result was worth it, according to University of Illinois plant biology professor Ray Ming.
Ming is one of several authors of a paper, featured on the cover of Nature Genetics, that reports the assembly of a 3.13 Gb reference genome of the incredibly complex autopolyploid sugarcane Saccharum spontaneum L. He said he dreamed about having a reference genome for sugarcane while working on sugarcane genome mapping in the late 1990s.
But sequencing technology was not ready to handle large autopolyploid genomes until 2015, when the throughput, read length, and cost of long-read SMRT Sequencing by PacBio became competitive enough, he said.
The Saccharum spontaneum AP85-441 contig-level assembly incorporated sequencing data from a mixture of sequencing technologies, including BAC pools sequenced with short reads and whole-genome shotgun SMRT Sequencing.
Additional mapping using Hi-C allowed the team to dissect genetic information from all four haplotypes that make up the sugarcane hybrid currently used in the field, which combines Saccharum officinarum (desirable for its high sugar content) and Saccharum spontaneum (desirable for its hardiness and disease resistance).
Evolutionary changes which resulted in double duplication of the hybrid genome created additional technical challenges, quadrupling the size of the genome and introducing repeat elements that were difficult to parse out among the four haplotypes.
“By combining long sequence reads and the Hi-C physical map, we assembled an autotetraploid genome into 32 chromosomes and realized our goal of allele-specific annotation among homologous chromosomes,” Ming said.
The new assembly has already provided insights into the hybrid’s evolution as well as characteristics of its parent lines, information that could help breeders to mine effective alleles for disease resistance and other desirable traits for use in future molecular breeding efforts.
“This reference genome offers substantial new knowledge and unprecedented genomic resources for sugarcane breeders and researchers to mine disease resistance and other alleles in rearranged chromosomes from historic hybrid cultivars, and to track them in breeding populations to shorten the 13-year breeding cycle,” the authors wrote.