This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
As we geared up for the launch of our new Sequel II System, we had the good fortune of working closely with several expert customers in an early access program. Recently, three of those customers reported on their experience with the new sequencing system in a webinar. In this blog series, we’ll be summarizing each speaker’s presentation, and the full recording is available to view.
First up was Kiran Garimella (@KiranGarimella), a senior computational scientist at the Broad Institute who focused on the use of HiFi reads, which are long (>10 kb) and accurate (>99%) sequences produced by the Sequel II System with circular consensus sequencing (CCS). Garimella and the team at the Broad Institute used the early access program to sequence trios from the Human Genome Structural Variation Consortium (HGSVC), clinical samples, and tumor/normal pairs.
Garimella reported average raw yields of 300 Gb per SMRT Cell 8M across 32 runs. Using a cloud-based pipeline he developed, the Broad Institute processed raw reads into HiFi reads and variant calls in 1-2 days. The HiFi reads, which averaged 10 subread passes, achieved quality scores from Q23 to Q25, comparable to the Q24 to Q25 of recent short-read data from Platinum Genomes. Garimella called the level of accuracy “remarkable” for long reads. “We’re very impressed by the PacBio Sequel II data,” Garimella added.
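To put those quality scores in perspective: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means 99% per-base accuracy and Q23 to Q25 works out to roughly 99.5% to 99.7%. The short Python sketch below is purely illustrative (the function names are our own and not part of any PacBio tool) and shows the conversion in both directions.

```python
import math


def phred_to_error(q: float) -> float:
    """Return the error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Return the Phred quality score implied by an error probability."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Scores quoted in the presentation, plus Q20 as a familiar reference point.
    for q in (20, 23, 25):
        accuracy = 1 - phred_to_error(q)
        print(f"Q{q}: accuracy = {accuracy:.2%}")
```

Because the scale is logarithmic, each additional 10 points of Q means a tenfold drop in the error rate.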
Garimella used the HiFi reads to look at structural variation and haplotype phasing, both of which have been difficult to resolve with short reads. He showed an example of a heterozygous structural variant in the well-characterized NA12878 genome that is clear in HiFi reads but difficult to detect with short reads. He also showed an example of variant calling in complex loci like the HLA genes. This is “why the Broad is so excited about long-read sequencing,” he added.
The NA12878 HiFi dataset, and others from the HGSVC, will be released publicly to help with establishing ground truth benchmarks for structural variation.
For more details, watch Garimella’s full presentation:
Single Molecule, Real-Time (SMRT) Sequencing continues to get smarter and more powerful, with the recent launch of the Sequel II System further increasing the capabilities and efficiency of PacBio long-read DNA and RNA sequencing technology. In a special issue devoted entirely to the technology in the MDPI open access journal Genes, guest editors Adam Ameur of Uppsala University and Matthew S. Hestand of the Cincinnati Children’s Hospital Medical Center present eight articles highlighting research conducted using SMRT Sequencing.
As this special issue demonstrates, the benefits of SMRT Sequencing to many different areas of research are becoming evident, not only in basic science, but also in more applied areas such as agricultural, environmental, and medical research. Examples from each of these areas are included in this issue.
Maximizing Minimum Sample Sizes
A new mosquito genome assembly generated via a collaboration between PacBio and the Wellcome Sanger Institute highlighted the capabilities of one of the latest SMRT Sequencing advancements: a new low DNA input protocol. The Sanger scientists used the new protocol to create a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA.
Protecting the Fungus Among Us
Scientists in China and Michigan used SMRT Sequencing to elucidate the medicinal properties of Gloeostereum incarnatum, a precious edible mushroom that is widely grown in Asia. They assembled a high-quality genome of the fungus — the first complete genome to be sequenced in the family Cyphellaceae — and identified gene clusters associated with terpenoid and polysaccharide biosynthesis.
Another team from Jilin Agricultural University and Shenyang Agricultural University in China also investigated edible mushrooms — and a mycoparasite that threatens them. They assembled a high-quality genome of Cladobotryum protrusum, which causes cobweb disease on cultivated mushrooms. They found that the C. protrusum genome, the first complete genome to be sequenced in the genus Cladobotryum, encodes a large and diverse set of genes involved in pathogen–host interactions, mycotoxins, and pigments, and harbors arrays of genes with the potential to produce bioactive secondary metabolites and stress response-related proteins that are significant for adaptation to hostile environments.
Improving Protein Production… via Insects
A new genome assembly of the cabbage looper moth, Trichoplusia ni, may have implications for large-scale genome engineering. As reported by scientists from the National Cancer Institute’s Frederick National Laboratory for Cancer Research, insect cell protein production has emerged as a viable alternative to bacterial and mammalian cells for the production of therapeutically relevant proteins, with several approved vaccines generated in baculovirus-infected insect cells. However, improved protein production using these lepidopteran hosts has been hindered by limited genomic data. By performing de novo genome assembly of the Trichoplusia ni-derived cell line Tni-FNL, the team hopes the reference will bolster future large-scale genome engineering work in recombinant protein production hosts.
Detecting Distinction in Bone Marrow Subpopulations
In a study led by Anne Deslattes Mays and Anton Wellstein from the Lombardi Comprehensive Cancer Center at Georgetown University, the transcriptomes of freshly harvested human bone marrow progenitor (lineage-negative) and differentiated (lineage-positive) cells were analyzed with SMRT full-length RNA sequencing. This Iso-Seq analysis revealed a ~5-fold higher number of transcript isoforms than previously detected and showed a distinct composition of individual transcript isoforms characteristic of bone marrow subpopulations. Check out an additional Q&A with Mays here.
Expanding Genetic Diversity in Human Dataset
Swedish scientists used SMRT Sequencing to expand the diversity of the human genome dataset. They performed de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual, and around 6 Mb of novel sequences (NS) shared with a Chinese personal genome. “Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data,” the authors wrote.
Solving Methylation Calling Challenges
Lastly, a team of bioinformaticians from Iowa, Ohio, and Tokyo presented a statistical solution for observing personal diploid methylomes and transcriptomes. As they report, characterizing the CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs) is challenging due to the scarcity of PHVs in personal genomes. While SMRT Sequencing is a promising avenue for addressing this challenge because it outputs long reads with CpG methylation information, phasing the CpG sites can still introduce errors. Their paper proposes a model that reduces the error rate to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity.
A preprint released this week from the Genome in a Bottle (GIAB) Consortium describes a benchmark set of structural variants (SVs), differences ≥50 bp, in the genome of a human male named HG002. The GIAB benchmark is the first to allow measuring precision (false positives) and recall (false negatives) of different approaches to detecting structural variants. The GIAB Consortium also developed a tool, Truvari, to support evaluation of variant call sets against the benchmark.
Earlier GIAB benchmarks, first released in 2014 and last updated in 2017, have led to enormous improvements in the quality and consistency of calling single-nucleotide and small indel variants. However, due to prior limits of DNA-sequencing technologies, the benchmark has not included SVs, which are fewer in number than small variants but in total cover more base pairs.
To extend the benchmark to SVs, the GIAB Consortium sequenced HG002 and his parents with short-, linked-, and long-read technologies; analyzed the reads with 26 different software variant callers; and integrated the different methods into a final set of 12,745 high-confidence SVs across 2.69 Gb of well-characterized “Tier 1” regions in the 3 Gb human genome. The high-confidence SVs match the expected size distribution for a human genome, with the number of variants decreasing with variant size except at the size of Alu (300 bp) and LINE1 (6 kb) repeats. The high-confidence SVs also show nearly perfect Mendelian consistency, with the genotype in HG002 being consistent with inheritance from his parents.
PacBio long reads, which provide high precision and recall for structural variants, were particularly important to the benchmark. GIAB required support from PacBio long reads for all of the high-confidence variants. Further, GIAB reports “many SVs only detectable with long reads [especially in tandem repeats]” and concludes “[t]hese results confirm the importance of long read data for comprehensive SV detection.”
If you would like to use the benchmark to evaluate how well you detect SVs, GIAB provides DNA reference material and datasets, including 32-fold coverage of accurate long reads from the PacBio Sequel II System. We also offer a tutorial on how to use the GIAB datasets and the SV benchmark to evaluate precision and recall.
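As a rough illustration of what such an evaluation boils down to, here is a minimal Python sketch that turns counts of true positives, false positives, and false negatives into precision, recall, and F1. The function name and the counts are hypothetical, chosen only for illustration; they are not results from the GIAB preprint or output from any specific tool.

```python
def sv_benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from benchmark comparison counts.

    tp: calls that match a benchmark variant
    fp: calls with no matching benchmark variant
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    metrics = sv_benchmark_metrics(tp=9000, fp=500, fn=1000)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Precision falls as false positives accumulate and recall falls as benchmark variants are missed, which is why a benchmark needs to measure both.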
In the future, the GIAB Consortium plans to extend the SV benchmark to the other genomes in its portfolio, namely HG001/NA12878 and HG005. They also plan to incorporate new data, such as highly accurate long HiFi reads from PacBio, to improve the quality and scope of the benchmark.
Two years ago, Carola Greve and colleagues at the Zoological Research Museum Alexander Koenig in Bonn, Germany, were seeking to #SeqtheSlug as part of the 2017 Plant and Animal SMRT Grant competition, and the popular project was a close runner-up. Greve didn’t give up on her quest to sequence the ‘solar-powered’ sea slug. We caught up with her recently at the SMRT Leiden Scientific Symposium, where her update on the sea slug project earned her a Best Poster award.
Why the sea slug?
Although Mollusca represents the second-largest animal phylum, with around 85,000 extant species, only 23 mollusc genomes are publicly available in the NCBI genome database, and when we started our project, no reference genome had been generated for any sacoglossan (algae-ingesting) mollusc. Some of these sacoglossan species are particularly interesting because of their ability to sequester chloroplasts from their food algae. These ‘stolen’ plastids, also known as kleptoplasts, are then stored in a functional state in the digestive gland cells of the slugs and presumably allow them to endure weeks or even months of starvation.
How do the slugs keep the chloroplasts, which continuously produce starch inside the slugs, active? No one knows! Up to now, this spectacular phenomenon, termed functional kleptoplasty, is unique among animals. We wanted to scour the genome for genes associated with this unique ability and to provide a valuable genomic resource for future genome-wide comparative analyses with organisms that have similar lifestyles, i.e., those that steal useful parts from their prey and incorporate them instead of digesting them.
How has the project progressed?
We created a mitochondrial genome for Plakobranchus cf. ocellatus (van Hasselt 1824), a sea slug species found in the Philippines, using Illumina short reads. Another team (Huimin Cai, et al. 2019) beat us to a draft genome assembly earlier this year, for the species Elysia chlorotica, using a hybrid Illumina/PacBio approach. Their genome assembly comprised 9,989 scaffolds, with a total length of 557 Mb, and their annotation identified 176 Mb (32.6%) of repetitive sequences and 24,980 protein-coding genes.
We’d like to improve the genome using PacBio, and we are now working with Kornelia Neveling, a molecular geneticist at Genome Diagnostics Nijmegen, Radboud University Medical Center, to get some good reads from our P. ocellatus samples. Once we get the results, we want to compare the genomes of the two slug species, as well as examine the secondary metabolites that they produce, such as polypropionates, which might be interesting for pharmaceutical applications.
What have you learned so far?
We’ve learned that it’s really difficult to extract high molecular weight DNA from these slimy creatures, and there seems to be something in the process that is inhibiting the enzymes involved in sequencing. The challenges have helped me in my new role, however. I am now at the Senckenberg Research Institute and Natural History Museum in Frankfurt, where I am leading the Translational Biodiversity Genomics lab. One of our main purposes is to extend biodiversity research to non-model organisms, and make it more accessible for basic and applied research. As part of this, we are establishing DNA extraction protocols for species that are hard to work with, such as worms, insects, slugs, and historical museum samples.
Have an interesting project you’d like to try using SMRT Sequencing? For a look at our ongoing and future SMRT Grant programs, check out www.pacb.com/smrtgrant
Malaria is a complicated killer, and efforts to develop effective vaccines have been hindered by gaps in our understanding of both the parasite that causes the infection, Plasmodium falciparum, and its transmitter, the mosquito.
Like many virulent parasites, P. falciparum has evaded close genetic scrutiny due to its complex and changing composition. Its 23 Mb haploid genome is extremely AT rich (~80%) and contains stretches of highly repetitive sequences, especially in telomeric and subtelomeric regions. To make matters more complicated, it expands its genetic diversity during mitosis via homologous recombination, leading to the acquisition of new variants of virulence-associated surface adhesion molecules.
Attempts to decode the P. falciparum genome with short reads have resulted in extremely fragmented assemblies of more than 20,000 contigs each, with N50 contig sizes of less than 2 kb, barely long enough to contain a single intact gene, let alone to resolve highly homologous gene families that are often chained end-to-end across large regions. Early shotgun sequencing missed polymorphisms such as insertions and deletions, copy number variants, chromosomal rearrangements, and structural variants in P. falciparum’s hypervariable and highly repetitive regions.
In 2016, PacBio collaborated with scientists from Institut Pasteur and Cold Spring Harbor to create a complete telomere-to-telomere de novo assembly in which all 14 chromosomes were resolved into single contigs. Even extremely AT-rich regions were resolved with uniform coverage and subtelomeric regions of all chromosomes were successfully assembled in a single run for the first time.
As PacBio microbiology expert Meredith Ashby explained in a recent Labroots webinar, the assembly was “game changing.”
The most challenging parts of a genome are often the most important to decode, she explained. Regions of high homology, for example, facilitate recombination events, a key mechanism for rapid genome evolution. In other words, these regions are “where most of the exciting things are happening” in terms of immune evasion and drug resistance.
Not only has the new reference genome facilitated better analysis of these areas, as well as structural variants and large-scale changes in the genome, it has also enabled better SNP calling, Ashby said. This is important because some traits, including drug resistance, may be SNP driven.
By mapping short reads to single-molecule sequenced reference genomes, you can more confidently tell the difference between genes and pseudogenes, or between genes and new duplications, she said. This matters because many clinical isolates are sequenced using short-read technology.
Since the new reference genome was published, an additional 15 Plasmodium isolates have been assembled to near completeness. There have already been several publications about new discoveries enabled by these new assemblies, from asexual replication to locally divergent selection.
Host with the most
To fully understand malaria, however, we must also understand its host. The genome of malaria vector Anopheles coluzzii was recently assembled from a single individual using our new low input protocol.
PacBio technology has also enabled the assembly of a much improved genome of Aedes aegypti, which transmits yellow fever, Zika, dengue, and chikungunya.
Like that of Plasmodium, the A. aegypti genome is highly repetitive, and early sequencing attempts resulted in assemblies with 37,000 contigs. SMRT Sequencing reduced this to around 2,500 contigs and increased the contig N50 from 84 kb to 11.8 Mb, a far more contiguous assembly.
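For readers less familiar with the metric, N50 is the contig length at which contigs of that size or longer account for at least half of the total assembly. The Python sketch below is a minimal illustration with made-up contig lengths; the function name and the toy data are our own assumptions and do not come from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are taken from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), for illustration only.
    fragmented = [80, 70, 50, 40, 30, 20]
    contiguous = [200, 60, 30]
    print("fragmented N50:", n50(fragmented))
    print("contiguous N50:", n50(contiguous))
```

The two toy assemblies have the same total length, but the one with fewer, longer contigs has the much higher N50, which is the pattern described above.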
The AaegL5 assembly revealed an enormous number of new genes, including a far more comprehensive catalogue of odorant, gustatory and ionotropic receptors, which could provide important information for pest control strategies based on feeding and mating. The Rockefeller University researchers also identified hotspots that were under selective pressure for insecticide resistance.
Also of interest to infectious disease researchers: findings involving serine proteases, which mediate immune responses, and metalloproteases, which are linked to mosquito–Plasmodium interactions. Half of the 404 serine protease and metalloprotease gene models were improved in the AaegL5 assembly, and 49 novel proteases were discovered, Ashby said. Other vector competence hotspots were also identified, such as QTLs on chromosome 2 that were linked to systemic dengue virus dissemination in midgut-infected mosquitoes.
“Malaria, yellow fever, Zika, dengue and chikungunya cause millions of deaths worldwide every year,” Ashby said. “Hopefully these new references will yield new insight into all kinds of things that are important to reduce the global burden of infectious diseases.”
To learn more about the application of SMRT Sequencing technology in infectious disease research, watch the full Labroots presentation, or visit the PacBio team at American Society for Microbiology (ASM) Microbe 2019 at booth 1160.
A recent review article published in Frontiers in Genetics offers a great look at the landscape of long-read sequencing. Authors Tuomo Mantere, Simone Kersten, and Alexander Hoischen from Radboud University Medical Center in the Netherlands focus on emerging applications in medical genetics for long-read technologies.
“With the recently demonstrated success in identifying previously intractable DNA sequences and closing gaps in the human genome assemblies, long-read sequencing (LRS) technologies hold the promise to overcome specific limitations of NGS-based investigations of human diseases,” the scientists write. “LRS has the potential to grow into a technology that is used not only to produce high-quality genome assemblies (i.e., the platinum human reference genome), but also to capture clinically relevant genomic elements which are problematic for conventional approaches.”
The team highlights four use cases for which long-read sequencing is particularly well suited, diving into detail about each:
- Discovering disease-causing structural variants missed by short-read technologies
- Direct and accurate sequencing of repeat expansions or GC-rich regions
- Enhanced phasing of variants to determine haplotypes and parent of origin
- Distinguishing between genes and pseudogenes
The review highlights peer-reviewed research publications describing novel discovery across each application in a wide variety of disease areas including: X-Linked Parkinsonism, ALS and FTD, SCA10 and Parkinson’s disease, Myotonic dystrophy, Bardet-Biedl syndrome, and Fragile X disorders.
The authors also examine transcriptome analysis, indicating that long reads have been successfully employed to sequence full-length mRNA transcript isoforms as a complement to existing short-read RNA sequencing approaches. “Ultimately, this knowledge can be implemented to improve WES/WGS-based variant filtering, prioritization and prediction of their functional impact,” Mantere et al. report.
The authors conclude with a prediction that, in the future, “the broader use of [targeted long-read sequencing] could significantly increase the diagnostic yield of genetic testing and discover novel disease genes.”
When looking to understand the functional implications of genetic variability, scientists should seek out the Iso-Seq method, according to Cold Spring Harbor researchers.
In a recent paper published in Frontiers in Genetics, Doreen Ware, Bo Wang, and colleagues reviewed the state of transcript sequencing and analysis technologies, and concluded that single-molecule sequencing from PacBio provided several advantages over other methods.
A major challenge in molecular biology continues to be the complex mapping of the same genome to diverse phenotypes in different tissue types, development stages and environmental conditions, the paper states.
“A better understanding of the transcripts and expression of gene regulation is not only non-trivial but lies at the heart of this challenge,” the authors write.
RNA sequencing can support both the discovery and quantification of transcripts using a single high-throughput sequencing assay. But methods that rely on short reads have several limitations in revealing gene regulation, the protein-coding potential of the genome and ultimately the phenotypic diversity.
Long-read SMRT Sequencing for RNA characterization has the advantage of rendering, in vitro and without ambiguity, a full-length transcript sequence without depending on the error-prone computational step of assembly. As a result, long reads allow more precise detection of alternative splicing events and, ultimately, novel isoforms, making it easier to build gene models for species that are poorly studied or have an incomplete or missing reference genome, the authors state.
“With the development of single-molecule sequencing technology, ‘one read is one transcript’ is not a dream anymore, and scientists can get the intact sequence of each isoform by sequencing a single cDNA molecule,” the authors write.
The Iso-Seq approach offers particular advantages in the characterization of polyploid transcriptomes, which have a large number of repeats and homeolog genes, and in the profiling of allele-specific expression, Ware and Wang state.
They also detail experimental and informatic pipelines and highlight several downstream applications of the Iso-Seq method, including:
- alternative splicing
- alternative polyadenylation (APA)
- fusion transcripts
- long non-coding RNAs (lncRNAs)
- isoform phasing, and
- genome annotation
Regarding the last item, the team states that the Iso-Seq method can increase the accuracy of automated genome annotation by improving genome mapping of sequencing data, correctly identifying intron-exon boundaries, directly identifying alternatively spliced transcripts, identifying transcription start and end sites, and providing precise strand orientation for single-exon genes. Mapped against a reference genome, the full-length transcripts that are uncovered can be used to improve or add de novo structural and functional annotation to a genome, and to improve genome assembly and existing gene models, they state.
“Iso-Seq is known to retrieve longer isoforms as well as more number of isoforms… This has revolutionized our understanding of the biology of a number of organisms, including plants and animals, since transcript diversity usually represents functional diversity,” the authors write.
Iso-Seq analysis has also benefited evolutionary studies, as it allows scientists to compare the splicing variants between species and better understand the conservation of genes/isoforms, the divergence of splicing patterns, and the significance of their expression levels.
The next challenge? What to do with all the new isoforms identified from the Iso-Seq method.
The growing number of isoforms identified from different tissues and conditions within an organism will need to be ranked and prioritized for community research. Not all of them will have a meaningful impact on the biological processes of the cell, Ware and Wang note, so the results will have to be carefully validated and characterized.
“Experimental approaches such as CRISPR could help by targeting the role of each isoform, and see if there are redundant or complementary functions among these different splicing isoforms,” they conclude.
Hundreds of SMRT scientists came together recently in Leiden to learn about the latest updates to PacBio technology and to showcase their data analysis tools. Extremely useful information was shared, and future collaborations were sparked. For those who weren’t able to jet to the Netherlands to attend, we’ve rounded up the top tools and tips presented at the European SMRT Informatics Developers Meeting. For an in-depth report on the event, check out this blog post by PacBio Principal Scientist Elizabeth Tseng.
- SMRT Link – Of course our own open-source SMRT analysis software suite will be top of the list. Updates to the system have resulted in many improvements, including 8x faster time-to-results for CCS generation and 20x faster mapping with minimap2 using our own wrapper pbmm2; important improvements to CCS to support PacBio’s HiFi data type; detection of more types of structural variants; increased automation; and PDF reports.
- Bioconda – Want to be the first to try out new and improved analysis tools? Many updates to PacBio algorithms, assembly packages, and other tools are available on Bioconda before their official release, including the latest Sequel II System changes.
- pbsv – Our structural variant (SV) calling and analysis tool has also been updated. What’s new? An increase in sensitivity for large insertions and deletions, and calling of duplications and copy number variation… meaning that pbsv now calls all major SV types 20 bp and longer.
- DAZZLER Suite – Need to find all significant local alignments between reads? Or to remove chimeras, adaptamers, and low-quality dropouts? Da’ Gene Myers (@TheGeneMyers) has an app for that. Or several, actually, including DALIGNER, DASCRUBBER and DAMASKER. Myers announced he has updated the suite to better support highly accurate, long HiFi reads.
- PRINCESS – Prolific toolmaker Fritz Sedlazeck (@sedlazeck), creator of the SV caller Sniffles, unveiled his work-in-progress, PRINCESS, a Snakemake pipeline to call and phase SNPs and SVs. Keep an eye on his GitHub site to snag it when it drops.
- TAMA – The all-in-one Transcriptome Annotation by Modular Algorithms tool by Iso-Seq expert Richard Kuo (@GenomeRik) can do many things, including: mapping RNA reads to transcript annotation, merging annotations (can combine PacBio with references like ENSEMBL), identifying coding regions and associating them with known genes.
- SQANTI – This quality control pipeline by Ana Conesa (@anaconesa) can categorize Iso-Seq data against a reference annotation. It allows users to see which genes/transcripts are novel/known and offers detailed annotations on canonical/non-canonical junctions. A modified version of SQANTI is SQANTI2 by Elizabeth Tseng (@magdoll).
- TAPPAS – A Java-based application, also by Ana Conesa, that creates beautiful visualizations utilizing information at both the transcript and protein level. It can identify differential expression at both the isoform level and the gene level.
- pyPaSWAS – Program for DNA/RNA/protein sequence alignment, read mapping and trimming, by Sven Warris (@swarris).
- WhatsHap – Software from Tobias Marschall (@tobiasmarschal) for read-backed phasing of variants. Jana Ebler discussed an extension to WhatsHap to simultaneously call and phase variants in long reads.
Variant callers are not all the same – in fact, there are times when their algorithms don’t agree. So, what do you do? Ryan E. Mills (@ryan_e_mills), an assistant professor at the University of Michigan, laid out the problem — and two of his solutions — in a presentation at the Labroots Genetics and Genomics conference:
- VaPoR – A structural variant validator that uses a dotplot of PacBio reads against the reference genome to visualize and automatically score candidates for patterns that suggest deletions, insertions, tandem duplications, or inversions.
- PALMER – The Pre-mAsking Long reads for Mobile Element InseRtion tool detects non-reference MEI events (LINE, Alu and SVA) and other insertions, by using the indexed reference-aligned BAM files from long-read technology as inputs. It uses the track from RepeatMasker to mask the portions of reads that aligned to these repeats, defines the significant characteristics of MEIs (TSD motifs, 5′ inverted sequence, 3′ transduction sequence, polyA-tail), and reports sequences for each insertion event.
What can a cute, cuddly, stingless bee from the Brazilian rainforest teach us about eusociality and mitochondrial evolution?
Natalia S Araujo wants to find out, and she’s not the only one. Melipona bicolor is the only bee species in which true polygyny (multiple fertile queens in the same colony) occurs, so there is great interest in it, and its mitochondrial genome (mt genome) was one of the first sequenced in bees. But the sequence was incomplete and lacked information about its mitochondrial gene expression pattern.
So Araujo, a postdoctoral researcher of animal genomics in the GIGA Institute of the University of Liège, Belgium, and her collaborator, Maria Cristina Arias from the University of Sao Paulo, Brazil, combined long and short reads of M. bicolor DNA with RNA-Seq data to characterize its species control region and transcription patterns.
Araujo reported the results at the fourth annual SMRT Leiden Scientific Symposium, held May 7-8 at Leiden University Medical Center in the Netherlands, the first day of which was abuzz with research that highlighted the importance of biodiversity and conservation genomics.
They created a 15,001 bp mt genome, including a control region of 255 bp, with the highest AT content reported so far for bees (98.1%).
“Interestingly, conserved structures were identified for the first time in the control region of all eusocial corbiculate bees sequenced so far,” Araujo said. “M. bicolor has one of the most compact and functional mtDNA reported in bees. Results reveal unique and shared features of the mitochondrial genome in terms of sequence evolution and gene expression making M. bicolor an interesting model to study mitochondrial genomic evolution.”
Illuminating Life with Better Sequencing
SMRT Leiden Keynote speakers Paul Hebert of the University of Guelph, Canada, and Mara Lawniczak of the Wellcome Sanger Institute, UK, bookended the day with inspiring talks about large-scale species sequencing projects.
Hebert introduced BIOSCAN, an initiative by the International Barcode of Life (iBOL) project to use new technology to identify, register, and monitor life. Launched in 2007 out of Leiden, the iBOL consortium had the original mission to develop and deploy a DNA-based “barcode” identification system for animals, fungi, and plants. It successfully delivered DNA barcodes for 500,000 species in 2015 and developed an automated species recognition and discovery system called the Barcode of Life Data System (BOLD), which will now help the consortium focus on scanning for the unknown.
Originally based on Sanger sequencing, iBOL moved to using the PacBio Sequel System for species discovery after finding that it was more cost-effective and accurate at delivering species-level information. They now use it to barcode 5 million species per year and run species discovery for 500,000 species per year.
“There’s a real urgency in this that I haven’t encountered in other areas of science, in that the subjects of our research are disappearing. Don’t let the dodo fool you – most animals that go extinct don’t leave their bones behind on the forest floor. They leave a smudge,” Hebert said. “Every loss of species is a valuable loss of genomic data.”
They will also be using the Sequel System to elucidate species interaction. Environmental samples are messy, Hebert told the SMRT Leiden crowd. Some of the insect samples his team analyzed had DNA mixtures from animals they fed on (elephants, kangaroo, even nematodes). So, to characterize comprehensively, they extract the DNA of whole species, amplify with primers from different taxa, sequence deeply (>3,000-fold), and create species symbiomes.
Lawniczak discussed the Earth BioGenome Project, a moonshot for biology aiming to sequence, catalog, and characterize the genomes of all of Earth’s eukaryotic species, and Sanger’s own Darwin Tree of Life Project.
She also went into detail about efforts to better understand disease vector species through genetic sequencing, including the Anopheles gambiae 1000 Genomes Project and the Sanger project to sequence the A. coluzzii genome from a single mosquito using PacBio’s new low DNA input workflow.
“How can we outpace the evolution of resistance?” she asked the crowd. The answer: high-quality reference genomes and de novo insect assemblies.
Other speakers described how long-read technology is revealing new insights into the diverse species that are relevant to our nutrition (bitter gourd by Henri Van de Geest of Genetwister Technologies, Netherlands, and cassava by Herve Vanderschuren of the University of Liège, Belgium), medicine (cannabis by Kevin McKernan of Medicinal Genomics), and evolution (cave fish by former SMRT Grant winner Fritz Sedlazeck of the Baylor College of Medicine Human Genome Sequencing Center).
Presentations by Kateryna Makova of Penn State, Oliver Duss of the Scripps Research Institute, and Iain Macaulay of the UK’s Earlham Institute also revealed how SMRT Sequencing can be developed to query chemical processes (transcription) and reveal new information (single cell).
For an in-depth report on each presentation from Day One, check out this blog post by PacBio Principal Scientist Elizabeth Tseng.
Evolving Towards Precision Medicine
If there’s one area where accuracy is essential, it’s medicine, as highlighted by Euan Ashley as he opened Day Two of SMRT Leiden.
“If you miss one important coding variant in one gene, it could be the difference between life and death,” the cardiologist said.
A bit of a DNA detective, Ashley described his work at Stanford and as part of the Undiagnosed Diseases Network using long-read sequencing to help solve medical mysteries. Within a 20-month period, they delivered diagnoses for 132 out of 382 cases, a 35% solve rate, and a surprising 79% of the solved cases had actionability.
“This is truly transformational technology,” Ashley said. “There has been a lot of hype around genomics, but this is actual delivery here.”
What are we missing in the other 65% of cases? Repeat sequences, paralogous genes, mosaicism, non-disruptive variants — complexities of the genome that require long, comprehensive reads, better algorithms, and, ideally, graph reference genomes, he said.
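The solve rate Ashley quoted follows directly from the case counts. The short sketch below is only a sanity check on that arithmetic; the variable names are ours, and no figures beyond the reported 132 and 382 are assumed.

```python
# Case counts as reported in the talk summary above.
solved_cases = 132
total_cases = 382

# Fraction of cases that received a diagnosis, and the remainder.
solve_rate = solved_cases / total_cases
unsolved_rate = 1 - solve_rate

print(f"Solved:   {solve_rate:.1%}")
print(f"Unsolved: {unsolved_rate:.1%}")
```

Rounded to whole percentages, these are the 35% and 65% figures cited in the text.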
Evan Eichler of the University of Washington, an early adopter of PacBio technology, also discussed the challenges of characterizing structural variation (SV) in the human genome, and some of his solutions.
His lab’s first study using PacBio for SV calling was eye-opening because it found ~22,000 novel genetic variants corresponding to 11 Mb of sequence. They had not expected to see so much novelty.
However, even with the long reads, segmental duplications (SD) remain challenging. These SD regions encompass nearly 500 genes and are the most copy number polymorphic regions, so they are important to catch, yet Eichler found that 75% of SDs on a human genome he analyzed were not assembled. So he used a method called Segmental Duplication Assembly (SDA), which maps reads to assembled contigs to identify variants which can then be used to separate reads into paralogous sequences.
Other speakers showcased the use of long reads to resolve complex regions in the human genome that have both important evolutionary and clinical significance. Kornelia Neveling of Radboud University Medical Centre discussed long-read amplicon sequencing and its advantages in clinical applications (less PCR, improved breakpoint detection and haplotyping, and the ability to separate paralogous genes) and Melissa Laird Smith of the Icahn Institute at Mt. Sinai spoke about how she is using PacBio sequencing to characterize the full diversity of the immunoglobulin heavy chain locus.
The SMRT Leiden audience also heard how accurate, long reads are being adopted for testing medically important genes that can have therapeutic actionability. Ming-Hsiang Lee of the Sanford Burnham Prebys Medical Discovery Institute presented work on somatic mosaicism in the APP gene in relation to sporadic Alzheimer’s disease. Janet Song of Stanford explained how she used long-read sequencing to study tandem repeats in the CACNA1C intronic region to better understand its association with bipolar disorder and schizophrenia.
Alexander Mellmann of University Hospital Munster spoke about tracking bacterial infections in hospitals by combining whole genome sequencing with genome-wide gene-by-gene typing (cgMLST). The pipeline has enabled the hospital to get better information about transmission rates and to evaluate its hygiene control methods.
And Yahya Anvar of the Leiden University Medical Center reported the use of PacBio HiFi reads for resolving complex pharmacogenes. Differential drug response is common (in 50% of the population) and dangerous (fourth leading cause of death). Previously, many of the PGx genes could not be characterized by short-read sequencing due to the genes’ highly repetitive nature, but Anvar showed that 77% of the important PGx genes could be fully phased, while another 19% could be partially resolved, using HiFi reads at 30-fold coverage.
For an in-depth report on each presentation from Day Two, check out this blog post by PacBio Principal Scientist Elizabeth Tseng.
Want a visual summary of the two-day conference? Evolutionary biologist and scientific illustrator Alex Cagan of the Sanger Institute was on site sketching each presentation in real-time – how appropriate. You can check out the resulting work here.
They are the unwelcome comeback kids: Measles, mumps and other old-time diseases that were once nearly extinct are on the rise in suburban communities as well as developing nations.
In order to better understand the evolution of these microbial menaces, researchers at the Wellcome Sanger Institute and Public Health England have been sequencing historical samples deposited in the UK’s National Collection of Type Cultures (NCTC).
The latest is a strain of cholera-causing bacteria (Vibrio cholerae) extracted in 1916 from the stool of a British soldier who was convalescing in Egypt. Researchers at the Sanger Institute revived the WWI soldier’s bacteria – thought to be the oldest publicly-available V. cholerae sample – and sequenced its entire genome using SMRT Sequencing. They also conducted pan-genome and phylogenetic analyses, which included 197 V. cholerae genome sequences and sequences from a handful of related Vibrio species.
There were a few surprises. First, the strain (dubbed NCTC 30) was found to be non-toxic. This may explain why so few soldiers in the British Expeditionary Forces became ill with debilitating diarrhoea between 1914 and 1918, despite being in the midst of the world’s sixth global cholera pandemic.
Secondly, NCTC 30 was found to harbour a functional β-lactamase antibiotic resistance gene, even though it was collected before the introduction of penicillin-based antibiotics.
“This adds to increasing evidence that genes for antibiotic resistance in bacteria existed before the introduction of antibiotic treatments, possibly because the bacteria needed them to protect against naturally-occurring antibiotics,” said first author Nick Thomson of Wellcome Sanger Institute.
Although NCTC 30 differs from the current ‘El Tor’ biotype that still afflicts millions worldwide in a pandemic that began in 1961, Thomson said it is important to study strains from different points in time in order to better understand the evolution of the disease.
“Even though this isolate did not cause an outbreak it is important to study those that do not cause disease as well as those that do,” Thomson said. “These findings illustrate the rich history, as well as biological insights, that can be garnered from the study of bacterial pathogens.”
Additional research from the NCTC will be presented at the annual meeting of the American Society for Microbiology (ASM 2019, June 20-24). Be sure to stop by the PacBio booth (#1160) as well, to learn about microbial whole genome sequencing and high definition metagenomics with HiFi reads on the new Sequel II System.
Stop, Scrape, Squash… and Sequence!
The latest invasive insect to hit headlines, the spotted lanternfly, has a voracious, indiscriminate appetite, with a particular taste for apples, grapes and maple — bad news for the wine, orchard and syrup industries of New England, where the Asian pest has been spotted.
But there’s good news too, thanks to the expanded capacity of the new Sequel II System. USDA scientists were able to generate a high-quality, 2.3 Gb de novo assembly of a field-caught female Lycorma delicatula on a single SMRT Cell 8M.
Not only will the new genomic resource provide valuable insights into the little-known biology of the agricultural pest, which nearly wiped out entire vineyards in previous infestations, but such efficient assembly could enable the comprehensive genomic monitoring of the species throughout the field season, as well as rapid testing of pest intervention strategies.
This is important because the pest has the potential to cause devastation to several agricultural industries and communities. Shortly after the spotted lanternfly was detected in the U.S. in 2014 (believed to have hitched a ride on a shipment of stones), the Pennsylvania Department of Agriculture established a quarantine zone surrounding the site of first detection. Since that time, the L. delicatula quarantine zone has expanded from an area of 50 mi² to over 9,400 mi².
Despite this, essentially nothing is known at the genomic level about L. delicatula or any Fulgorid (planthopper) species.
As reported in a new bioRxiv preprint, a team of scientists from several USDA-ARS research laboratories, led by Scott M. Geib of the Daniel K Inouye U.S. Pacific Basin Agricultural Research Center in Hawaii and in collaboration with scientists at PacBio, sequenced and assembled a high-quality reference genome for L. delicatula from a single library run on one PacBio SMRT Cell 8M. Previous planthopper genome projects required between 100 and 5,000 inbred individuals and at least 16 different sequencing libraries. Although the new USDA assembly is 2-4 times larger than those previous efforts, it is 13-63 times more contiguous.
“The methods described here for DNA extraction, library preparation and sequencing are straightforward and rapid, using established kits and leveraging the higher throughput of the Sequel II System to generate sufficient sequencing coverage with just one SMRT Cell and 30 hours of sequencing run time,” the authors write.
The project also shows how such studies can now be achievable by individual labs rather than requiring large consortia that were typical of previous genome assembly efforts, the authors state.
“The ability to generate genomes de novo from field-collected arthropods makes high quality genomes accessible for many more species,” they added.
Rapidly generating foundational data for invasive species is one big advantage of performing single-insect genome assemblies, but there are others as well, the authors note.
Direct sequencing of wild specimens dispenses with the requirement of inbred lab colonies, which may take months or even years to establish, can be expensive to maintain, and are impractical or impossible for many species.
By sampling field-collected animals, genetic variation can also be more accurately characterized for local populations, without the risk of adaptation to lab culture or loss of heterozygosity.
And it is easier to assemble than a sample of many pooled individuals, each of which may contribute up to two unique haplotypes.
Although the scientists did their best to avoid sample contamination by bacterial symbionts known to reside inside the lanternfly, there were signs that the DNA of two endosymbionts was mixed into the insect DNA they extracted.
Rather than dismiss these fragments, the researchers decided to make the most of the finds, and identified two endosymbiont genomes within the assembly. This was relatively easy to do, thanks to the long reads and robust assembly strategy that allowed for clear discrimination of the microbial symbionts — which were each complete and in single contigs — from the host.
The complete genomes of the endosymbionts, Sulcia muelleri and Vidania fulgoroideae, may provide additional information that could prove useful for pest control, the authors state.
“Because obligate symbionts in phloem-feeding insects typically provide nutritional benefit to their hosts, the symbiont genomes offer insight into nutritional requirements and basic metabolic functioning of L. delicatula,” they write.
The bacteria could also present opportunities for pest control.
In spotted lanternflies, the endosymbionts are transferred from female to offspring through the ovaries, and there may be a time window during late summer feeding when this vital transfer could be disrupted. Or the bacteria could provide highly species-specific genomic targets for control with RNA inhibition (RNAi) strategies, the authors note.
By Ángel Vergara Cruces, Universidad de Málaga
Plant geneticists have achieved a sweet feat: the first assembly of the octoploid strawberry genome.
As reported in Nature Genetics earlier this year, a team led by Steven J. Knapp of the University of California-Davis and Patrick P. Edger of Michigan State University identified more than 100,000 genes in their high-quality assembly and annotation of the commercial strawberry, Fragaria x ananassa.
The main challenge when assembling a polyploid genome is that similar regions in different subgenomes (so-called homeologous regions) can lead to uncertainty about where to assign a given read when using methods that produce short reads. To overcome this challenge, the authors combined short-read and long-read genome sequencing technologies.
They initially assembled the genome using Illumina and 10X Genomics data. Next, they used Hi-C to obtain bigger scaffolds and to identify sequences in the genome that are physically close. As these interactions occur more frequently between contiguous sequences, Hi-C can help to correct misjoins and reduce the number of scaffolds.
Finally, PacBio long-read sequencing was performed to clarify difficult sections to map, such as repetitive stretches of the genome. In this way, PacBio SMRT Sequencing helped to fill in the gaps between the scaffolds and bring the assembly to 28 pseudomolecules, encompassing 99% of the estimated size of the genome.
With the genome in hand, the authors were able to resolve a long-standing question in the field: exactly which diploid parentals contributed the four subgenomes present in F. x ananassa.
The Fragaria genus has diploid, tetraploid, hexaploid and octoploid species. F. vesca and F. iinumae are two wild diploid species that were previously identified as parentals. The new genome, combined with an extensive transcriptomic analysis of every diploid species, allowed the additional identification of the previously unknown F. nipponica and F. viridis as ancestral parentals.
The distribution and natural history of these species allowed the researchers to hypothesize that a hexaploid species from Asia, resulting from hybridization between F. iinumae, F. nipponica and F. viridis, hybridized with F. vesca in North America, giving rise to the octoploid strawberry and a single dominant subgenome, which largely controls its metabolomic and disease-resistance traits.
Events of polyploidy and genome rearrangements are of paramount importance in plant evolution and domestication. They are usually followed by diploidization, and lead to neo- and sub-functionalization of duplicated genes and changes in the regulation of gene expression. This new polyploid genome promises to be an amazing step towards a better understanding of these phenomena, and will no doubt provide a powerful resource for breeders.
Another team of scientists at the Jiangsu Academy of Agricultural Sciences in China recently reported their use of PacBio Iso-Seq analysis to establish a high-quality reference transcriptome for cultivated strawberry (Fragaria x ananassa), providing insight into many metabolic and signaling pathways. To learn more about their findings, read their publication in Horticulture Research.
This month’s meeting of the American Association for Cancer Research (AACR) in Atlanta was a great showcase of the latest academic and translational research in this field. Years ago, the idea of analyzing cancer genomes played a niche role at the conference. Now, genomic and transcriptomic assessments are widely accepted as pivotal ways to understand cancer. This may have been best embodied by the organization’s choice for a new president: genome assembly pioneer Elaine Mardis.
As usual, the PacBio team was out in force for the AACR conference. We attended talks, presented posters, and greeted fellow attendees at our booth. One of our favorite moments occurred during a presentation from AACR Program Committee Chair John Carpten, director of the Institute of Translational Genomics at the University of Southern California, when he called for combining long-read technologies like ours with short-read and other tools to generate new insights into structural variants and specific gene isoforms in cancer.
Indeed, several posters and talks focused on exactly these applications. First, Remond Fijneman from the Netherlands Cancer Institute presented a poster, “Characterization of structural variants within MACROD2 in the pathogenesis of colorectal cancer,” which detailed that MACROD2 is affected by structural variants in >40% of colorectal cancers and >70% of metastases, but only very rarely in colorectal adenomas. Fijneman showed preliminary data indicating that target capture sequencing with PacBio has promise for providing more detailed information on the nature and variability of these deletions.
PacBio scientist Aaron Wenger shared a preliminary analysis of a collaboration with researchers at Radboud University Medical Center and University Medical Center Utrecht aimed at developing a robust truth set of structural variant calls in a cancer cell line. The poster, “Structural variant detection with long read sequencing reveals driver and passenger mutations in a melanoma cell line,” details the range of variants that can be identified with the PacBio structural variant calling pipeline, and reviews in silico experiments to estimate appropriate coverage levels for structural variant discovery in cancer samples.
There were also several presentations on the Iso-Seq method. PacBio’s Brendan Galvin presented “Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements,” detailing recent improvements in the Iso-Seq workflow to reduce sample input requirements and sample prep time, and speed up Iso-Seq analysis to work with larger volumes of data. In addition, 2017 SMRT Grant winner Andrew Ludlow gave a talk on his work using the Iso-Seq method to understand the role of NOVA1 in telomere biology during cancer development. His talk is available as part of the AACR 2019 Annual Meeting Webcast.
Finally, Neil Vasan from Memorial Sloan Kettering Cancer Center presented a compelling poster on his work using PacBio amplicon sequencing to understand differences in treatment response, “Double PIK3CA mutations in cis enhance oncogene activation and sensitivity to PI3K alpha inhibitors in breast cancer.” By retrospectively sequencing full-length PIK3CA in patients receiving PIK3CA inhibitor therapy, Vasan found that 10% to 15% of patients have double mutations, and that these are predominantly present in cis. Importantly, compound mutations are predictive of increased sensitivity to inhibitor treatment.
If you attended AACR, we hope you had an informative and fun time too!
We’re thrilled to announce the launch of the Sequel II System, reducing project costs and timelines with approximately eight times the data output compared to the previous Sequel System. It enables customers to comprehensively detect human variants ranging in size from single nucleotide changes to large, complex structural variants. The system is also ideal for standard applications such as de novo assembly of large genomes and whole transcriptome analysis using the Iso-Seq method.
The Sequel II System is based on the proven technology and workflow underlying the previous version of the system, but contains updated hardware to process the new SMRT Cell 8M. Building upon the October 2018 release, which delivered highly accurate individual long reads (HiFi reads), the Sequel II System also produces these Sanger-quality reads (>99.9% accuracy), now at approximately eight times the scale.
Extensive analyses of Sequel II data demonstrate that the system can provide equivalent or higher accuracy compared to the Sequel System. HiFi reads can enable comprehensive human variant detection with ≥95% precision and recall of structural variants, ≥99% precision and recall of single nucleotide variants, and ≥96% precision and recall of insertions and deletions with as little as 15-fold coverage.
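For readers less familiar with these metrics, precision and recall are computed by comparing a set of variant calls against a truth set. The sketch below is a minimal illustration only: the function name is hypothetical and the counts are made up for the example, not PacBio benchmark data.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a call set scored against a truth set.

    true_pos:  calls that match a variant in the truth set
    false_pos: calls with no match in the truth set
    false_neg: truth-set variants that were not called
    """
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("need at least one call and one truth-set variant")
    precision = true_pos / (true_pos + false_pos)  # share of calls that are correct
    recall = true_pos / (true_pos + false_neg)     # share of true variants recovered
    return precision, recall


# Illustrative counts only (assumed for this example).
precision, recall = precision_recall(true_pos=960, false_pos=40, false_neg=40)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

A caller can reach high precision by making few, confident calls, or high recall by calling liberally, which is why the two numbers are reported together.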
You don’t have to take our word for it. To ensure the system performed well, we ran an early access program with five institutions: Wellcome Sanger Institute, HudsonAlpha Genome Sequencing Center, University of Maryland Institute for Genome Sciences, Broad Institute, and The Icahn Institute at Mount Sinai. All five early access sites have subsequently purchased their first Sequel II Systems.
According to early access user Luke Tallon, Scientific Director, Genomics Resource Center at the University of Maryland Institute for Genome Sciences: “In our experience, the Sequel II System was essentially production-ready right out of the box. We have used it for a range of applications and sample types — from human genome sequencing to metagenome and microbiome profiling to non-model plant and animal genomes — and results have been very good. For our lab, which handles tens of thousands of microbial samples each year, the Sequel II System has the potential to enable really deep yet cost-effective microbiome and metagenome sequencing.”
Another early access user, Jeremy Schmutz, Faculty Investigator at the HudsonAlpha Institute for Biotechnology, had this to say: “It’s amazing to see lower-cost human genome sequences coming from the Sequel II System approaching the quality and completeness of the human reference genome, which took billions of dollars and many years to produce. We have already used the Sequel II instrument for human genomes as well as complex plant genomes and the results have been at least as good as the data we were getting from the original Sequel System with the added benefit of much higher throughput.”
To learn more, please listen to our on-demand webinar featuring an overview of the Sequel II System with presentations by several early access customers. Or contact us to talk to one of our application specialists to discuss your research.
Researchers rely on PacBio long reads for richness and resolution when probing genomes, and these same attributes are becoming increasingly relevant in clinical settings. One field where the technology shows particular promise is infectious disease control.
When a disease outbreak hits a hospital, it is crucial that the pathogen and its transmission path are rapidly and accurately identified.
As PacBio researchers demonstrated in CLP magazine, current microbial detection techniques that rely solely on short-read DNA sequences can misidentify pathogens, resulting in incorrect prognosis and misinformed treatment decisions.
SMRT Sequencing, on the other hand, can paint a complete picture of entire genomes, as well as plasmids and other elements that may contribute to drug resistance. In an era of increasing antibiotic resistance, this is especially important. Methylation data generated during the sequencing process can provide further insights into pathogen virulence, essential information for outbreak response and control efforts.
At Mount Sinai Hospital in New York City, pathologists have successfully integrated SMRT Sequencing into the Mount Sinai Pathogen Surveillance Program. Established in 2014, the program has sequenced more than 1,000 genomes, cataloging around 36,000 isolates from 18,000 patients. While its original role was in reactive outbreak investigation, it is now a tool for proactive, continuous infection surveillance.
Adding SMRT Sequencing to routine surveillance of MRSA and C. difficile throughout the hospital has provided a more comprehensive view of drug resistance and revealed new pathogenic strains and unexpected transmission paths—and it does so rapidly enough to inform treatment, quarantine protocols, and other decisions that are essential for optimal patient care, according to Harm van Bakel, Assistant Professor of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai.
In a recent webinar, van Bakel gave an example where deep dives into genomic data enabled his team to crack a cryptic transmission case within the hospital’s newborn intensive care unit (NICU).
A total of 18 infants were struck with MRSA over the course of four months. Within this time, genetic drift resulted in changes in isolate genomes from patient samples. The depth and resolution of the PacBio whole genome sequencing enabled van Bakel to detect subclades and sub-transmissions within the outbreak, and to apply this information to an analysis of outbreak progression over time.
“It’s very useful here to have the complete gene so that you can maximize the amount of available variation information,” van Bakel said. “You can get to some of these elements with short-read sequencing, but you can only really understand what’s going on if you reconstruct complete genomes.”
The pathologists married their genetic observations with other information collected around the hospital to fully reconstruct the outbreak — and ultimately track its origin.
An initial outbreak among a few patients quickly spread throughout the NICU. The unit was completely emptied and extensively cleaned as a result. But upon return to the unit, two more patients infected each other, followed by another few. This led to a second cleaning event and additional training, and the outbreak was successfully contained.
The staff were puzzled by the origins of the infection. Thanks to Mt. Sinai’s continuous sample collection as part of the surveillance protocol, van Bakel was able to go back several months into the hospital’s archived data and discover that the original transmission source was an adult ward.
That raised additional questions, however. The units were in different buildings, with no overlap of healthcare providers. How was the infection transmitted between units? Examining environmental samples collected from mobile equipment that might have passed through both wards, van Bakel was able to solve that mystery, too: Ventilators.
“Building these large collections of clinical samples and bacterial/viral isolates, and combining them with whole genome analysis, really allows you to study transmission dynamics and pathogen evolution in a hospital setting,” he said.
“In almost all cases where we investigate outbreaks, the genome tells us a slightly different story than what was suspected based on epidemiology alone,” van Bakel added. “I think it’s really something that needs to be integrated into general infection prevention practices in hospitals.”
With its unique medicinal and psychoactive compounds, the popularity of cannabis is spreading… well, like a weed. Now legal in 10 states for recreational use, and in 33 for medical use (with the FDA approval of the first oral cannabis drug for epilepsy on June 25, 2018), the once-forbidden plant is primed to become one of the most talked-about, and valuable, agricultural crops.
But what needs to be done to take this promising crop into the clinic?
Sound science, accurate testing protocols, and stringent tracking systems, all of which can be achieved through genomics, according to Kevin McKernan, the former research and development lead on the Human Genome Project at MIT, whose company Medicinal Genomics (MGC) created the first Cannabis sativa genome in 2011.
As McKernan himself will admit, that first attempt was a bit of a mess.
“The draft assembly included hundreds of thousands of pieces, and was hardly functional. The sequencing technology we had back then just couldn’t handle all the repeat content and the polymorphism of the genome,” he said. “Over the next seven years, a lot of people tried to improve it, but they were only achieving average lengths of 159 kB N50s. This past spring (2018) we decided to nail it.”
The result, as reported by the Center for Open Science, was a high-quality reference assembly of the Jamaican Lion strain that is 1,000 times more contiguous than the 2011 assembly.
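Contiguity comparisons like this one are usually expressed in terms of N50, the statistic McKernan mentions above. As a rough sketch (the function name and the toy contig lengths are our own, not data from either cannabis assembly), N50 is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly size:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: two assemblies of the same total size (300 units).
contiguous = [100, 80, 50, 30, 20, 20]
fragmented = [10] * 30

print(n50(contiguous))
print(n50(fragmented))
```

The two toy assemblies have the same total length, but the one made of fewer, longer contigs has the higher N50, which is the sense in which one assembly is "more contiguous" than another.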
More than 180 billion bases were sequenced on the Sequel System, allowing the Medicinal Genomics team to select the longest reads as the foundation for the DNA assembly process. The reads were so long that every base was covered over 15 times with 60,000 base pair reads.
This was important because the cannabis genome is 10 times more varied than the human genome; it is highly repetitive and the most interesting cannabinoid and terpene synthase genes appear to have been tandemly multiplied and are separated by really large (32 kb, 64 kb, 96 kb) repeats that are longer than most other sequencing platforms’ read lengths.
There are more than 483 different identifiable chemical constituents known to exist in cannabis, of which over 80 are unique to the cannabis plant. These constituents include nitrogenous compounds, amino acids, proteins, glycoproteins, enzymes, sugars and related compounds, hydrocarbons, alcohols, acids, esters, aldehydes, ketones, fatty acids, lactones, steroids, terpenes, non-cannabinoid phenols, flavonoids, vitamins, pigments and elements.
CBD, the active ingredient in the FDA-approved epilepsy drug Epidiolex, is a chemical component of the Cannabis sativa plant. However, CBD does not cause intoxication or euphoria (the “high”) that comes from tetrahydrocannabinol (THC), the primary psychoactive component of marijuana. Different strains of cannabis (separated into Type I, Type II, Type III cannabis, and hemp lines) have different quantities of these compounds.
Having a comprehensive cannabis reference genome of a Type II (THC and CBD producing) variety is going to help tremendously in understanding the genetics of the plant and how to breed for more CBD or different esoteric cannabinoids. It opens the door to a host of industry innovations, including marker-assisted selection for genetically-based strain identification, accelerated breeding to improve production yields, reliable seed-to-sale tracking systems, and pathogen identification to ensure cannabis purity and safety.
“We now have a genome that is better than most of the other agricultural crops out there,” McKernan said. “But many more cultivars require sequencing to better understand these complex loci. We want a pan-genome. We want 12 really well done genomes, all sequenced with the same technology, to account for the different number of CBCAS, CBDAS, and THCAS (cannabinoid synthase) genes observed in each genome.”
So McKernan selected PacBio SMRT Sequencing to achieve this as part of the Cannabis Pan-Genome Project announced on Thursday. Using MGC’s assembly of the female Jamaican Lion cultivar as a baseline, genomic DNA from a sibling male plant and multiple offspring was isolated and is being sequenced with the Sequel II System long-read platform to identify structural variations and other important types of genetic variation. This “family” sequencing strategy will yield a recombination map and will serve as the basis for creating a pan-genome of cannabis.
On Tuesday, June 18, McKernan will reveal the initial results of this project in a live webinar that’s now open for registration. McKernan will also discuss cannabis genomics with a European audience at the 2019 SMRT Leiden conference in the Netherlands on May 7.
First there was Shadow, the poodle owned by gene-entrepreneur Craig Venter. Then there was Tasha, a female Boxer. Will the next de-coded dog be Maya, a German Shepherd Dog that helps police the campus at the University of Wisconsin-Madison?
Maya has been basking in social media celebrity alongside her human companion Sgt. Nic Banuelos, PhD students Lauren Baker and Emily Binversie, technician Jorden Gruel, veterinary surgeon-scientists Susannah Sample and Peter Muir, and Peter’s woven likeness, after winning the 2019 Plant and Animal SMRT Grant, co-sponsored by Histogenetics.
We caught up with the Comparative Genetics Research Laboratory to find out more about their project, which garnered more than 3,000 votes in a close online video competition amongst six finalists from institutions around the world.
A dog is not a dog is not a dog
When the genome assembly of Tasha the Boxer was released in 2005, little information was available about genomic structural variation in a domesticated species that has breed structure. At the time, a general consensus in the canine genetic community was well described by Kerstin Lindblad-Toh of the Broad Institute when she stated “A dog is a dog in a genomic sense.”
Not necessarily. As Binversie points out in the UW-Madison team’s crowd-pleaser video, the differences between Ms. Priscella the Papillon and Luna the Great Dane are extensive, even though they are the same species. The 250+ breeds of dogs display variation in a wide range of traits. There are giant breeds and toy breeds; curly coats and straight coats; short snouts and long snouts; short legs and long legs.
Different breeds also have different disease risks. More than 350 inherited diseases have been described in domestic dogs, nearly half of which predominate in a single breed or a small group of breeds.
Despite this phenotypic variation, there is still only one canine reference genome. It was updated with additional transcriptome data in 2014, but according to Sample and Muir, it’s just not cutting it anymore. The extent of breed-related genomic structural variation remains an important research question.
“Sequencing technologies are evolving rapidly, and they are refining the approaches by which research questions should be addressed,” Muir said. “Maybe, over time, we will be able to focus more on using de novo assembly for genome analysis. Until then, it will be important to have high-quality reference genomes.”
PacBio SMRT sequencing will finally enable Sample to delve deeply into the structural variation that exists between breeds, so that she can better understand their unique traits and diseases.
Among those diseases is fibrotic myopathy, a rare disorder that causes hind limb lameness in the German Shepherd Dog. The condition is especially crippling for service dogs who work with police and military officers, and there are no successful treatment options.
“The selective breeding that occurs creates fascinating genomic architecture specific to each breed. Understanding that further is going to be quite important,” Sample said.
Dog’s best friend
Dogs will not be the only ones to benefit from the project, Muir said.
Since dogs were first domesticated more than 10,000 years ago, they have lived closely alongside humans, and have shared many of the same pathological, dietary and environmental conditions. Understanding their genetic evolution could shed light on ours as well.
Dog diseases are also important models for human disease, Muir said.
“This is a positive step for veterinary medicine in general, and for animal and human One Health studies,” he added.
The UW-Madison team is currently selecting a candidate dog to sequence. They want a local companion animal or working dog whose health history is well documented, including orthopedic and neurological status. They also intend to follow the health of the dog over time and, when the dog passes away, possibly years into the future, conduct additional post-mortem studies for an even richer dataset.
Video made the Twitter stars
The team’s first experience with a multimedia grant application was a fun yet frantic one, involving some mass emails to users of the UW–Madison School of Veterinary Medicine to avoid panic when the police K9 unit (and its siren) showed up for the video shoot.
The publicity, which included a brief spot on local television, provided a great opportunity for some community education.
Although Binversie said she hasn’t been asked for any autographs yet, traffic to the group’s Facebook page skyrocketed, and owners of all breeds have been in touch to offer support — and samples.
As for the “Shroud of Muir,” the carpet was a gift from a visiting Turkish professor. Instead of being used as a mat for frustrated students to wipe their feet on, as Sample first anticipated, it has a place of honor in the computer laboratory.
We are pleased to announce the availability of our protocol for template preparation to support low DNA input for sequencing on the Sequel System. The low DNA input workflow features a 3.5-hour library prep using the SMRTbell Express Template Prep Kit 2.0 (PN 100-938-900) from as little as 150 ng of input gDNA. This workflow can be used for de novo genome assembly of up to 300 Mb and can be scaled up for larger genomes with additional gDNA input.
How low is low? This publication in the journal Genes shows how a team at the Wellcome Sanger Institute used the new method to generate a genome assembly for the Anopheles coluzzii mosquito using unamplified DNA from a single female insect. The new protocol will enable many new research areas, including performing de novo assemblies from individual small-bodied organisms.
To get started using the low DNA input workflow for your next project, we recommend you read our application note or contact your local FAS with any questions.
While some microbiologists can study their organisms of interest by growing them in cultures in the lab, many don’t have that luxury.
Most microbes and algae cannot be cultured, which is why environmental microbiologist Cody Sheik relies so heavily on DNA sequencing and why he is especially excited to use the PacBio platform for metagenomic studies using both targeted and shotgun sequencing approaches.
Sheik’s first exposure to PacBio sequencing came shortly after joining the faculty at the University of Minnesota at Duluth, where the sulfate-reducing bacterium Desulfovibrio desulfuricans strain G11 was being used as a model organism. First isolated in 1979, the microorganism has a long history in syntrophy research, but had never been fully characterized. So Sheik set out to do so, publishing its first complete genome sequence in 2017.
Since then, he has used the PacBio platform in several ways in the Sheik Aquatic Geomicrobiology Lab, where he studies the roles, diversity and biogeography of novel microorganisms in sediment, water columns, hydrothermal springs and deep subsurfaces.
As he described in a recent webinar, he is using amplicon sequencing to monitor eukaryotic phytoplankton communities in the Great Lakes. By amplifying the 18S rRNA gene, a commonly used eukaryotic phylogenetic marker, his team is able to trace the presence of diatoms or other phytoplankton at different locations and time points in a way that is more revealing and less time-consuming than traditional microscopy methods.
He is also hoping to use amplicon sequencing to overcome some of the limits of current DNA-based approaches to track invasive species, which require PCR primer sets specific to each species.
“This approach is very limited – you’re looking for one species per water sample,” Sheik said.
This summer, his team will test an approach that involves amplifying a common eukaryotic gene that several species should have, in order to develop a broader assay for many invasives. That will allow them to get answers “more rapidly than what we could do with just a PCR-based approach.”
In another line of inquiry, Sheik is examining the complex composition of “rock snot,” or Didymosphenia geminata, an invasive microscopic alga (a diatom) that can produce large amounts of stalk material to form thick brown mats on stream bottoms.
“Within this rock snot itself, you actually have multiple species of diatoms, and because it’s this very polysaccharide rich matrix that they live in, it’s really hard to pick individual cells out,” Sheik said.
He is hoping to extract high molecular weight DNA and generate draft genomes with long scaffolds from organisms in the rock snot matrix using a combination of the PacBio platform and DNA cross-linkage.
Sheik’s most ambitious project, however, is a $1 million NSF-funded study into how microorganisms thrive in an extreme environment half a mile underground in the Soudan Iron Mine near Ely, Minnesota. Microbial communities there live amongst rocks formed 2.8 billion years ago, in salty water rich in iron but without oxygen. The communities seem to differ greatly in composition from one another, even when they are located very close together, and Sheik wants to know how and why.
“Here’s where I’m getting really excited using the PacBio platform to really go after some of these interesting organisms living in these sorts of communities using metagenomics,” he said. “We’re also interested in using methylation patterns from PacBio sequencing to look at population structure,” since methylation fingerprints can vary significantly even between two strains of the same species.
Research interest in the human microbiome and the roles our bacterial, viral, and single-cell eukaryote co-inhabitants play in health, nutrition, immunity, and disease has exploded. Yet accurately measuring the composition of these microbial communities remains complex.
Sequence-based approaches allow the genetic material from complete collections of microbes to be analyzed without the need to cultivate the microorganisms. But each step in the process of collecting, extracting, preparing, sequencing and analyzing the DNA and data introduces its own set of errors and biases.
At the Innovation Lab of the University of Minnesota Genomics Center, research scientist Ben Auch and his colleagues are developing new tools to improve both microbiome and isolate characterization, and the Sequel System is providing some solutions.
As Auch explained in a recent webinar, DNA extraction is a significant source of bias. He referenced a 2017 study in Nature Biotechnology led by German researchers, which compared how 21 labs handled two fecal specimens, and whether different DNA extraction protocols affected microbiome test results. They found wide variation in results, particularly among gram-positive bacteria, which were often under-reported.
“It would be helpful to have a tool to assess this bias consistently across labs and samples, that could potentially be used as an inline process control to track bias across the entire microbiome workflow, including extraction,” Auch said.
His lab has partnered with Minnesota company Microbiologics to develop a “xenobiological microbial standard” — a cell-based microbial spike-in control made up of organisms not found in the human microbiome.
“We hope to be able to use this tool to capture the diversity of microbial properties that might be present in a microbiome sample and track how those properties influence the resulting data, and also to calculate absolute microbial abundance,” Auch said.
The prototype contains an even mixture of 12 organisms (six gram positive and six gram negative) that vary in environmental origin, GC (guanine-cytosine) content and genome size, which ranges from 2.14 to 9 Mb.
But to use this assemblage as a control, he needed the organisms to be well characterized. Unfortunately, in cases where the microbes had previously been sequenced, the existing genome assemblies tended to be highly fragmented. Others had no assemblies at all. In addition, Auch wanted to be sure his information reflected the exact strains they were using.
He turned to PacBio technology to comprehensively sequence all 12 organisms. And in order to make the endeavor more efficient and affordable, he used multiplexing.
“Compared to preparing libraries individually, this protocol is highly streamlined,” he said.
As he explained in the webinar, samples are first sheared to a consistent size of around 10 kb, then cleaned up and QC’ed for fragment size and concentration. The library prep is individual at this point, with each sample following the typical PacBio library prep through the ligation step. At that point barcoded adaptors are substituted for the default adaptor, the ligase is inactivated, and samples can be pooled based on a calculator that PacBio provides. The calculator decides how much of each ligation reaction should be pooled based on the concentration, shear size, and estimated genome size of all the samples.
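To give a feel for what such a pooling calculation involves, here is a minimal sketch. It is not PacBio's calculator, whose formula is not described here. It assumes a simple rule: each library contributes DNA mass in proportion to its estimated genome size, so that every genome receives roughly equal coverage, and it ignores shear size on the assumption that all samples were sheared to about the same length. The function name, sample names, and numbers are hypothetical.

```python
def pooling_volumes(samples, total_mass_ng):
    """Split a target pooled mass across barcoded libraries.

    samples maps a sample name to (genome_size_mb, concentration_ng_per_ul).
    Each sample's share of total_mass_ng is proportional to its genome size
    (an assumed rule, not PacBio's); the volume to pipette is mass divided
    by concentration. Returns a dict of sample name -> volume in microliters.
    """
    total_genome_mb = sum(size for size, _ in samples.values())
    if total_genome_mb <= 0:
        raise ValueError("total genome size must be positive")
    volumes = {}
    for name, (size_mb, conc) in samples.items():
        if conc <= 0:
            raise ValueError(f"concentration for {name} must be positive")
        mass_ng = total_mass_ng * size_mb / total_genome_mb
        volumes[name] = mass_ng / conc
    return volumes


# Hypothetical example: two libraries pooled to 80 ng in total.
example = pooling_volumes(
    {"small_genome": (2.0, 10.0), "large_genome": (6.0, 20.0)},
    total_mass_ng=80.0,
)
```

In this toy case the larger genome receives three times the mass of the smaller one, but because its library is twice as concentrated, it needs only 1.5 times the volume.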
The pooled libraries are now treated as a single sample, and they are moved into an optional, yet recommended, size selection step.
Once the library has been sequenced, it’s de-multiplexed in SMRT Link and ready for the downstream assembly process.
“You can generally plan to sequence 6-8 typical microbes on each SMRT Cell, or a total genome content of between 30 and 40 megabases. For smaller and less repetitive genomes, you might be able to get as many as 16 libraries on a single SMRT Cell,” Auch said.
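Auch's rule of thumb can be turned into a quick planning check. The sketch below treats the upper figures from the quote (40 Mb of total genome content, 16 libraries) as hard limits, which is a simplifying assumption; the function name and the example genome sizes are hypothetical, and genome repetitiveness is not modeled.

```python
MAX_TOTAL_MB = 40.0   # upper end of the 30-40 Mb range quoted above
MAX_LIBRARIES = 16    # upper bound quoted for small, less repetitive genomes


def fits_on_one_cell(genome_sizes_mb,
                     max_total_mb=MAX_TOTAL_MB,
                     max_libraries=MAX_LIBRARIES):
    """Return True if the planned multiplex stays within both limits."""
    return (len(genome_sizes_mb) <= max_libraries
            and sum(genome_sizes_mb) <= max_total_mb)


# Hypothetical plan: seven microbes with sizes in the 2.14 to 9 Mb range.
plan = [2.14, 9.0, 5.0, 4.5, 3.2, 6.1, 4.0]
plan_ok = fits_on_one_cell(plan)
```

A plan that fails the check would need to be split across more than one SMRT Cell.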
He shared the results of one run, which included seven samples, each of which was assembled in just one or two contigs.
“Considering the diversity of these microbes, I think it’s quite impressive to be able to get seven or more complete genomes out of a single prep and a single SMRT Cell,” Auch said.
As an added bonus, the sequencing data contains information about methylation in its raw reads, which can be further mined and shared with other researchers via the community database REBASE, curated by Nobel Laureate Rich Roberts.
“We’ve shown that diverse microbes across GC content, genome size, and environment, can be efficiently multiplexed on the PacBio Sequel System and they result in highly contiguous genomes,” Auch said. “High quality, complete microbial genomes are now very much within reach, from both a technical and cost perspective.”
To learn more about the Microbial Multiplexing Workflow on the Sequel System using the SMRTbell Express Template Prep Kit 2.0, check out this handy application guide.