Smoking out Structural Variants in the Cannabis Genome
Saturday, April 20, 2019
With its unique medicinal and psychoactive compounds, the popularity of cannabis is spreading… well, like a weed. Now legal in 10 states for recreational use, and in 33 for medical use (with the FDA approval of the first oral cannabis drug for epilepsy on June 25, 2018), the once-forbidden plant is primed to become one of the most talked-about, and most valuable, agricultural crops.
But what needs to be done to take this promising crop into the clinic?
Sound science, accurate testing protocols, and stringent tracking systems. All of these can be achieved through genomics, according to Kevin McKernan, the former research and development lead on the Human Genome Project at MIT, whose company Medicinal Genomics (MGC) created the first Cannabis sativa genome in 2011.
As McKernan himself will admit, that first attempt was a bit of a mess.
“The draft assembly included hundreds of thousands of pieces, and was hardly functional. The sequencing technology we had back then just couldn’t handle all the repeat content and the polymorphism of the genome,” he said. “Over the next seven years, a lot of people tried to improve it, but they were only achieving average lengths of 159 kb N50s. This past spring (2018) we decided to nail it.”
The result, as reported by the Center for Open Science, was a high-quality reference assembly of the Jamaican Lion strain that is 1,000 times more contiguous than the 2011 assembly.
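For readers unfamiliar with the N50 figure McKernan cites, here is a minimal sketch of how that contiguity statistic is usually computed. The function name and the toy contig lengths are illustrative assumptions, not values or code from the MGC assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths): one fragmented and one contiguous
# assembly of the same 54-base "genome".
fragmented = [2, 3, 4, 5, 6, 7, 8, 9, 10]
contiguous = [30, 24]
print(n50(fragmented), n50(contiguous))
```

A higher N50 for the same total length means the assembly is in fewer, longer pieces, which is what "more contiguous" refers to.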
More than 180 billion bases were sequenced on the Sequel system, allowing the Medicinal Genomics team to select the longest reads as the foundation for the DNA assembly process. The reads were so long that every base was covered over 15 times with 60,000 base pair reads.
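The idea of building an assembly on the longest reads can be sketched as follows: sort the reads by length and keep taking the longest until they cover the genome a chosen number of times. This is only an illustration under stated assumptions. The function name, the toy read lengths, and the genome size are hypothetical, and this is not the Medicinal Genomics pipeline.

```python
def longest_reads_for_coverage(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until they reach the target coverage.

    Coverage here is simply (total selected bases) / (genome size).
    Returns the selected read lengths and the coverage they achieve.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    needed_bases = genome_size * target_coverage
    selected = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= needed_bases:
            break
        selected.append(length)
        total += length
    return selected, total / genome_size


# Toy example: a hypothetical 10 kb genome and four reads.
reads = [5_000, 60_000, 10_000, 50_000]
chosen, coverage = longest_reads_for_coverage(reads, genome_size=10_000,
                                              target_coverage=10)
print(chosen, coverage)
```

If the available reads cannot reach the target, the function returns all of them along with the lower coverage actually achieved.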
This was important because the cannabis genome is 10 times more varied than the human genome and is highly repetitive. The most interesting cannabinoid and terpene synthase genes appear to have been multiplied in tandem, and they are separated by very large repeats (32 kb, 64 kb, 96 kb) that are longer than the read lengths of most other sequencing platforms.
There are more than 483 different identifiable chemical constituents known to exist in cannabis, of which over 80 are unique to the cannabis plant. These constituents include nitrogenous compounds, amino acids, proteins, glycoproteins, enzymes, sugars and related compounds, hydrocarbons, alcohols, acids, esters, aldehydes, ketones, fatty acids, lactones, steroids, terpenes, non-cannabinoid phenols, flavonoids, vitamins, pigments and elements.
CBD, the active ingredient in the FDA-approved epilepsy drug Epidiolex, is a chemical component of the Cannabis sativa plant. However, CBD does not cause the intoxication or euphoria (the “high”) that comes from tetrahydrocannabinol (THC), the primary psychoactive component of marijuana. Different strains of cannabis (separated into Type I, Type II, Type III cannabis, and hemp lines) have different quantities of these compounds.
Having a comprehensive cannabis reference genome of a Type II (THC- and CBD-producing) variety is going to help tremendously in understanding the genetics of the plant and how to breed for more CBD or different esoteric cannabinoids. It opens the door to a host of industry innovations, including marker-assisted selection for genetically based strain identification, accelerated breeding to improve production yields, reliable seed-to-sale tracking systems, and pathogen identification to ensure cannabis purity and safety.
“We now have a genome that is better than most of the other agricultural crops out there,” McKernan said. “But many more cultivars require sequencing to better understand these complex loci. We want a pan-genome. We want 12 really well done genomes, all sequenced with the same technology, to account for the different number of CBCAS, CBDAS, and THCAS (cannabinoid synthase) genes observed in each genome.”
So McKernan selected PacBio SMRT Sequencing to achieve this as part of the Cannabis Pan-Genome Project announced on Thursday. With MGC’s assembly of the female Jamaican Lion cultivar as a baseline, genomic DNA from a sibling male plant and multiple offspring has been isolated and is being sequenced on the Sequel II System, a long-read platform, to identify structural variations and other important types of genetic variation. This “family” sequencing strategy will yield a recombination map and will serve as the basis for creating a pan-genome of cannabis.