This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
With its unique medicinal and psychoactive compounds, the popularity of cannabis is spreading… well, like a weed. Now legal in 10 states for recreational use, and in 33 for medical use (with the FDA approval of the first oral cannabis drug for epilepsy on June 25, 2018), the once-forbidden plant is primed to become one of the most talked-about, and valuable, agricultural crops.
But what needs to be done to take this promising crop into the clinic?
Sound science, accurate testing protocols, and stringent tracking systems, all of which can be achieved through genomics, according to Kevin McKernan, the former research and development lead on the Human Genome Project at MIT, whose company Medicinal Genomics (MGC) created the first Cannabis sativa genome in 2011.
As McKernan himself will admit, that first attempt was a bit of a mess.
“The draft assembly included hundreds of thousands of pieces, and was hardly functional. The sequencing technology we had back then just couldn’t handle all the repeat content and the polymorphism of the genome,” he said. “Over the next seven years, a lot of people tried to improve it, but they were only achieving average lengths of 159 kB N50s. This past spring (2018) we decided to nail it.”
The result, as reported by the Center for Open Science, was a high-quality reference assembly of the Jamaican Lion strain that is 1,000 times more contiguous than the 2011 assembly.
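For readers less familiar with the N50 figure McKernan mentions, here is a minimal sketch of how that contiguity metric is commonly computed. The function name and the toy contig lengths are ours for illustration; they are not taken from the Jamaican Lion assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the length L such that contigs of
    length >= L together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, in kb): one long contig dominates.
toy_assembly = [1, 1, 1, 1, 10]
print(n50(toy_assembly))
```

A fragmented assembly made of many short pieces has a low N50, while an assembly built from a few long contigs has a high one, which is why the metric is a convenient shorthand for contiguity.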
More than 180 billion bases were sequenced on the Sequel system, allowing the Medicinal Genomics team to select the longest reads as the foundation for the DNA assembly process. The reads were so long that every base was covered over 15 times with 60,000 base pair reads.
This was important because the cannabis genome is 10 times more varied than the human genome. It is also highly repetitive, and the most interesting cannabinoid and terpene synthase genes appear to have been tandemly multiplied and are separated by very large repeats (32 kb, 64 kb, 96 kb) that are longer than the reads most other sequencing platforms can produce.
There are more than 483 different identifiable chemical constituents known to exist in cannabis, of which over 80 are unique to the cannabis plant. These constituents include nitrogenous compounds, amino acids, proteins, glycoproteins, enzymes, sugars and related compounds, hydrocarbons, alcohols, acids, esters, aldehydes, ketones, fatty acids, lactones, steroids, terpenes, non-cannabinoid phenols, flavonoids, vitamins, pigments and elements.
CBD, the active ingredient in the FDA-approved epilepsy drug Epidiolex, is a chemical component of the Cannabis sativa plant. However, CBD does not cause intoxication or euphoria (the “high”) that comes from tetrahydrocannabinol (THC), the primary psychoactive component of marijuana. Different strains of cannabis (separated into Type I, Type II, Type III cannabis, and hemp lines) have different quantities of these compounds.
Having a comprehensive cannabis reference genome of a Type II (THC and CBD producing) variety is going to help tremendously in understanding the genetics of the plant and how to breed for more CBD or different esoteric cannabinoids. It opens the door to a host of industry innovations, including marker-assisted selection for genetically-based strain identification, accelerated breeding to improve production yields, reliable seed-to-sale tracking systems, and pathogen identification to ensure cannabis purity and safety.
“We now have a genome that is better than most of the other agricultural crops out there,” McKernan said. “But many more cultivars require sequencing to better understand this complex loci. We want a pan-genome. We want 12 really well done genomes, all sequenced with the same technology, to account for the different number of CBCAS, CBDAS, and THCAS (cannabinoid synthase) genes observed in each genome.”
So McKernan selected PacBio SMRT Sequencing to achieve this as part of the Cannabis Pan-Genome Project announced on Thursday. Using MGC’s assembly of the female Jamaican Lion cultivar as a baseline, genomic DNA from a sibling male plant and multiple offspring was isolated and is being sequenced with the Sequel II System long-read platform to identify structural variations and other types of important genetic variations. This “family” sequencing strategy will yield a recombination map and will serve as the basis for creating a pan-genome of cannabis.
On Tuesday, June 18, McKernan will be revealing the initial results of this project via a live webinar that’s now open for registration. McKernan will also discuss cannabis genomics with a European audience at the 2019 SMRT Leiden conference in the Netherlands on May 7.
First there was Shadow, the poodle owned by gene-entrepreneur Craig Venter. Then there was Tasha, a female Boxer. Will the next de-coded dog be Maya, a German Shepherd Dog that helps police the campus at the University of Wisconsin-Madison?
Maya has been basking in social media celebrity alongside her human companion Sgt. Nic Banuelos, PhD students Lauren Baker and Emily Binversie, technician Jorden Gruel, veterinary surgeon-scientists Susannah Sample and Peter Muir, and Peter’s woven likeness, after winning the 2019 Plant and Animal SMRT Grant, co-sponsored by Histogenetics.
We caught up with the Comparative Genetics Research Laboratory to find out more about their project, which garnered more than 3,000 votes in a close online video competition amongst six finalists from institutions around the world.
A dog is not a dog is not a dog
When the genome assembly of Tasha the Boxer was released in 2005, little information was available about genomic structural variation in a domesticated species that has breed structure. At the time, a general consensus in the canine genetic community was well described by Kerstin Lindblad-Toh of the Broad Institute when she stated “A dog is a dog in a genomic sense.”
Not necessarily. As Binversie points out in the UW-Madison team’s crowd-pleaser video, the differences between Ms. Priscella the Papillon and Luna the Great Dane are extensive, even though they are the same species. The 250+ breeds of dogs display variation in a wide range of traits. There are giant breeds and toy breeds; curly coats and straight coats; short snouts and long snouts; short legs and long legs.
Different breeds also have different disease risks. More than 350 inherited diseases have been described in domestic dogs, nearly half of which predominate in a single breed or a small group of breeds.
Despite this phenotypic variation, there is still only one canine reference genome. It was updated with additional transcriptome data in 2014, but according to Sample and Muir, it’s just not cutting it anymore. The extent of breed-related genomic structural variation remains an important research question.
“Sequencing technologies are evolving rapidly, and they are refining the approaches by which research questions should be addressed,” Muir said. “Maybe, over time, we will be able to focus more on using de novo assembly for genome analysis. Until then, it will be important to have high-quality reference genomes.”
PacBio SMRT sequencing will finally enable Sample to delve deeply into the structural variation that exists between breeds, so that she can better understand their unique traits and diseases.
Among those diseases is fibrotic myopathy, a rare disorder that causes hind limb lameness in the German Shepherd Dog. The condition is especially crippling for service dogs who work with police and military officers, and there are no successful treatment options.
“The selective breeding that occurs creates fascinating genomic architecture specific to each breed. Understanding that further is going to be quite important,” Sample said.
Dog’s best friend
Dogs will not be the only ones to benefit from the project, Muir said.
Since dogs were first domesticated more than 10,000 years ago, they have lived closely alongside humans, and have shared many of the same pathological, dietary and environmental conditions. Understanding their genetic evolution could shed light on ours as well.
Dog diseases are also important models for human disease, Muir said.
“This is a positive step for veterinary medicine in general, and for animal and human One Health studies,” he added.
The UW-Madison team is currently selecting a candidate dog to sequence. They want a local companion animal or working dog whose health history is well documented, including orthopedic and neurological status. They also intend to follow the health of the dog over time, and to conduct additional post-mortem studies for an even richer dataset, possibly years into the future when the dog passes away.
Video made the Twitter stars
The team’s first experience with a multimedia grant application was a fun yet frantic one, involving some mass emails to users of the UW–Madison School of Veterinary Medicine to avoid panic when the police K9 unit (and its siren) showed up for the video shoot.
The publicity, which included a brief spot on local television, provided a great opportunity for some community education.
Although Binversie said she hasn’t been asked for any autographs yet, traffic to the group’s Facebook page skyrocketed, and owners of all breeds have been in touch to offer support — and samples.
As for the “Shroud of Muir,” the carpet was given as a gift from a Turkish visiting professor. Instead of being used as a mat for frustrated students to wipe their feet on, as Sample first anticipated, it has a place of honor in the computer laboratory.
We are pleased to announce the availability of our protocol for template preparation to support low DNA input for sequencing on the Sequel System. The low DNA input workflow features a 3.5-hour library prep using the SMRTbell Express Template Prep Kit 2.0 (PN 100-938-900) from as little as 150 ng of input gDNA. This workflow can be used for de novo genome assembly of up to 300 Mb and can be scaled up for larger genomes with additional gDNA input.
How low is low? This publication in the journal Genes shows how a team at the Wellcome Sanger Institute used the new method to generate a genome assembly for the Anopheles coluzzii mosquito using unamplified DNA from a single female insect. The new protocol will enable many new research areas, including performing de novo assemblies from individual small-bodied organisms.
To get started using the low DNA input workflow for your next project, we recommend you read our application note or contact your local FAS with any questions.
While some microbiologists can study their organisms of interest by growing them in cultures in the lab, many don’t have that luxury.
Most microbes and algae cannot be cultured, which is why environmental microbiologist Cody Sheik relies so heavily on DNA sequencing and why he is especially excited to use the PacBio platform for metagenomic studies using both targeted and shotgun sequencing approaches.
Sheik’s first exposure to PacBio sequencing came shortly after joining the faculty at the University of Minnesota at Duluth, where the sulfate-reducing bacterium Desulfovibrio desulfuricans strain G11 was being used as a model organism. First isolated in 1979, the microorganism has a long history in syntrophy research, but had never been fully characterized. So Sheik set out to do so, publishing its first complete genome sequence in 2017.
Since then, he has used the PacBio platform in several ways in the Sheik Aquatic Geomicrobiology Lab, where he studies the roles, diversity and biogeography of novel microorganisms in sediment, water columns, hydrothermal springs and deep subsurfaces.
As he described in a recent webinar, he is using amplicon sequencing to monitor eukaryotic phytoplankton communities in the Great Lakes. By amplifying the 18S rRNA gene, a commonly used eukaryotic phylogenetic marker, his team is able to trace the presence of diatoms or other phytoplankton at different locations and time points in a way that is more revealing and less time-consuming than traditional microscopy methods.
He is also hoping to use amplicon sequencing to overcome some of the limits of current DNA-based approaches to track invasive species, which require PCR primer sets specific to each species.
“This approach is very limited – you’re looking for one species per water sample,” Sheik said.
This summer, his team will test an approach that involves amplifying a common eukaryotic gene that several species should have and develop a broader assay for many invasives, which will allow them to get answers “more rapidly than what we could do with just a PCR-based approach.”
In another line of inquiry, Sheik is examining the complex composition of “rock snot,” or Didymosphenia geminata, an invasive microscopic alga (a diatom) that can produce large amounts of stalk material to form thick brown mats on stream bottoms.
“Within this rock snot itself, you actually have multiple species of diatoms, and because it’s this very polysaccharide rich matrix that they live in, it’s really hard to pick individual cells out,” Sheik said.
He is hoping to extract high molecular weight DNA and generate draft genomes with long scaffolds from organisms in the rock snot matrix using a combination of the PacBio platform and DNA cross-linkage.
Sheik’s most ambitious project, however, is a $1 million NSF-funded study into how microorganisms thrive in an extreme environment half a mile underground in the Soudan Iron Mine near Ely, Minnesota. Microbial communities there live amongst rocks formed 2.8 billion years ago, in salty water that is rich in iron but lacks oxygen. The communities seem to differ greatly in composition, even when they are located very close together, and Sheik wants to know how and why.
“Here’s where I’m getting really excited using the PacBio platform to really go after some of these interesting organisms living in these sorts of communities using metagenomics,” he said. “We’re also interested in using methylation patterns from PacBio sequencing to look at population structure,” since methylation fingerprints can vary significantly even between two strains of the same species.
Research interest in the human microbiome and the roles our bacterial, viral, and single-cell eukaryote co-inhabitants play in health, nutrition, immunity, and disease has exploded. Yet accurately measuring the composition of these microbial communities remains complex.
Sequence-based approaches allow the genetic material from complete collections of microbes to be analyzed without the need to cultivate the microorganisms. But each step in the process of collecting, extracting, preparing, sequencing and analyzing the DNA and data introduces its own set of errors and biases.
At the Innovation Lab of the University of Minnesota Genomics Center, research scientist Ben Auch and his colleagues are developing new tools to improve both microbiome and isolate characterization, and the Sequel System is providing some solutions.
As Auch explained in a recent webinar, DNA extraction is a significant source of bias. He referenced a 2017 study in Nature Biotechnology led by German researchers, which compared how 21 labs handled two fecal specimens, and whether different DNA extraction protocols affected microbiome test results. They found wide variation in results, particularly among gram-positive bacteria, which were often under-reported.
“It would be helpful to have a tool to assess this bias consistently across labs and samples, that could potentially be used as an inline process control to track bias across the entire microbiome workflow, including extraction,” Auch said.
His lab has partnered with Minnesota company Microbiologics to develop a “xenobiological microbial standard” — a cell-based microbial spike-in control made up of organisms not found in the human microbiome.
“We hope to be able to use this tool to capture the diversity of microbial properties that might be present in a microbiome sample and track how those properties influence the resulting data, and also to calculate absolute microbial abundance,” Auch said.
The prototype contains an even mixture of 12 organisms – six gram positive and six gram negative – ranging in environmental origin, GC (guanine-cytosine) content and genome sizes, from 2.14 to 9 Mb.
But to use this assemblage as a control, he needed them to be well characterized. Unfortunately, in cases where the microbes had previously been sequenced, the existing genome assemblies tended to be highly fragmented. Others had no assemblies at all. In addition, Auch wanted to be sure his information reflected the exact strains they were using.
He turned to PacBio technology to comprehensively sequence all 12 organisms. And in order to make the endeavor more efficient and affordable, he used multiplexing.
“Compared to preparing libraries individually, this protocol is highly streamlined,” he said.
As he explained in the webinar, samples are first sheared to a consistent size of around 10 kb then cleaned up and QC’ed for fragmentation size and concentration. The library prep is individual at this point, with each sample following the typical PacBio library prep through the ligation step. At that point barcoded adaptors are substituted for the default adaptor, the ligase is inactivated and samples can be pooled based on a calculator that PacBio provides. The calculator decides how much of each ligation reaction should be pooled based on the concentration, shear size, and estimated genome size of all the samples.
The pooled libraries are now treated as a single sample, and they are moved into an optional, yet recommended, size selection step.
Once the library has been sequenced, it’s de-multiplexed in SMRT Link and ready for the downstream assembly process.
“You can generally plan to sequence 6-8 typical microbes on each SMRT cell, or a total genome content of between 30 and 40 megabases. For smaller and less repetitive genomes, you might be able to get as many as 16 libraries on a single SMRT Cell,” Auch said.
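To make the planning rule of thumb and the pooling step more concrete, here is a rough sketch of the kind of arithmetic involved. It is not PacBio's pooling calculator, whose formula is not described here. We assume, for illustration only, that each library's share of the pooled DNA mass is proportional to its estimated genome size, and we ignore shear size. The `Library` class, the function names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Library:
    name: str
    genome_size_mb: float   # estimated genome size in megabases
    conc_ng_per_ul: float   # concentration of the ligation reaction


def total_genome_content_mb(libraries):
    """Sum of estimated genome sizes across the pool."""
    return sum(lib.genome_size_mb for lib in libraries)


def fits_on_one_cell(libraries, budget_mb=40.0):
    """Check the pool against the upper end of the 30-40 Mb guideline quoted above."""
    return total_genome_content_mb(libraries) <= budget_mb


def pooling_volumes_ul(libraries, total_mass_ng):
    """ASSUMPTION: split the pooled mass in proportion to genome size,
    then convert each library's mass to a volume using its concentration."""
    total_mb = total_genome_content_mb(libraries)
    volumes = {}
    for lib in libraries:
        if lib.conc_ng_per_ul <= 0:
            raise ValueError(f"{lib.name}: concentration must be positive")
        mass_ng = total_mass_ng * lib.genome_size_mb / total_mb
        volumes[lib.name] = mass_ng / lib.conc_ng_per_ul
    return volumes


# Hypothetical two-isolate pool.
libraries = [
    Library("isolate_A", genome_size_mb=2.0, conc_ng_per_ul=10.0),
    Library("isolate_B", genome_size_mb=6.0, conc_ng_per_ul=20.0),
]
volumes = pooling_volumes_ul(libraries, total_mass_ng=400.0)
print(fits_on_one_cell(libraries), volumes)
```

Weighting by genome size is one simple way to aim for similar coverage across organisms of different sizes. For a real run, use the calculator PacBio provides, which also takes shear size into account.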
He shared the results of one run, which included seven samples, each of which was assembled in just one or two contigs.
“Considering the diversity of these microbes, I think it’s quite impressive to be able to get seven or more complete genomes out of a single prep and a single SMRT Cell,” Auch said.
As an added bonus, the sequencing data contains information about methylation in its raw reads, which can be further mined and shared with other researchers via the community database REBASE, curated by Nobel Laureate Rich Roberts.
“We’ve shown that diverse microbes across GC content, genome size, and environment, can be efficiently multiplexed on the PacBio Sequel System and they result in highly contiguous genomes,” Auch said. “High quality, complete microbial genomes are now very much within reach, from both a technical and cost perspective.”
To learn more about Microbial Multiplexing Workflow on the Sequel System using the SMRTbell Express Template Prep Kit 2.0, check out this handy application guide.
by Jonas Korlach, CSO
It’s been really exciting to see a spate of publications coming out that demonstrate the utility of SMRT Sequencing for determining the underlying genetic cause of diseases that have long gone unsolved. Discovery of the pathogenic variants behind these diseases is not just academic progress; it can give answers to people who have been seeking them for years or even generations.
Here are several recent examples of the great work happening in this area. Congratulations to these teams and all other scientists who are using SMRT Sequencing to advance our understanding of disease.
Scientists from University Hospitals Leuven and collaborating institutions used SMRT Sequencing to test the theory that tandem repeat expansions could explain some cases of X-linked intellectual disability. They targeted more than 1,800 tandem repeats on the X chromosome, finding a candidate causal repeat expansion associated with relevant gene expression changes.
At Yokohama City University Graduate School of Medicine and collaborating organizations, scientists deployed SMRT Sequencing to uncover the genetic mechanism responsible for an unexplained form of epilepsy. Using low-coverage whole genome data, they identified six structural variants in a genetic region of interest and determined that one, a 4.6 kb repeat insertion, was causative for the disease.
In this study, scientists from University College London and the University of Oxford aimed to characterize a triplet repeat associated with Fuchs endothelial corneal dystrophy across several samples. Using an amplification-free CRISPR/Cas9 method with SMRT Sequencing, they accurately determined repeat length and instability levels, generated genotype data, and phased alleles.
Scientists in Japan turned to SMRT Sequencing to explain a familial form of epilepsy that had not been solved with whole-exome sequencing. Low-fold whole genome sequence data allowed the team to detect more than 17,000 structural variants and quickly filter that list to find the 12.4 kb deletion responsible for this family’s condition.
At Sanford Burnham Prebys Medical Discovery Institute, scientists produced the first evidence of somatic gene recombination in the human brain. Using SMRT Sequencing, they found recombination of a gene associated with Alzheimer’s disease and characterized thousands of variants.
Scientists in the UK and US analyzed a somatically unstable repeat expansion that causes myotonic dystrophy type 1. Thanks to SMRT Sequencing, they were able to characterize the repeat expansion region in several individuals and confirm that people whose expansion regions were interrupted by CCG or CGG variant repeats had milder symptoms than those with pure CTG repeats.
This preprint from a large team of scientists describes the use of SMRT Sequencing to investigate the cause of a progressive neurodegenerative disease that is difficult to diagnose. Studying one family, they identified a repeat expansion found in all affected individuals but no unaffected individuals. Similar expansions were then found in several unrelated families with this disease.
In a study just published in the journal Biology of Blood and Marrow Transplantation, scientists at the Anthony Nolan Research Institute demonstrated that ultra-high-resolution HLA typing performed with SMRT Sequencing identified stronger matches associated with improved survival rates among patients who received hematopoietic cell transplants.
The Anthony Nolan Research Institute, which is funded by Anthony Nolan, a registered UK charity that maintains the world’s oldest stem cell registry, implemented SMRT Sequencing to fully phase and characterize HLA genes with high accuracy. The HLA genes are highly polymorphic and complex, making them very difficult to resolve fully with conventional technologies. A thorough analysis of them is important for finding the best donor/recipient matches for stem cell transplants and other applications.
In this retrospective study, scientists aimed to determine whether high-resolution HLA typing enabled by SMRT Sequencing would have made a difference for previously matched donors and recipients. They analyzed 891 donor/recipient pairs, all of which had originally been considered a perfect match (a 12/12 score for all six HLA genes). SMRT Sequencing revealed that 29.1% of those matches were not actually perfect and identified previously undetected variation in nearly a quarter of the pairs.
The patients whose 12/12 matches were confirmed by SMRT Sequencing had a significantly improved five-year overall survival of 54.8%, compared to 30.1% for those whose 12/12 matches were changed based on higher-resolution information. Furthermore, ultra-high-resolution 12/12 HLA-matched patients also had significantly higher five-year overall survival (55.1%) than those patients with any degree of mismatching (40.1%).
The study also showed that perfectly matched patients were less likely to die of other transplant-related complications in the 12 months post-transplant, and significantly less likely to develop acute graft-versus-host disease. The study highlights the importance of sequencing through previously uncharacterized regions of the traditional HLA genes, showing that polymorphisms in these regions affect patient overall survival.
In a statement about this study, Neema Mayor from the Anthony Nolan Research Institute said: “We believe that HLA matching at ultra-high resolution could ultimately enable us to further minimize the risk of complications such as graft-versus-host disease and, consequently, the risk of mortality — potentially saving more lives in the future.”
It seems like there is a new story every week in the mainstream press about the unexpected ways the bacteria living on and within us impact health, disease, and even our behavior. The torrent of new discoveries unleashed by high-throughput sequencing has captured the imagination of scientists and laypeople alike. Scientists at Second Genome are hoping to apply these insights to improve human health, leveraging their bioinformatics expertise to mine bacterial communities for potential therapeutics. Second Genome is a clinical stage pharmaceutical company with a mission to redefine disease in the context of microbiome medicine and create therapeutics that can address unmet medical needs. Recently they teamed up with scientists here at PacBio to explore how long-read sequencing might supplement their short-read-based pipeline for gene discovery, using an environmental sample as a test case. They were especially interested in identifying unique, complete, and error-free gene clusters in metagenomic assemblies.
The team began by spiking the sample with internal controls and generating a 10 kb insert library. The library was sequenced on two SMRT Cells with 20-hour movies on the Sequel System. The resulting data was analyzed in two ways. First, the ccs algorithm was applied to generate ~270,000 HiFi reads per SMRT Cell 1M at a minimum accuracy of 99%. Since these reads are on average 10 times longer than the average bacterial gene, full-length genes can be discovered even without assembly. Bioinformaticians at Second Genome then used two different gene prediction programs to evaluate the usability of HiFi reads for gene discovery. Next, the raw data was assembled with Canu, and the discovered genes were mapped back onto the contigs.
How did the results shape up? As you can see, two SMRT Cells of data revealed an impressive number of genes and a highly contiguous assembly with a mean contig size of 93 kb. In addition, among the contigs were two closed bacterial genomes.
Table 1. Metagenome assembly statistics
Comparing the performance of the two gene discovery algorithms, scientists at Second Genome found Prodigal predicted a large number of genes that were concordant with expected genes regardless of the sequencing technology, assembler, or annotation tool used in the source genome. However, more analysis is needed to make a conclusion about the program performance.
Table 2. Comparison of gene discovery programs for all identified genes
Second Genome then took the analysis one step further and assessed whether long-read sequencing was a good value for their business. Rather than fall back on the commonly used ‘cost per base’ metric, they developed their own more pertinent way of measuring success: what was the cost per error-free, unique predicted protein? By normalizing their data to calculate unique predicted protein per $1,000, they found that PacBio sequencing was actually twice as cost-effective as short-read technology at discovering complete genes from the same DNA sample. Whereas short-read technology predicted ~17,000 full-length proteins per $1,000 of data, PacBio data yielded ~36,000 predicted proteins. Similarly, PacBio sequencing recovered approximately twice as many spike-in sequences per $1,000 invested.
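The "per $1,000" normalization above is easy to turn around into a cost per predicted protein. The short sketch below does that arithmetic using only the two approximate yields quoted in the paragraph; the function and variable names are ours, and the rounded inputs mean the outputs are rough estimates rather than Second Genome's own figures.

```python
def cost_per_protein_usd(proteins_per_1000_usd):
    """Convert a yield per $1,000 of data into a cost per predicted protein."""
    if proteins_per_1000_usd <= 0:
        raise ValueError("yield must be positive")
    return 1000.0 / proteins_per_1000_usd


# Approximate yields quoted above (full-length predicted proteins per $1,000).
SHORT_READ_YIELD = 17_000
PACBIO_YIELD = 36_000

short_read_cost = cost_per_protein_usd(SHORT_READ_YIELD)
pacbio_cost = cost_per_protein_usd(PACBIO_YIELD)
yield_ratio = PACBIO_YIELD / SHORT_READ_YIELD

print(f"short reads: ${short_read_cost:.3f} per protein")
print(f"PacBio:      ${pacbio_cost:.3f} per protein")
print(f"yield ratio: {yield_ratio:.2f}x")
```

With these rounded inputs the yield ratio comes out a little above two, consistent with the "twice as cost-effective" summary.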
Table 3. Comparison of yield per $1,000 for different sequencing and gene-calling methods of identifying full-length proteins from spike-in genomes
Reviewing the results, Todd DeSantis, Second Genome Co-founder and VP of Informatics, said “PacBio’s long, accurate single molecule reads allowed us to more accurately discover novel and complete proteins that were encoded in our valuable and complex microbiome specimens. The prospect of bypassing assembly may increase our rate of discovery.” Furthermore, given the median contig size of 24 kb, most of the PacBio genes are collocated with numerous other genes on the same contig.
The results demonstrate that long-read sequencing technology can be successfully applied to metagenomes from complex communities and is complementary to short-read technology. PacBio sequencing was more effective at discovering complete genes than short-read sequencing in this study. Even more exciting is the prospect of replicating this study on the forthcoming Sequel II System which, with eight times as many ZMWs, will make the cost comparison even more favorable. Todd DeSantis said, “We anticipate the Sequel II System may enable scientists to more completely characterize microbial communities. This could ultimately yield significant discoveries in the microbiome and metagenomic space at a reduced cost to what we are seeing today with Sequel and short reads.”
One of the world’s fastest-growing foods is also one of its most vulnerable. Without an adaptive immune system, the Pacific white shrimp, Litopenaeus vannamei, relies on cellular and humoral defenses, such as the release of antimicrobial peptides, in its battle against invading microbes and pathogen infections. It is a battle the shrimp are losing, leading to massive mortality and devastating economic losses.
A full-length transcriptome analysis using the PacBio Iso-Seq method has resulted in an isoform-level reference transcriptome that is shedding new light on the shrimp’s innate immune system, providing hope for the shrimp aquaculture industry.
One of the most economically important shrimp species in the global aquaculture industry, Pacific white shrimp global production grew from 2,688,901 tons in 2010 to 4,168,417 tons in 2016, according to The Food and Agriculture Organization of the United Nations. However, farmed shrimp supplies are under significant threat from three major shrimp pathogens: acute hepatopancreatic necrosis disease (AHPND), white spot syndrome virus (WSSV) and bacteria in the genus Vibrio.
By interrogating the transcriptome of the vulnerable species, a team of scientists from the Guangdong Institute of Applied Biological Resources in China was able to identify more than 5,000 full-length transcripts involved in its innate immune system, including nine immune-related processes, 19 immune-related pathways and 10 other immune-related systems. They also found a wide range of transcript variants, including toll-like receptors (TLRs) and interferon regulatory factors (IRFs), which increased the number and functional complexity of immune molecules.
Reporting in Fish and Shellfish Immunology, Chen Jinping and first author Zhang Xiujuan described how they combined PacBio isoform sequence (Iso-Seq) analysis and Illumina paired-end short read methods to discover 72,648 nonredundant full-length transcripts (unigenes) with an average length of 2,545 bp from five main tissues: the hepatopancreas, cardiac stomach, heart, muscle, and pyloric stomach.
The team turned to targeted isoform sequencing due to the difficulty they would have faced trying to sequence the entire L. vannamei genome, which currently has no reference. The genome is large and contains highly repetitive sequences; previous attempts to characterize its transcriptome using short reads restricted the yield of full-length cDNA molecules, the authors wrote.
“Using short read RNA-Seq strategies, extensive alternative prediction is impractical and a high variability of isoforms expression quantification is impossible in shrimp without a true genome reference,” the authors wrote. “The PacBio Iso-Seq strategy provides the convenience of finding more numbers of AS events of genes in many species, including reference-free species.”
The scientists used homology-based cDNA cloning to amplify full-length sequences of the shrimp’s immune genes to generate a high-confidence isoform dataset that was sequenced at Nextomics Biosciences in Wuhan, China. After annotating these full-length transcripts with well-curated databases, long noncoding RNAs (lncRNAs) and alternative splicing events were characterized.
Using the full-length isoform transcripts yielded from the SMRT Iso-Seq analysis as reference sequences, the unigene expression levels among the various tissues of L. vannamei were further analyzed based on short read datasets generated by the Illumina sequencing platform.
“Understanding the innate immune system of shrimp and revealing their immune responses against invading pathogens might contribute to developing strategies for the prevention and treatment of these diseases, which is essential for the shrimp aquaculture industry,” the authors wrote.
This survey of transcript variants and expression profiles of the immune-related molecules of L. vannamei has contributed to a comprehensive insight into the immune system, and will provide a valuable resource for geneticists and the commercial sector alike, they concluded.
The modern world might benefit from a return to our ancient roots by expanding the cultivation of one of the first domesticated crops, broomcorn millet.
Foodies will appreciate that the crop, a staple in many semi-arid regions of Asia and Europe, is gluten-free and extremely nutritious, with higher levels of protein, several minerals, and antioxidants than most other cereals.
Farmers will appreciate that the drought-resistant plant has the highest water-use efficiency among all cereal crops (i.e., the most grain produced with the same amount of water), a short life cycle (60–90 days), and a high harvest index.
And thanks to high-quality reference genomes recently published by two different scientific teams in China, breeders can now get to work mining the crop’s novel genetic traits.
First domesticated in Northern China around 10,000 years ago, broomcorn millet (Panicum miliaceum L.) — also known as common millet, proso millet, and hog millet — is mainly used for dryland farming where most other crops have failed, or as a summer rotation crop in temperate regions. Broomcorn millet breeding programs have benefitted little from genomic technologies, and have been conducted only on a small scale and in isolated regions of the world, but the crop’s hardiness in changing climate conditions has generated renewed interest.
“The genetic diversity of broomcorn millet varieties from different regions of the world remains a valuable but unexplored resource. Broomcorn millet could be used not only as a dryland crop but also as a crop in broader regions to support more water-efficient, sustainable agriculture,” wrote one of the scientific teams, led by Zhang Heng and Zhu Jiankang from Shanghai Center for Plant Stress Biology of Chinese Academy of Sciences, in Nature Communications.
The study detailed the assembly of a high-quality reference genome of the allotetraploid, 36-chromosome broomcorn millet, generated from SMRT Sequencing. Anchored to a high-density genetic map using Hi-C, the genome was assembled into 18 pseudo-chromosomes. A total of 305,520 transcripts were also de novo assembled from 241 Gb of mRNA-seq data.
Through this, they identified 220,000 genetic markers, 55,930 protein-coding genes and 339 microRNA genes.
Among the genes of particular interest were C4 photosynthesis genes. C4 plants like millet and its bioenergy crop relative switchgrass (Panicum virgatum) are more efficient in carbon fixation and in the use of water and nitrogen than their C3 relatives, such as rice, and much effort has been made to engineer C4 traits into C3 crops. But this requires a clear understanding of the molecular mechanism of C4 carbon fixation, and the Shanghai study authors are hopeful that their genome might provide some additional insights. Their analysis suggests that three different subtypes of C4 carbon fixation may coexist in broomcorn millet.
“Utilizing more than one decarboxylation mechanism can help cope with dynamic and fluctuating environments in the field,” they wrote.
They also traced NAD-ME metabolic pathways and ubiquitin-like proteins, which are thought to play a role in the plant’s efficient utilization of nutrients under stress conditions.
The Phylogeny of Paniceae
In a separate Nature Communications study, Lai Jinsheng, Shi Junpeng, et al., from the Center for Crop Functional Genomics and Molecular Breeding of China Agricultural University, Beijing, reported their genome assembly through a combination of PacBio sequencing, BioNano, and Hi-C (in vivo) mapping.
They created 18 “super scaffolds” covering ~95.6% of the estimated 887.8 Mb genome, and annotated 63,671 protein-coding genes.
“Mining the genome of broomcorn millet uncovered the genes potentially involved in both biotic and abiotic stress resistance in broomcorn millet,” the authors wrote.
They identified 493 genes containing an NB-ARC domain that may be involved in disease resistance, of which 20 genes (seven gene families) were specific to broomcorn millet, as well as 15 ABA or WDS (water-deficiency stress) responsive genes.
“Interestingly, four of these ABA genes were constitutively expressed with relatively high expressional level across all the samples we examined, even for the tissues or stages without salt or drought treatment,” the Beijing team added.
The team also delved into the evolution of the crop.
“By taking advantage of the high-quality assembly of broomcorn millet in this study, in combination with the newly published genomes of pearl millet and Dichanthelium oligosanthes, we were able to reconstruct the phylogeny in Paniceae,” the authors wrote.
Unlike maize, which experienced strong genome rearrangements and expansions after tetraploidization, broomcorn millet retained the majority of two copies of ancestral genes, kept the basal chromosome numbers (2n = 4x = 36) and experienced relatively weak genome expansion, likely resulting from its more “recent” tetraploidization than other Paniceae lines (within ~5.91 million years), the authors reported.
“The genome is not only beneficial for the genome-assisted breeding of broomcorn millet, but also an important resource for other Panicum species,” they wrote. “It will no doubt facilitate the comparative genomic research between Panicum and other crops.”
We’re seeing real progress in efforts to better characterize genomic population diversity. That’s particularly good news in light of mounting concerns about the implications of having genomic databases that over-represent individuals of European ancestry—a topic our CSO Jonas Korlach wrote about in Scientific American.
Among the latest developments is the announcement from the Tohoku Medical Megabank Organization at Tohoku University about the release of the Japanese reference genome, known as JG1. The assembly is an integration of three de novo assembled genomes of male Japanese individuals. For this work, scientists used SMRT Sequencing with orthogonal technologies to produce highly contiguous assemblies. This population-specific reference is a valuable new resource that should accelerate the application of precision medicine for people of Japanese descent.
The National Human Genome Research Institute (NHGRI) is also helping with population diversity initiatives by creating new funding opportunities. A new Human Genome Reference Program grant will go toward the development of 350 high-quality human reference genomes as well as toward establishing metrics to define what constitutes a high-quality genome assembly. Goals for the program include learning more about human variation and its relation to different populations and ancestries, and including population-specific haplotype reference sequences in future reference genome models to avoid analysis bias. NHGRI recently hosted a webinar describing the program and has linked to an FAQ along with the funding announcements. Applications are due April 2.
To learn more about long-read sequencing of human genomes, we’ll be hosting a workshop at the upcoming Human Genome Meeting 2019 in Seoul, South Korea, on April 24. PacBio speakers including Jonas Korlach, William Rowell, Gregory Concepcion, and Wilson Cheng will present useful information about comprehensive variant detection, de novo assembly and phasing of human genomes using highly accurate long reads. The workshop will also cover data generated on our new Sequel II System.
Not attending HGM 2019? We also have plenty of information about population genetics on our website.
For the thousands of scientists who attended The Plant and Animal Genome Conference in San Diego this January, the sentiment seemed to be “ask not if PacBio is for you, but how PacBio can work best for you.”
The answer that emerged during PacBio’s PAG workshop and subsequent SMRT Informatics Developers Conference was a complex one.
Recent developments, such as new chemistry, new SMRT Cells, the SMRTbell Express Template Prep Kit, and SMRT Link 6.0 software have already led to faster and easier library prep, longer reads with more data and reliability, better transcript characterization (Iso-Seq) and phasing (FALCON-Unzip) capabilities (discussed by PacBio principal scientist Liz Tseng), and deeper insight with less waiting.
As senior product manager Justin Blethrow laid out in his talk, “Sequence with Confidence – How SMRT Sequencing is Accelerating Plant and Animal Genomics,” the upcoming release of a new Express Template Prep Kit will make the faster library prep available for more applications, and the Sequel II System will provide eight times the amount of data while also integrating circular consensus sequencing. Also in the product roadmap for 2019 is a low DNA input protocol for limited samples and the smallest of organisms.
Big data from tiny samples
A talk at the PacBio PAG workshop by Andrew Clark of Cornell University gave attendees a sneak peek at one of these exciting opportunities. In a collaboration with Manyuan Long of the University of Chicago and Rod Wing of the University of Arizona, the evolutionary ecologist was able to use PacBio sequencing to create new genome assemblies of 10 Drosophila species, including de novo assemblies of two individual flies, using as little as 26 ng of gDNA. Clark was most curious about why D. virilis diverges so dramatically from other species in terms of heterochromatic regions with long sections of simple satellite repeats, which make up as much as 40% of the genome.
“These regions of the genomes are really a bear to work with,” he said. But with PacBio long-read sequencing, the assemblies “flew together nicely,” with almost whole chromosome arms assembling into single contigs. And patterns were readily apparent after taking a look at the raw reads.
“This method to develop whole genome sequencing from single individuals is terrifically exciting in terms of the kinds of new questions that we can generate and answer with those data,” Clark added.
The pangenome era
Other speakers at the PAG workshop heralded the dawn of the pangenome era and delved into detail about their work to create multiple references for plant and animal species.
Max Planck researcher Sonja Vernes, director of the Bat1K consortium, discussed her group’s ambitious efforts to sequence the genome of every living bat species.
Initial data show clear improvement in the quality of assemblies generated with long reads, she said. The PacBio assembly of the greater horseshoe bat (Rhinolophus ferrumequinum), for instance, contained just 679 contigs with an NG50 of 19.9 Mbp, compared with a previously posted assembly made up of 290,000 contigs with an NG50 of 0.01 Mbp.
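For readers less familiar with contiguity metrics, here is a minimal sketch of how N50 and NG50 are typically computed. The function name and the toy contig lengths are illustrative assumptions, not values from the Bat1K assemblies. N50 is the length of the contig at which the running total of sorted contig lengths reaches half of the assembly size; NG50 uses half of the estimated genome size instead.

```python
def nx50(lengths, genome_size=None):
    """Return N50 of the contig lengths, or NG50 when genome_size is given.

    Hypothetical helper for illustration only.
    """
    # N50 uses half of the total assembled length; NG50 uses half of the
    # estimated genome size.
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk contigs from longest to shortest until half the target is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the contigs cover less than half of the genome


# Toy contig lengths (arbitrary units), not real assembly data.
contigs = [50, 40, 30, 20, 10]
n50 = nx50(contigs)                     # half of 150 is reached at the 40-unit contig
ng50 = nx50(contigs, genome_size=200)   # half of 200 is reached at the 30-unit contig
print(n50, ng50)
```

Because NG50 is tied to genome size rather than assembly size, it allows a fairer comparison between assemblies of the same species that differ in total length.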
Isoform sequence (Iso-Seq) analysis has also provided a wealth of information about transcripts that differ between different sites throughout the body, enabling comprehensive genome annotation.
“We’re really missing out on this information if we don’t go in and collect the functional data to understand the gene structure,” Vernes said.
Kevin Fengler, of Corteva Agriscience, described his work with maize. As he pointed out, genome assemblies must be very accurate and robust to be research-ready, which is why he favors a combination of the latest PacBio technology and old-fashioned manual curation to elevate scaffolds to platinum-grade assemblies.
“Base pair error is not sequence diversity, and mis-assembly is not structural diversity,” Fengler said.
He described his workflow and scaffolding assembly across maize pangenome lines, then moved on to the bigger question: what now?
“Here’s really where the fun begins,” Fengler said as he went on to present some pangenome visualization tools, including TagDots, “rapid dot blots for the pangenome era,” and PANDA (PANgenome Diversity Alignments).
A magical world: The Sequel’s sequel
At SMRT Informatics, Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology and the DOE Joint Genome Institute, charted the evolution of plant sequencing over the last 10 years and gave a glimpse at its future using the Sequel II System.
A 2008 iteration of the soybean genome done with 15M Sanger reads cost 200 times the amount needed to create a more complete genome using 13 Sequel SMRT Cells in 2018, Schmutz said.
He provided more fun facts. Just how many plant genomes can one sequence with three Sequels and four full-time equivalent staff? Forty-three, Schmutz said — 6 outbred trees, 12 sorghums, 4 maizes, 4 mosses and 8 complex grasses — for a total of 48 Gb of completed genomes.
And his preview of the latest advances included a project to catalog somatic genetic and epigenetic mutations across 200-year-old poplar trees. Circular consensus sequencing (CCS) to generate HiFi reads on the new Sequel II allowed scientists to capture and correlate data from several sites along the trees, such as individual branches, as well as call SNPs, detect structural variants, and phase haplotypes.
“What can we do with the new Sequel IIs?” he asked the crowd. “Tackle outrageous sized plants, create high-quality pangenomes of species, use CCS for metagenomes, precise SNP and structural variation detection, phase haplotypes for alignable regions, and develop new hybrid, outbred, polyploid strategies.”
Other speakers at the half-day event also heralded the new HiFi paradigm, and the final session turned into friendly debates — about the ideal default parameter set for Iso-Seq to get the best bang for your buck, between PacBio scientist Liz Tseng and Roslin Institute bioinformatician Richard Kuo; and the speed of the CCS algorithm (used to generate HiFi data) and its feasibility in large-scale studies, between PacBio algorithm expert Jim Drake, HudsonAlpha’s Jeremy Schmutz and Sergey Koren of the National Human Genome Research Institute.
The next SMRT Scientific Symposium and Informatics Developers Meeting will take place May 7 – 9 in Leiden, Netherlands. Registration is now open.
PacBio posters from PAG can be viewed here:
- “Library Prep and Bioinformatics Improvements for Full-Length Transcript Sequencing on the PacBio Sequel System” – Michelle Vierra, et al.
- “A Low DNA Input Protocol for High-quality PacBio De Novo Genome Assemblies from Single Invertebrate Individuals” – Sarah B. Kingan, et al.
- “Haplotyping Using Full-Length Transcript Sequencing Reveals Allele-Specific Expression” – Elizabeth Tseng, et al.
- “Single Molecule High-Fidelity (HiFi) Sequencing with >10 kb Libraries” – Paul Peluso, et al.
The PacBio team was honored to have the opportunity to give several talks at this year’s Advances in Genome Biology & Technology conference. If you weren’t able to be there, we’ve got you covered with videos and highlights.
In a plenary session, Marty Badgett, senior director of product management, gave attendees a look at the latest results using the HiFi reads with the circular consensus sequencing (CCS) mode as well as a sneak peek at data from our soon-to-be-released Sequel II System. As he demonstrated, HiFi reads cover the same molecule many times, delivering high consensus accuracy (Q30 or 99.9%) at long read lengths.
This mode now works with fragments as long as 20 kb, as we showed in a recent preprint. Badgett offered several examples where this is useful, such as pharmacogenomic gene analysis and resolving metagenomic communities. He also updated attendees on our Iso-Seq method, which can now segregate transcripts into haplotype-specific alleles using a new tool called Iso-Phase.
Of course, the big highlight of the talk was a look at early data from the Sequel II System, which delivers approximately eight times the data of the Sequel System. Badgett showed that read length distributions and many other factors are essentially the same as the current system, but that the new model has improved raw read accuracy, taking eight passes around a molecule instead of ten to get to Q30 accuracy in CCS mode. He also presented results from Iso-Seq analysis, plant genome assembly, and continuous long-read mode. The Sequel II System is in five early access labs and will be commercially released in the second quarter of this year.
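The Q values quoted in these talks are Phred-scaled quality scores. As a quick illustration, the sketch below converts between a Phred score and per-base accuracy using the standard definition Q = -10 · log10(error probability). The function names are our own, chosen for this example.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    return 1 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0 to 1, exclusive of 1) to a Phred score."""
    return -10 * math.log10(1 - accuracy)


# Q20, Q30 and Q40 correspond to 99%, 99.9% and 99.99% accuracy.
for q in (20, 30, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.4%}")
```

Each 10-point step on the Phred scale is a tenfold drop in error rate, which is why moving from Q30 to Q40 matters more than the small change in the percentage suggests.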
Later, CEO Mike Hunkapiller and Principal Scientist Jason Underwood gave talks in the much-anticipated technology session. Hunkapiller focused on the use of HiFi reads for comprehensive genomic analysis, offering examples such as the sequencing of a Genome in a Bottle reference sample, which concluded with Q48 accuracy, 18 Mb contigs, and clearly phased haplotypes.
That work also included variant analysis. Hunkapiller noted that SMRT Sequencing delivered good recall and precision for deletions and insertions, with the best performance obtained using DeepVariant from Google to model the data. The results showed that several seemingly high-confidence variant calls from previous analyses of the same sample were incorrect, and they added a significant number of new variants to the catalog.
Underwood spoke about single-cell isoform sequencing (scIso-Seq), focusing on a collaborative project with the labs of Evan Eichler and Alex Pollen. For this effort, scientists used Drop-seq sample prep and then loaded cDNA products onto the Sequel System. Results from a barnyard experiment using mouse and human cells as well as from cerebral organoids showed that this approach could deliver cell type-specific gene expression data. Underwood also presented data from the Sequel II System comparing chimp and human organoids, resulting in information for about 14,000 unique genes with important insights for post-transcriptional gene regulation, transcription start sites, and more.
Finally, Primo Baybayan, our Director of Applications, presented a poster entitled ‘A high-quality de novo genome assembly from a single mosquito using PacBio sequencing.’ In the poster, a modified SMRTbell library construction protocol was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System, generating, on average, 25 Gb of sequence per SMRT Cell with 20-hour movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes are present and full-length). This new low-input approach now puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
Many thanks to the AGBT organizers for inviting our team to present this exciting science!
You may have missed last week’s Advances in Genome Biology & Technology conference in sunny Marco Island, Fla., but you definitely shouldn’t miss the two posters presented there by Justin Zook and Justin Wagner from NIST’s Genome in a Bottle (GIAB) consortium.
The GIAB team has made critical progress in generating high-quality human genome reference materials and benchmarks that have helped to improve the accuracy and reproducibility of variant calling across laboratories. The latest results advance that work with an expansion of the benchmark set to include additional small (single-nucleotide variant and indel) variants and — for the first time — large (structural) variants.
The poster on structural variants (“A new benchmark for human germline structural variant calls”) describes a benchmark set of 11,869 insertions and deletions ≥50 bp and corresponding benchmark regions that span 2.69 Gb (89%) of the human genome. The set is derived from multiple technologies and is validated by manual curation and consistent inheritance in a mother-father-son trio. With tools like Truvari, the benchmark set provides a direct measure of false positives and false negatives in individual variant callsets. This will enable the improvement of structural variant calling software, just as the small variant benchmark did for single-nucleotide variants and indels.
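To make the idea of measuring false positives and false negatives concrete, here is a minimal sketch of the summary statistics usually derived from a benchmark comparison. It assumes you already have counts of true positives, false positives, and false negatives from a comparison tool. The function name and the example counts are hypothetical and are not taken from the GIAB poster or from Truvari output.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from benchmark comparison counts.

    tp: calls that match the benchmark set
    fp: calls inside benchmark regions that are absent from the benchmark set
    fn: benchmark variants that the callset missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
precision, recall, f1 = benchmark_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision drops as false positives rise and recall drops as false negatives rise, so tracking both shows whether a change to a caller trades one kind of error for the other.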
In a second poster (“Expanding the Genome in a Bottle benchmark callsets with high-confidence small variant calls from long and linked read sequencing technologies”), the GIAB team discusses expanding and improving their small variant benchmark set by integrating linked- and long-read technologies, including PacBio circular consensus sequencing (CCS) reads. CCS reads — described in a recent preprint — have similar base accuracy to typical NGS reads but are much longer, and thus map unambiguously to repetitive or low-complexity regions of the genome that are not accessible with short NGS reads.
Integrating linked reads and PacBio CCS reads expands the region over which GIAB can confidently call variants by more than 84 Mb (>3%), and detects an additional 156,000 variants (>4%), “mostly in regions difficult to map with short reads,” the authors report. In a list of medically relevant genes, this new benchmark adds 418 more variants, an increase of about 5%.
For more information about GIAB, check out the public workshop being held March 28-29 at Stanford University. As indicated on both posters, the GIAB team welcomes new collaborators interested in the accurate and complete characterization of human genomes.
A new publication in the Journal of Human Genetics describes an impressive effort to identify the pathogenic variant causing progressive myoclonic epilepsy in two siblings. The scientific team used SMRT Sequencing to discover a 12.4 kb structural variant in a repetitive, GC-rich region after several other methods — including whole exome sequencing — failed to find the answer.
The paper comes from lead author Takeshi Mizuguchi, senior author Naomichi Matsumoto, and collaborators at Yokohama City University, Aichi Prefectural Colony Central Hospital, and other institutions in Japan. As the authors note, whole exome sequencing has delivered strong results for many cases that would otherwise have gone undiagnosed; for progressive myoclonic epilepsy in particular, the diagnostic yield is 31%. “However, the remaining 69% of cases present a genetic challenge,” the scientists report. “These findings suggest that certain types of pathogenic variation evade detection by the currently available genetic analysis.”
In this project, researchers were stumped by two siblings — a 20-year-old female and a 13-year-old male — who both showed signs of a severe neurodegenerative condition. While a genetic cause was highly suspected, trio-based whole exome sequencing and a subsequent search for causative single nucleotide variants turned up no leads. The scientists then deployed SMRT Sequencing, focusing on structural variants ranging in size from 50 bp to 50 kb, especially in regions that are challenging for short-read platforms to sequence. They used the Sequel System to generate low-coverage whole genome sequencing of an affected sibling and three unrelated controls.
Analysis of the 6-fold coverage of the case sample with PacBio’s pbsv software identified more than 17,000 structural variants — including more than 7,200 deletions and nearly 10,000 insertions. The scientists filtered out structural variants seen in the control samples to quickly narrow the list of potentially causal candidates, and whittled the list further by selecting candidates that impact a coding gene. Fifty variants remained, five of which affected genes associated with an autosomal recessive phenotype. “Surprisingly, a 12.4-kb deletion call spanning the first coding exon of CLN6 was found,” the team writes. Biallelic mutations in CLN6 cause neuronal ceroid lipofuscinosis, a disease with clinical features that match those of the two siblings. Additional Southern blot and RT-PCR analysis validated the deletion and demonstrated that it was pathogenic.
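The filtering strategy described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors’ pipeline or the pbsv interface: the `SV` record, the 500 bp breakpoint tolerance, the gene names, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SV:
    """Hypothetical structural variant record (not a pbsv data type)."""
    chrom: str
    start: int
    end: int
    sv_type: str                # e.g. "DEL" or "INS"
    gene: Optional[str] = None  # coding gene affected, if any


def same_event(a, b, slop=500):
    """Treat two calls as the same event if type and breakpoints roughly agree.

    The 500 bp tolerance is an assumption made for this sketch.
    """
    return (a.chrom == b.chrom and a.sv_type == b.sv_type
            and abs(a.start - b.start) <= slop
            and abs(a.end - b.end) <= slop)


def filter_candidates(case_svs, control_svs, recessive_genes):
    """Apply the three filters in order and return the remaining candidates."""
    # 1. Drop variants also seen in unrelated controls.
    novel = [sv for sv in case_svs
             if not any(same_event(sv, c) for c in control_svs)]
    # 2. Keep variants that impact a coding gene.
    coding = [sv for sv in novel if sv.gene is not None]
    # 3. Keep genes associated with an autosomal recessive phenotype.
    return [sv for sv in coding if sv.gene in recessive_genes]


# Toy data with made-up coordinates and gene names.
case = [
    SV("chrA", 1_000, 13_400, "DEL", "GENE_A"),  # absent from controls
    SV("chrA", 50_000, 50_300, "DEL", "GENE_B"), # also seen in a control
    SV("chrB", 7_000, 7_000, "INS"),             # intergenic
]
controls = [SV("chrA", 50_020, 50_310, "DEL", "GENE_B")]
candidates = filter_candidates(case, controls, recessive_genes={"GENE_A"})
print(candidates)
```

In practice the matching step is the delicate part, because breakpoints from low-coverage long reads rarely agree to the base, which is why the sketch compares calls with a tolerance rather than by exact coordinates.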
With this finding in hand, the team went back to try to understand why the deletion had proven so elusive earlier. Two exome analysis methods “completely missed the homozygous CLN6 deletion … probably due to the scanty read coverage against CLN6 exon 1 with high GC content (77.6%) even in controls,” the scientists report. “By contrast, PacBio long reads showed uniform coverage … which improved the variant detection in GC-rich regions containing multiple repetitive elements.” Even with only three SMRT Sequencing reads of the CLN6 region, “the long sequences of the reads conferred excellent mappability and ensured the robust detection of [structural variants],” the team adds.
The authors encourage other scientists to consider using long-read sequencing for similar cases where exome analysis reveals no pathogenic variants. They also call for the development of a robust structural variation database, along the lines of what gnomAD does for small variants. “For the purpose of reducing the number of candidate of diseases-causing mutations, it would be extremely beneficial if a public database for [structural variants] were available,” they note.
Please join us in congratulating Kristen Sund from Cincinnati Children’s Hospital Medical Center for winning our 2018 Structural Variation SMRT Grant Program!
Her proposal to use SMRT Sequencing to pinpoint the genetic mechanism responsible for neurological disease in patients with complex structural rearrangements definitely captured our attention. We caught up with Kristen to learn more about her background, her research, and how she hopes to use the data generated through this grant.
How did you get into this field?
I have always had a very strong interest in research and patient care, so I decided to get training as a genetic counselor and to get my PhD in molecular and developmental biology. I guess it makes sense from there that I am constantly looking for ways to use the latest technologies to find the genetic cause for disorders that were previously undiagnosable.
What does your day-to-day work look like?
Right now, I’m a laboratory fellow in a combined program for cytogenetics and molecular genetics at the ABMGG Laboratory for Genetics and Genomics. My activities focus on learning everything from the wet lab to analysis to quality control to interpretation for clinical genetic testing. What I really love about the combined approach to molecular genetics and cytogenetics is that it allows us to fully integrate what we’re doing for a particular case and focus on finding an answer. It feels more holistic.
What’s the background behind your SMRT Grant proposal?
When I was a genetic counselor in the lab, I was involved with research projects that focused on using the latest genetic technologies. At the time we were not offering clinical whole exome sequencing and there was a strong interest in using the technology on a research basis for some families that hadn’t been diagnosed. I wound up developing an analysis algorithm which I’m sure is very primitive by today’s standards, but at the time it got the job done. We actually solved a number of those cases. I loved that work — getting to know the families and being able to find them an answer in some cases. In my lab now, we do offer whole exome sequencing, but I began wondering what else we could do with other technologies that wasn’t possible with exome sequencing. How could we use long-read sequencing to search for answers for cases that are undetectable with other technologies?
What is it about these cases that makes them challenging to solve with other approaches?
Here’s one example of a case that we’re planning to submit for long-read sequencing. This patient has a neurologic phenotype and a known chromosome abnormality that is a little bit unusual because it involves two chromosomes and four chromosome breaks from an insertion and a translocation. The patient has had extensive follow-up testing including a SNP microarray and a couple of NGS panels, all of which came back normal. I’m convinced that one of these breakpoints holds the answer. I’ve been able to estimate the location of the breakpoint and some genes that might be in the region, but all we can do is guess until we can get a higher resolution look at the breakpoints and hopefully find a gene of interest.
What does it mean to long-undiagnosed patients to finally get an answer?
Families use the information in different ways. One family that comes to mind started a support group through Facebook. This child was a teenager, so this family had been dealing with this her whole life, but they didn’t know what to expect for her prognosis or how to explain it to other people. For them, it was huge to get an answer. There are no real treatment options, but it meant so much to the family to find out what to expect.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor, the University of Minnesota Genomics Center, for supporting the 2018 Structural Variation SMRT Grant Program!
For Research Use Only. Not for use in diagnostic procedures.
With their large brains, sophisticated sense organs and complex nervous systems, cephalopods could teach us a thing or two about learning, memory, and adaptability. But despite their evolutionary, biological, and economic significance, their genome information is still limited to a few species.
To bridge this gap, a team of Korean scientists has assembled the genome of the common long-arm octopus (Octopus minor) using PacBio technology to sequence both the DNA and RNA of the emerging model species.
Found in Northeast Asia, particularly in coastal mudflats of South Korea, China, and Japan, O. minor has become a major commercial fishery product with a high annual yield. The species is also a promising organism for studies of the molecular basis of plasticity and of adaptation to the harsh environmental conditions in mudflats that are subject to temperature changes, steep salinity and pH gradients, varying oxygen availability, wave action and tides.
With this in mind, Hye Suck An from the National Marine Biodiversity Institute of Korea and colleagues from Korea Polar Research Institute, Chungbuk National University, and the University of Science & Technology of Yuseong-gu, Daejeon, created a 5.09 Gb assembly of the challenging genome, nearly half of which was composed of repeat elements, and identified 30,010 genes.
Additionally, they annotated the genome using the Iso-Seq method on pooled RNA from thirteen organs. This enabled them to identify characteristics like intron length, protein-coding genes and transposable elements.
Together, these data enabled them to elucidate the molecular mechanisms underlying O. minor’s adaptations.
As described in a paper in GigaScience, the team also compared their results with the published genome and multiple transcriptomes of the California two-spot octopus (Octopus bimaculoides) to study their evolution.
“We discovered that they evolved recently and independently from the octopus lineage during the successful transition from an aquatic habitat to mudflats,” the authors stated. “We also found evidence suggesting that speciation in the genus Octopus is closely related to the gene family expansion associated with environmental adaptation.”
Curiosity of the Catfish
A reference genome for another important aquaculture species, the yellow catfish (Pelteobagrus fulvidraco), has also been created by scientists in China.
Popular in fisheries of Southern China, the species has suffered from germplasm degeneration and poor disease resistance. Similar to other aquaculture species, like the previously featured tilapia, sex is an important commercial trait, with adult males growing two- to three-fold bigger than females.
In order to decipher the economic traits and sex determination of the species, the multi-institutional team constructed “the first high-quality chromosome-level genome assembly” of the yellow catfish, using PacBio sequencing as the base technology to build long contigs.
As reported in GigaScience, they annotated the assembly, identifying 24,552 protein-coding genes, and explored the phylogenetic relationships of the yellow catfish with other teleosts. They found almost 2,000 gene families that were expanded in the yellow catfish, mainly enriched in immune system, signal transduction, glycosphingolipid biosynthesis and fatty acid biosynthesis, providing a rich path for future studies.
“We believe that the high-quality reference genome generated in this work will … accelerate the development of more efficient sex control techniques and improve the artificial breeding industry for this economically important fish species,” the authors wrote.
Getting the word out about your services is a surefire way to get more interest — and ultimately more projects — into your pipeline. The good news is, it doesn’t take a marketing specialist or tens of thousands of dollars to get started. You’d be surprised at some of the big gains you can get with a little outlay.
There are many ways to boost your name in the genomics service provider world at any budget level. Here we highlight our top three. Start with one and work up from there!
No matter your personal feelings about social media, it’s been shown again and again to be a great way to engage potential customers. Platforms like Twitter and LinkedIn give you an opportunity to broadcast promotional pricing, exciting results, or new services you’re excited about for exactly zero dollars and just a little bit of your time. Our advice to you is to get an account on both platforms, put some effort into a good description of your products and services, and spend 10-20 minutes a day interacting with others on the sites. Not only will you be able to reach more people yourself, but you’ll start to notice the upcoming movers and shakers, be able to answer questions directly, and see trends in the markets you’re interested in penetrating.
Some best practices for social media include:
- Be concise – Cut to the point right off the bat. After all, you only have 280 characters.
- Tag people – Did you get a great result with a collaborator or think a particular scientist would be interested in the result? Tag them! People are more likely to engage with posts in which they are tagged.
- #hashtags – Whether witty or targeted, including a simple tag that is specific to your core facility encourages people to promote your services for you. Just take a look at #PoweredByPacBio to see what we mean.
- Include links – References and links directly to the information you’re looking to spread encourage people to click on the content.
Get an email marketing service and keep in touch with your customers and prospects. Every inquiry, project, and person who comes by your booth at conferences is a lead – and leads need to be nurtured. Using a simple email marketing service like MailChimp or Campaign Monitor can make keeping in touch with them feel like less of a burden, and gives you an opportunity to stretch your creativity muscles. You can do individualized follow-up emails to prospects or start a monthly newsletter to share your current services and capabilities. And if you take the time to develop some content (a case study, a recorded webinar, or a brochure) you can create an automated drip campaign to send out emails at specific intervals, delivering your relevant content directly to prospects’ inboxes.
Some best practices for email marketing include:
- Catch their eye – A well thought-out subject line can draw someone in who may otherwise skip your email. Be sure to entice them with an offer!
- Keep it simple – With our unlimited access to information via the internet, it’s easy to get overwhelmed by content. Be sure to hone the message you’re trying to send and only include content in your emails that strengthens that message.
- Repetition is key – It takes anywhere from 3 to 5 interactions with a lead to turn them into a customer, so don’t feel bad about sending multiple emails. Just make sure to spread them over several weeks.
Webinars, or video seminars, give you unique opportunities to speak with hundreds of people from all over the world in one place without the cost of an event sponsorship or plane ticket. You can ask one or several of your customers to present exciting research that was enabled by your services, and then give a quick overview of the products and services you have available. There are many options when it comes to hosting webinars. Publishers, such as Nature, give you the benefit of access to their marketing team with promotion, logistics, and lead capture, at a price of about $10,000. More reasonably priced webinar hosting services with monthly fees include GoToWebinar and Zoom. And if you’re really on a budget and don’t expect more than 100 attendees, there are free services like ezTalks.
Some best practices for webinars include:
- Be prepared – Make sure you have your speakers and content ready long beforehand, so you can promote the event and capture leads via a registration page.
- Promote the event – A webinar is only as successful as the number of people who attend (or sign up). You will want to put a little bit of effort into promoting and reminding your target audience about the webinar via email and social media.
- Follow up – So you had 75 attendees, and everything went great? Awesome! Now it’s time to follow up with the folks in attendance (and registrants) to get them into the projects pipeline.
We hope this list was helpful and inspires your marketing strategy for 2019! Remember to start small with measurable results and, most importantly, have fun with it.
In an effort to produce a comprehensive list of structural variants in the human genome, scientists from the University of Washington, the University of Chicago, Washington University, and Ohio State University sequenced 15 human genomes and have now released the results of their in-depth analysis.
The Cell publication, “Characterizing the Major Structural Variant Alleles of the Human Genome,” comes from lead authors Peter Audano and Arvis Sulovari, senior author Evan Eichler, and collaborators. The data generated by this work “provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity,” the authors report.
The analysis represents remarkable genomic diversity. The team used the PacBio RS II and the Sequel System to produce high-coverage, long-read sequence data for 11 diploid genomes, primarily sourced from HapMap samples and spanning Yoruban, Gambian, Luhya, Han Chinese, Vietnamese, Puerto Rican, Colombian, Peruvian, Telugu, Northwestern European, and Finnish ethnicities. They also used existing PacBio genome assemblies for two hydatidiform moles (CHM1 and CHM13) as well as the recently published Korean (AK1) and Chinese (HX1) genomes.
From this wealth of long-read data, the scientists then resolved and annotated nearly 100,000 common structural variants (defined as insertions, deletions, or inversions at least 50 bp long). Of those, more than 2,200 variants were shared by all genomes analyzed and another 13,000 were detected in most genomes — “indicating minor alleles or errors in the reference,” the team notes. Most of the variants were not reported in previous studies that relied on short-read technology. “Importantly, the breakpoints and content of these major alleles are now resolved at the single-base-pair level,” the scientists add, “providing the requisite sequence specificity on a GRCh38 coordinate system to begin to develop not only alternate haplotypes, but also to develop a more comprehensive graph-based assembly representation of the human genome.” The authors also noted that there are more structural variant alleles to discover, estimating that adding 35 more genomes (50 total) would increase the number of alleles by 39%.
The scientists also make clear that this kind of study would not have been possible even a few years ago. “Recent advances in sequencing technology have now allowed us to systematically whole-genome shotgun (WGS) sequence large stretches (>10 kbp) of native DNA without the need to propagate clone inserts in E. coli,” they explain. “This is particularly advantageous for structural variation since the long reads provide the necessary context to anchor and sequence resolve most structural variants (SVs) irrespective of sequence composition.” In addition, the analysis determined that variants were more likely to be found in GC-rich or GC-poor sequences, which means they “were likely problematic to clone, sequence, and assemble using large-insert BAC clones” during the Human Genome Project, the scientists add.
The results of this impressive work now comprise the first database of structural variants in control individuals sequenced with long reads, making it a valuable resource for researchers seeking to discover pathogenic structural variants associated with particular diseases. “The sequences we now add to the human genome provide the necessary substrate to discover new disease associations, especially as they relate to repeat instability,” the authors conclude.
There are more structural variants waiting to be found in human genomes. If you’re interested in related research, use our project calculator to estimate the time and materials needed and to get suggested study designs.
The recent Nature paper describing the first evidence of somatic gene recombination in the human brain has been getting so much attention that we went back to the lab’s PI to learn more. Jerold Chun is Professor in the Degenerative Diseases Program and Senior Vice President of Neuroscience Drug Discovery at Sanford Burnham Prebys Medical Discovery Institute in La Jolla, Calif. He spoke with us about this remarkable discovery in the APP gene in patients with sporadic Alzheimer’s disease, the decades-long hunt for somatic recombination in genes active in the brain, and how SMRT Sequencing made a difference.
Previous efforts to find somatic recombination in the human brain failed. Why did you continue the hunt?
This goes way, way, way back. Anyone who knew about V(D)J recombination that was originally reported in the ’70s and knew something about the nervous system has been intrigued by that possibility. It was the seed for trying to identify some type of similar recombination in the brain. But back then ideas were very vague; it was simply trying to take what we knew about the immune system and projecting what might occur in the nervous system. Nevertheless, the concept remained compelling and our studies on genomic mosaicism that occurred in the interim supported something interesting going on. As it turns out, the thought was good but the details were quite different from what we originally thought. We’re now at the point where we can talk about it not as a phantom but as reality.
After all those years of looking for this evidence, what was it like to finally find it?
You kind of scratch your head about the vagaries of science. This is a concept that was written off by almost any sane scientist years ago because so much effort had gone into chasing it and nothing emerged.
In the paper, you noted that short-read sequencing had been used for these efforts in the past but wasn’t successful. Why was that?
We had originally thought that if we could use single-cell technologies which rely on short-read sequencing, it would open this area up. The challenge is that the resolution of the sequencing technology is not sufficient even to interrogate the wild type locus. Even under the best circumstances we’re pretty much around 1 million base pairs. That’s not going to allow us to see 300 kilobases, which is where the APP locus is. That was a major limitation. Also, most short-read sequencing approaches require mapping to a reference genome. If there were inversions, insertions, or deletions, they may well be missed or be filtered out because they don’t map to what was expected in the reference. As soon as PacBio came onto the scene for our work, it just became absolutely clear that this was the way to pursue it so we could look at the complete sequence of what we now know are variants.
How did your team use SMRT Sequencing for this project?
There’s a really cool and special kind of sequencing with PacBio — circular consensus sequencing, or CCS. If you have a small enough piece — say, in the 3 kb to 5 kb range — the polymerase can go around and around and around the template. As a result, you can get many, many reads of the same template, so you can line those up and take the consensus read by looking at which of the residues show up most often. This is a way to get around the inherent polymerase error rates. In so doing, you get enormously high Phred scores as well as certainty levels. I think in this case we had a median Phred score of around 93 and a certainty of 99.999999%. It was actually approaching Sanger sequencing levels of certainty.
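The consensus idea described above can be illustrated with a short Python sketch. This is a toy illustration under stated assumptions, not PacBio's CCS algorithm: it assumes the passes around the template are already aligned to the same length, and it uses a simple per-position majority vote where production software uses a probabilistic model. The function names (`consensus`, `phred_to_accuracy`, `accuracy_to_phred`) are hypothetical and chosen only for this example. The Phred conversion uses the standard definition, in which a score Q corresponds to an error probability of 10^(-Q/10).

```python
from collections import Counter
import math


def consensus(passes):
    """Majority-vote consensus over pre-aligned, equal-length passes.

    Assumption: each string is one pass around the same circular template,
    already aligned so that position i means the same base in every pass.
    Ties go to the base seen first at that position.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")
    # zip(*passes) yields one column of bases per template position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (strictly below 1.0) to a Phred score."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Five noisy passes over the same six-base template; each error
    # appears in only one pass, so the vote recovers the template.
    passes = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC", "AGGTAC"]
    print(consensus(passes))
    print(phred_to_accuracy(30))
```

Because each pass makes its errors at different positions, the vote at any one position is dominated by the correct base, which is why more passes yield higher consensus quality. On the Phred scale, every additional 10 points removes another factor of ten from the error probability, so Q30 corresponds to 99.9% per-base accuracy.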
In the publication you speculated that HIV antiretroviral therapies might be used for patients with sporadic Alzheimer’s disease. Do you see that as near-term or will it take a long time to assess the possibility?
I think this is now. What we need to do is convince the clinical community to embark on it. The epidemiological signals are some of the most compelling of any that one could hope for. The total number of individuals in the United States who have HIV, are being treated with these antiretrovirals, and are at risk of Alzheimer’s disease because they are 65 or older is more than 120,000. The projections for developing Alzheimer’s in that age group is about 3% to 10%. But in 2016 the first reported case of an HIV patient with Alzheimer’s appeared in the literature, and as of now that’s the only case. I think it would be to everyone’s benefit to look at whether we can recapitulate that signal in a controlled, prospective clinical trial. Importantly, these are FDA-approved agents, some of which have been in humans since the 1980s, and thus there is sufficient proven safety to use these agents over long periods of time. Based on the science, we now have an explanation for why this might work.
This discovery must open new doors for your lab. What’s next?
There’s a new universe that’s been accessed here. It should impact both other forms of Alzheimer’s disease as well as other brain diseases and perhaps even other diseases that involve cells with a long life span. I think we’re in a position to search for and test whether gene recombination producing genomic cDNAs are more prevalent and involving other genes, and PacBio is certainly going to be a big part of that analysis.