This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Traditional RNA-Seq is done by fragmenting cDNA, and then sequencing the fragmented reads with paired-end sequencing. The problem comes when trying to identify the full-length isoform during assembly. This is computationally challenging, and sometimes intractable.
The solution? Long-read isoform sequencing, according to PacBio Principal Scientist Elizabeth Tseng and PacBio user Gloria Sheynkman, a research fellow at Dana-Farber Cancer Institute. The two recently participated in a webinar, sharing their experiences using PacBio’s Iso-Seq method.
Tseng started by explaining the method and some of its applications.
“In contrast to traditional RNA-Seq, the Iso-Seq method produces full-length cDNA, and using the PacBio long, accurate reads, can sequence the full transcript. No assembly is required,” Tseng said.
She also discussed some of the bioinformatics tools available, including SQANTI2, which can be used to classify full-length transcripts against annotations such as GENCODE, and as a quality control tool.
The Iso-Seq method in action
At Dana-Farber, Sheynkman uses the Iso-Seq method to characterize cancer cells and create complete, accurate transcriptomes.
In one example, she ran five breast cancer cell lines and eight melanoma samples, pooling each sample set into one library run on a single SMRT Cell 8M. Each run achieved around 6 million polymerase reads, with an average base yield of 300 Gb and an average polymerase read length of ~50 kb. Sequencing was performed by Maryland Genomics, a PacBio certified service provider.
With the improvements in the chemistry of the new Sequel II System, Sheynkman said she has been able to capture a much wider range of transcript lengths, without having to do size selection.
“Overall, we’re really detecting a much larger range, with cDNA molecules up to 6 and 7 kb,” she said.
For both breast cancer and melanoma, she obtained about 14,000 unique genes and 11,000 unique isoforms, around 30% of which were novel. And each SMRT Link Iso-Seq job was completed in just 6-9 hours, she noted.
Towards accurate isoform quantification
But the most promising application for Sheynkman is isoform quantification.
“I think this is a really important goal for the field,” she said. “To know how many copies of each transcript is expressed in the cell will really open up a lot of avenues in biomedical research, such as having consistent biomarkers, understanding disease mechanisms, and even just fundamental biological understanding.”
While short-read sequencing can achieve gene quantification with reliable results, the complexity of isoform structure requires more comprehensive coverage.
“Isoform quantification methods are very dependent on having accurate transcript models,” Sheynkman said.
The improved depth of coverage and reduced bias in sampling full-length reads on the new Sequel II System should mitigate these limitations, Sheynkman said. And the high technical reproducibility achieved by PacBio is another big strength, she added.
Targeted or whole transcriptome sequencing?
Sheynkman further discussed cases in which targeted isoform sequencing may be preferred over whole transcriptome sequencing. By targeting only genes of interest, rare isoforms could be identified with low to moderate sequencing effort. Multiplexing could further reduce the cost and increase sample size, which could be useful for applications such as biomarker discovery and validation.
Sheynkman also provided a new solution to a common problem when doing targeted sequencing: probes. ORF Capture-Seq is designed as an easy, versatile option to make capture probes directly from available clones/PCR product within a single day, using low-cost molecular biology reagents. It also allows for the generation of many complex probe sets tailored to different genes, Sheynkman said. The complexity of the probe sets allows you to target anywhere from 1 to 1,000 genes.
The webinar also included a lively Q&A that is well worth the listen.
We were delighted to host an educational workshop at last month’s annual meeting of the American Society of Human Genetics (ASHG), where we had the opportunity to feature talks from two customers as well as an overview of SMRT Sequencing. If you couldn’t attend, check out the videos or read the highlights below.
Emily Hatas, our director of business development, kicked things off with a look at how SMRT Sequencing has evolved over the years. Compared to the first instrument we offered, the Sequel II System represents a 100-fold improvement in read length and a 10,000-fold improvement in throughput. As of last month, customers were averaging about 160 Gb per SMRT Cell, a yield more than 10 times higher than that of the Sequel System.
Most of the presentation focused on applications in human genome analysis. High-throughput structural variant detection, which makes use of continuous long-read (CLR) sequencing, is well-suited to population studies and can be run at a cost of about $670 per sample when running two samples on each SMRT Cell. Comprehensive variant detection, which uses HiFi sequencing to make multiple passes around each molecule for optimal accuracy, is great for disease research, particularly for solving rare diseases, and costs about $2,600 per sample, assuming each library uses two SMRT Cells. Finally, de novo assembly of reference genomes should also be based on HiFi reads, Hatas told attendees, since it achieves comparable contiguity to CLR mode with about six times higher accuracy. In addition, HiFi data cuts analysis time in half and generates much smaller files, making de novo assembly more scalable.
Next up was Naomichi Matsumoto from Yokohama City University to speak about the use of SMRT Sequencing to solve Mendelian diseases. He shared the story of how his lab discovered a 12.4 kb structural variant that’s responsible for progressive myoclonic epilepsy in two siblings. The variant was in a repetitive, GC-rich region, which was why previous attempts to find it had failed. With low-coverage whole genome sequencing on the Sequel System, his team identified the variant and later confirmed that it was causal.
Matsumoto also reported progress in understanding repeat expansion disorders — many of which have neurological components — by pairing SMRT Sequencing with new analysis tools designed to highlight repetitive areas. In one example, his team was able to distinguish between the smaller number of repeats associated with healthy controls and the larger numbers associated with symptomatic patients.
The final talk came from Shawn Levy of the HudsonAlpha Institute for Biotechnology and the recently spun out services lab, now known as HudsonAlpha Discovery, which is a division of Discovery Life Sciences. He offered a look at his team’s early access experience with the Sequel II System, which was so successful that the research institute now has four of the instruments.
His data showed the increasing output of the system over time, as well as yield increases from the HiFi method. Levy noted that accuracy improves with each pass around the molecule, but reaches a plateau at the tenth pass or so. For Iso-Seq experiments, the team saw a significant improvement in yield from the Sequel System to the Sequel II System. Levy also shared hot-off-the-presses data from a project designed to determine the quality of Iso-Seq reads that can be gleaned from FFPE samples. The longer reads made possible with this approach don’t overcome the highly fragmented DNA and RNA coming from the samples, Levy said, but they definitely improve biological resolution and enable the characterization of higher molecular weight RNA that’s present in the samples. The project required a modified Iso-Seq protocol, which is still being optimized for best performance. While conventional approaches are evaluated based on how many 200-nucleotide reads they generate, the SMRT Sequencing method resulted in an average length of 435 bases.
Levy noted that his team also uses long-read sequencing for targeted sequencing applications associated with confoundingly homologous regions and for analyzing complex rearrangements in cancer. Going forward, they will also be sequencing about 7,000 genomes using long-read WGS for the All of Us Research Program to increase discovery of structural variants.
We’d like to thank all of the ASHG attendees who made our workshop such a success! If your research includes human genetics, please consider applying for our 2019 Human Genetics SMRT Grant Program. The winner will receive complimentary sequencing from the HudsonAlpha Genome Sequencing Center of up to 12 SMRT Cells. The deadline to apply is November 22, 2019.
There’s the genome, the transcriptome, the microbiome… and now the NLRome?
Breeders and pathologists have long been interested in uncovering the secrets of plant immunity, and much of their attention has been focused on receptors that can activate immune signalling: cell-surface proteins that recognize microbe-associated molecular patterns (MAMPs), and intracellular proteins that detect pathogen effectors, including nucleotide-binding leucine-rich repeat receptors (NLRs).
Hundreds of NLR genes can be found in the genomes of flowering plants. They are believed to form inflammasome-like structures, or resistosomes, that control cell death following pathogen recognition, and are being investigated as candidates for engineering new pathogen resistances.
For these reasons, scientists are keen to create inventories of NLR genes at different taxonomic levels. But their efforts have been hindered by the extraordinarily polymorphic nature of the gene family, patterns of allelic and structural variation, and clusters with extensive copy-number variation.
Two research teams have successfully overcome these challenges by combining resistance gene enrichment sequencing (RenSeq) with SMRT Sequencing.
In a recent paper in Cell, a multi-institutional team led by Felix Bemm and colleagues at the Max Planck Institute for Developmental Biology in Germany detailed their creation of a nearly complete species-wide pan-NLRome in Arabidopsis thaliana.
The sequences they obtained allowed them to define the core NLR complement, as well as to chart integrated domain diversity, describe new domain architectures, assess presence or absence of polymorphisms in non-core NLRs, and map uncharacterized NLRs onto the A. thaliana Col-0 reference genome.
“Reference genomes likely include only a fraction of distinct NLR genes within a species, which in turn has made it impossible to obtain a clear picture of NLR diversity based on resequencing efforts,” the authors wrote.
“Our work provides a foundation for the identification and functional study of disease-resistance genes in agronomically important species with more complex genomes,” they added.
Using RenSeq to trace NLR evolution in tomatoes
Another team from the University of California at Berkeley, led by Brian Staskawicz and first author Kyungyong Seong, explored the NLRome of the tomato plant Solanum lycopersicum.
This important crop is challenged by more than 200 diseases caused by diverse pathogens. Low genetic diversity and changes of resistance (R) genes of the cultivated tomato during domestication have led to a heightened urgency for genetic improvement. Scientists are keen to draw upon the hereditary disease resistance of wild tomato species that have co-evolved with their pathogens in highly diverse habitats, and R genes in particular, which have shown more durable resistance against multiple pathogens.
Previous attempts to use RenSeq to selectively capture and sequence NLRs in tomato have been limited by the method’s inability to completely resolve highly repetitive sequences and physical clusters of NLRs, the team noted in a bioRxiv preprint, so they turned to SMRT Sequencing as well.
They employed SMRT RenSeq to identify NLRs from 18 Solanaceae accessions (S. lycopersicum Heinz, Nicotiana benthamiana, Capsicum annuum ECW20R, plus 15 wild tomato accessions belonging to five species).
They produced 264 to 332 high-quality NLR gene models in tomato, and annotated 314 NLRs alongside the reference genome of S. lycopersicum Heinz.
“Our RenSeq results improved the annotation of 128 NLRs, including 13 existing annotations which were incomplete because of mis-assembly or unfilled gaps,” the authors wrote.
“We demonstrated that SMRT RenSeq is a cost-effective, efficient alternative to the whole genome sequencing. We also verified that SMRT RenSeq was capable of…resolving the complexity of NLRs and their clusters.”
The team used the gene models and annotations to explore NLR evolution, but noted that larger scale comparative studies including evolutionarily distant wild tomato species should be done to provide more comprehensive insights, and to expand the scope of NLRome for genome engineering and breeding of tomatoes.
“Our study provides high quality gene models of NLRs that can serve as resources for future studies for crop engineering and elucidates greater evolutionary dynamics of the extended NLRs than previously assumed,” the authors wrote.
The PIK3CA oncogene has been the target of intense research scrutiny for decades. Remarkably, though, a new paper in Science today reports completely novel findings about compound mutations that are associated with patients who respond extremely well to targeted therapies. While more studies are needed, this work has important implications for delivering treatment to patients with breast cancer and other common cancers.
“Double PIK3CA mutations in cis increase oncogenicity and sensitivity to PI3Kα inhibitors” comes from lead author Neil Vasan, senior authors Maurizio Scaltriti and José Baselga, and collaborators at Memorial Sloan Kettering Cancer Center, the Icahn School of Medicine at Mount Sinai, one pharma company, and other institutions. The project emerged from follow-up studies of a breast cancer patient previously identified as a super-responder to the targeted PI3Kα inhibitor alpelisib. It turned out the patient had double PIK3CA mutations, so scientists embarked on an effort to find out whether that mattered for response to treatment — and what they learned could change how oncologists approach therapy selection.
As it turns out, multiple PIK3CA mutations are far more common than expected; the vast majority are double mutations. Researchers detected them across a wide variety of cohorts in 12% to 15% of breast cancers and other types of cancer. Previously, that number was believed to be less than 1%.
According to the authors, that discrepancy can be attributed to the approaches traditionally used to analyze PIK3CA mutations in cancer genomes. “The common practice of sequencing only certain single-nucleotide variants or some but not all exons across a gene likely underestimates the frequency of multiple mutations in PIK3CA mutant cancers,” Vasan et al. write.
For this project, scientists deployed SMRT Sequencing to analyze the full PIK3CA gene, which not only gave them more complete information, but also allowed them to phase mutations and determine when the double mutations occurred in the same allele. “Establishing their allelic configuration is important because cis mutations would result in a single protein with two mutations, whereas trans mutations would result in two proteins with separate individual mutations, and these could have different functional consequences,” the team notes.
But proving the mutations occurred in cis was no easy task. “To study the allelic configuration of double mutations, we faced several technical hurdles based on our observation that the most frequent double PIK3CA mutants are located far apart in genomic DNA.” This meant that short-read and even Sanger sequencing could not span the distance. It also meant that analyzing degraded DNA from FFPE samples would not support the kind of full-length sequence that was required. Researchers went back to patients and collected new samples that were frozen prior to sequencing. With those samples and long PacBio reads, scientists were able to distinguish patients with cis versus trans mutations.
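To make the cis-versus-trans distinction concrete, here is a minimal sketch of how reads that span both mutation sites can be used to phase a double mutation. It is not the authors' pipeline: the read representation (a dict of position to observed base), the site coordinates, and the `min_fraction` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def phase_double_mutation(
    reads: Iterable[Dict[int, str]],
    site_a: int,
    site_b: int,
    mut_a: str,
    mut_b: str,
    min_fraction: float = 0.8,  # hypothetical threshold, not from the paper
) -> str:
    """Return 'cis', 'trans', or 'undetermined' for two mutation sites.

    Each read is a dict mapping a reference position to the base observed
    there. Only reads covering BOTH sites can say anything about phase,
    which is why short reads fail when the sites are far apart.
    """
    counts: Counter = Counter()
    for read in reads:
        if site_a not in read or site_b not in read:
            continue  # read does not span both sites
        has_a = read[site_a] == mut_a
        has_b = read[site_b] == mut_b
        if has_a and has_b:
            counts["both"] += 1    # both mutations on one molecule
        elif has_a or has_b:
            counts["single"] += 1  # only one mutation on this molecule
        # reads that are wild type at both sites are uninformative

    informative = counts["both"] + counts["single"]
    if informative == 0:
        return "undetermined"
    if counts["both"] / informative >= min_fraction:
        return "cis"
    if counts["single"] / informative >= min_fraction:
        return "trans"
    return "undetermined"


# Toy data with made-up coordinates (100 and 900) and alleles.
cis_reads = [{100: "A", 900: "G"}] * 5 + [{100: "C", 900: "T"}] * 5
trans_reads = [{100: "A", 900: "T"}] * 4 + [{100: "C", 900: "G"}] * 4
print(phase_double_mutation(cis_reads, 100, 900, "A", "G"))
print(phase_double_mutation(trans_reads, 100, 900, "A", "G"))
```

A read that covers only one of the two sites is skipped entirely, which mirrors the limitation described above: without molecules long enough to span both positions, the configuration cannot be called.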
“The overall consequence of these cis mutations is a phenotype of enhanced oncogenicity and greater sensitivity to PI3Kα inhibitors,” the team writes. “Our findings provide a rationale for testing whether patients with multiple–PIK3CA-mutant tumors are markedly sensitive to PI3Kα inhibitors.”
We caught up with Vasan shortly before this publication was released and asked him about the implications of these findings. “This gene has been so well studied for decades, it’s humbling that we found something new,” he told us. “In the cancer sequencing field, I think that we’ve hit a plateau in terms of single nucleotide variation. From a discovery point of view, we need to focus on higher-order interactions such as these double mutations.”
The cultivation and conservation of one of the most important commercial fishes in the world may come down to sex determination — how can you successfully breed a species without knowing the sex of your stock?
A Japanese research team has come up with a solution, thanks to a new Pacific bluefin tuna reference genome and the male-specific DNA markers they were able to identify as a result.
In a study published recently in the Nature journal Scientific Reports, first author Ayako Suda and lead author Atushi Fujiwara of the Japan Fisheries Research and Education Agency in Yokohama described how they developed a PCR assay to accurately identify male tuna, based on a new high-quality PacBio and Illumina assembly.
Wild populations of Thunnus orientalis have been in drastic decline due to overfishing, leading Japan and other nations to develop full-life-cycle tuna aquaculture systems as early as the 1970s. They have identified optimum rearing conditions for the species, but these conditions are difficult to achieve. Spawning is strongly influenced by environmental factors such as water temperature, for example, and only some females spawn in cultivation conditions, reducing genetic diversity. Controlling the sex ratio in sea cages could help increase the production of fertilized eggs, but the Pacific bluefin tuna lacks morphological sexual dimorphism, making it difficult to identify and remove males. Furthermore, identifying males through gonad inspection can be lethal and inconclusive in young fish.
So the researchers set out to improve the T. orientalis genome, first assembled in 2013. Despite being used as a reference, that assembly is highly fragmented, with a large number of gaps in its scaffolding.
By combining sequence data from PacBio long reads and Illumina short reads, the Japanese team created a 787 Mb genome assembly in only 444 scaffolds with a contig N50 of 3 Mb. This represents a 376-fold increase in contiguity and a 148-fold reduction in the number of gaps compared to the existing reference.
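For readers unfamiliar with the N50 metric quoted here, the following sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The contig lengths in the example are invented and are not taken from the tuna assembly.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Invented lengths for illustration only.
example_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_contigs))
```

A higher N50 means that more of the assembly sits in long, uninterrupted pieces, which is why it is commonly reported as a measure of contiguity.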
By analyzing resequencing data from several males and females, the team identified 250 male-specific SNPs among more than 30 million polymorphisms, clustered in seven distinct regions. They then focused on one region in particular: a 3,174 bp section of a single scaffold that contained 51 male-specific variants. They created a PCR-based sex identification assay targeting this stretch of DNA and achieved high accuracy in testing across 115 fish.
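One simple way to think about a "male-specific" variant is an allele carried by every male and by no female. The sketch below applies that rule to toy genotype calls. The filtering criterion, the data layout, and the site and sample names are assumptions made for illustration; the study's actual criteria are not described in this post.

```python
from typing import Dict, Set


def male_specific_sites(
    genotypes: Dict[str, Dict[str, Set[str]]],
    sexes: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Return sites with alleles seen in all males and in no female.

    genotypes maps a site name to {sample: set of alleles observed}.
    sexes maps a sample name to "M" or "F".
    """
    hits: Dict[str, Set[str]] = {}
    for site, calls in genotypes.items():
        males = [alleles for sample, alleles in calls.items() if sexes[sample] == "M"]
        females = [alleles for sample, alleles in calls.items() if sexes[sample] == "F"]
        if not males or not females:
            continue  # need both sexes genotyped to make a call
        shared_by_males = set.intersection(*males)
        seen_in_females = set.union(*females)
        specific = shared_by_males - seen_in_females
        if specific:
            hits[site] = specific
    return hits


# Hypothetical samples and sites.
sexes = {"m1": "M", "m2": "M", "f1": "F", "f2": "F"}
genotypes = {
    "scaffold1:100": {"m1": {"A", "G"}, "m2": {"A", "G"}, "f1": {"A"}, "f2": {"A"}},
    "scaffold1:200": {"m1": {"C"}, "m2": {"C", "T"}, "f1": {"C", "T"}, "f2": {"C"}},
}
print(male_specific_sites(genotypes, sexes))
```

In the toy data, only the first site passes: the `G` allele is heterozygous in both males and absent from both females, the pattern expected for a male-linked marker.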
“Sex identification using our PCR assay is easy, requiring minimal handling of individuals. Moreover, sex of juveniles can be identified using our method, allowing the sex ratio in cages to be adjusted at an early stage, which could enhance breeding programs,” the authors state.
They also note that the approach might be less stressful to tuna, and requires less effort than sampling approaches based on fin clips or muscle tissues from live individuals.
“It could also be used to obtain data from wild populations, providing useful information for the management and conservation of these natural stocks,” they add.
The assay could be used in surveys of sex ratios without waiting for fish to reach sexual maturity, for example. Incorporating the sex of fish while tracking their migration patterns, a strategy used by Barbara Block and colleagues in California, could also provide valuable information for the management of wild tuna fisheries.
“Our improved draft genome provides a solid foundation for future population and resource management studies of Pacific bluefin tuna,” the authors conclude.
At the annual meeting of the American Society of Human Genetics in Houston, PacBio scientists presented how our Sequel II System performs for structural variant (SV) detection and for whole transcriptome sequencing. The educational workshop focused on experiments that can be done using a single SMRT Cell 8M on the Sequel II System.
The event kicked off with Aaron Wenger walking through SV analysis, which he said has mirrored the development path of single nucleotide variants, from proof-of-concept to individual rare disease studies and now to large cohort studies like SOLVE-RD in Europe and All of Us in the United States.
Wenger showed a variety of SV types that can be detected with highly accurate SMRT Sequencing, such as insertions, deletions, translocations, and inversions. He also showed the standard SV discovery and analysis workflow, and the precision and recall performance of this method. He noted that a single SMRT Cell is now sufficient to achieve high recall for SV discovery for two human samples.
Next up, Elizabeth Tseng (@magdoll) spoke about the Iso-Seq method for full-length transcript sequencing that eliminates the need for bioinformatics-driven isoform assembly and enables direct ORF prediction even without a reference genome. With the Iso-Seq Express Kit, she noted, customers can generate reliable results from as little as 60 nanograms of total RNA.
Using an Alzheimer’s brain Iso-Seq dataset released for ASHG, Tseng demonstrated the comprehensive, highly accurate results for full-length isoform detection using SMRT Sequencing. More than 99% of transcripts reported by the Iso-Seq method are more than 99% accurate, she added. She also looked at single-cell Iso-Seq experiments, showing examples of SMRT Sequencing paired with DropSeq or 10x Genomics workflows, and at how Iso-Seq analysis can be used to solve rare diseases or characterize cancer fusion genes.
Thanks to all the attendees who took the time to listen to our presentations. We hope you enjoyed the meeting as much as we did!
In a new Science publication, researchers from the University of Washington and other institutions report detailed analyses revealing the adaptive importance of copy number variants (CNVs) acquired from Denisovan and Neanderthal ancestors, the closest relatives of modern humans, in the modern-day Melanesian population. The team used PacBio long-read sequencing to study these complex stretches of DNA and the Iso-Seq method to generate full-length transcript data.
“Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes” comes from lead author PingHsun Hsieh (@phhBenson), senior author Evan Eichler, and collaborators. For the project, they focused on the Melanesians, an oceanic population that is known to have more Denisovan and Neanderthal ancestry than other groups. This made an excellent foundation for studying the role of CNVs in adaptation and archaic introgression.
“Relatively little is known about the extent to which CNVs contribute to the genetic basis of local adaptation and, more importantly, whether CNVs introgressed from other hominins may have been targets of adaptive selection,” the authors write.
As part of this project, scientists focused on “two of the largest and most complex” CNVs found in the Melanesian genome — a 5 kb duplication and a 73.5 kb duplication, both on chromosome 16 — for a deeper investigation. “Both events are largely restricted to Melanesians and the Denisovan archaic genome F2 and are thought to be involved in a single >225-kb complex duplication (DUP16p12) introgressed from the Denisovan genome,” they report. “This region has been difficult to correctly sequence and assemble, and only recently has the sequence structure of the ancestral locus (>1.1 Mb) been correctly resolved.”
To better understand the original duplication, the team generated 75-fold whole-genome coverage of a Melanesian individual using SMRT Sequencing. This allowed them to narrow down the insertion location to a 200 kb region that is enriched in segmental duplication that “predisposes the region to recurrent structural rearrangements associated with autism and developmental delay,” Hsieh et al. write.
By applying the Segmental Duplication Assembler, a methodology recently published in Nature Methods, they wound up with a 1.8 Mb contig including the correctly assembled Melanesian duplication. “Notably, the sequence-resolved assembly shows that the actual length of DUP16p12 duplication polymorphism is ~383 kb, which is longer than previously thought,” the authors report. “Sequence and phylogenetic analyses suggest that the variant originated from a series of complex structural changes involving duplication, deletion, and inversion events ~0.5 to 2.5 million years ago within the Denisovan ancestral lineage.” That duplication was inserted into the Denisovan genome within the last 200,000 to 500,000 years and subsequently introgressed into the ancestors of Melanesians between 60,000 and 170,000 years ago, the authors conclude.
The team performed Iso-Seq with hybridization capture probes toward this region to produce full-length gene models and better characterize the functional effects of CNVs in the Melanesian genome. Based on their results — including a comparison to gene models from other humans and the chimpanzee — the scientists found that the 383 kb duplication is likely adaptive. “This helps to explain why this polymorphism has become nearly fixed within the Melanesian populations (>80%) despite its large size, which is typically regarded as selectively disadvantageous,” they note. “Notably, the Melanesian-specific gene NPIPB shows ~3% amino acid divergence and evidence of positive selection despite its recent origin.” The scientists predict that the proximity of this duplication to a genomic region associated with autism (chr16p11.2) will have an impact on the frequency of autism-associated rearrangements in the Melanesian population.
Based on these results and other data confirming Neanderthal-origin CNVs in the Melanesian genome, the scientists were able to “reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations,” they write. “This study highlights the substantial large-scale genetic variation that remains to be characterized in the human population and the need for development of additional reference genomes that better capture the diversity of our species and complete our understanding of human genes.”
We caught up with Hsieh at ASHG 2019, where he was presenting a poster on this research. He summarized the project by stating, “The high-quality, long-read sequencing data opens up an unprecedented venue to study variants in complex genomic regions. The ability to access these new variants helps us advance our understanding of the biology and evolution of our own species.”
A recent bioRxiv preprint reports efforts to sequence the genome of a Tibetan individual and detect the genetic underpinning of adaptive traits associated with tolerating high altitude. The authors used SMRT Sequencing to achieve extremely high contiguity and accuracy, and incorporated scaffolding and other complementary technologies to build a robust assembly.
The results are reported in the preprint, “De novo assembly of a Tibetan genome and identification of novel structural variants associated with high altitude adaptation.” Lead author Ouzhuluobu, senior author Bing Su, and collaborators discuss their evaluation of the new genome assembly as well as key findings from it. They chose to focus on a Tibetan person because of the population’s unique and long-term residence in “one of the most extreme environments on earth”— the Tibetan Plateau, at an average elevation exceeding 4.5 kilometers.
The team’s genome assembly, named “ZF1”, is the first for a Tibetan individual. Using the assembly, the scientists identified 6,500 structural variants that were not detected in two other long-read Asian genome assemblies. “[Genes near] ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway,” the authors report. In addition, they found a “163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese.” They note, “This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans.”
Previous studies had suggested that the Tibetan population may have more genomic content from archaic hominid species, such as the Denisovans, than other modern populations. “To take advantage of the de novo ZF1 assembly, we performed a genome-wide search of archaic sharing non-reference sequences (NRSs) and compared the results with the two de novo assembled Asian genomes (AK1 and HX1),” the authors report. “We found a total length of 39.6 Mb and 45.9 Mb sequences shared with those of Altai Neanderthal and Denisovan, corresponding to 1.32% and 1.53% of the entire ZF1 genome respectively. These archaic proportions are much higher than that in AK1 (0.82% and 0.70%) or HX1 (0.98% and 0.85%).” One of the archaic shared regions is a 662 bp insertion associated with improved lung function.
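The percentages quoted above are shared sequence length divided by genome size. As a quick sanity check, the sketch below reproduces them under the assumption of a genome of roughly 3,000 Mb; that figure is back-calculated from the quoted numbers and is not stated in the excerpt.

```python
def percent_of_genome(shared_mb: float, genome_mb: float) -> float:
    """Return shared sequence as a percentage of total genome length."""
    return 100.0 * shared_mb / genome_mb


# Assumption: genome size inferred from the quoted figures, not from the preprint.
ASSUMED_GENOME_MB = 3000.0

neanderthal_pct = percent_of_genome(39.6, ASSUMED_GENOME_MB)
denisovan_pct = percent_of_genome(45.9, ASSUMED_GENOME_MB)
print(f"Neanderthal-shared: {neanderthal_pct:.2f}%")
print(f"Denisovan-shared: {denisovan_pct:.2f}%")
```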
“The high-quality genome allows us to better understand the sequences showing population-level or individual-level specificity where they are different or even absent from the human reference genome,” the scientists write. “Our study demonstrates the value of constructing a high-resolution reference genome of representative populations (e.g. native highlanders) for understanding the genetic basis of human adaptation to extreme environments as well as for future clinical applications in hypoxia-related illness.”
A new review article nicely sums up the utility of long-read sequencing for solving rare diseases that cannot be explained by other methods. The paper, published in the Journal of Human Genetics, comes from authors Satomi Mitsuhashi and Naomichi Matsumoto at Yokohama City University in Japan.
The scientists note that long-read sequencing serves as a good complementary approach for cases that are not solved with short-read sequencing alone. “The approximate current diagnostic rate is <50% using [short-read whole exome and genome sequencing], and there remain many rare genetic diseases with unknown cause,” Mitsuhashi and Matsumoto write. “There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome [or are types of variants] that are difficult to sequence using conventional technologies.”
Many recent projects have used long-read sequencing technologies to discover pathogenic variants associated with rare disease. “The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases,” the scientists report.
The review discusses several particular types of disease-causing variants, including tandem repeats, structural variants, complex rearrangements, and transposable elements. In addition to citing studies that have used long-read sequencing to search for pathogenic variants, the scientists also consider why long reads make a difference for each situation. With tandem repeats, for example, they note that “long tandem repeats are difficult to analyze by Sanger sequencing,” and “long reads are a straightforward way to detect repeat changes because an adequately long read can encompass an entire expanded repeat as well as flanking unique sequences.”
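The point about spanning reads can be illustrated with a small sketch: if a single read contains the unique sequence on both sides of a tandem repeat, the number of repeat units can be counted directly from that read. The flank sequences and motif below are made up, and the exact string matching is a simplification; real reads contain errors and are normally aligned rather than searched with exact matches.

```python
from typing import Optional


def count_repeat_units(
    read: str, left_flank: str, right_flank: str, motif: str
) -> Optional[int]:
    """Count consecutive motif copies between two unique flanks in one read.

    Returns None when the read does not contain both flanks, meaning it
    does not span the whole repeat and cannot be used for sizing.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None
    tract = read[start:end]
    count = 0
    position = 0
    # Count uninterrupted motif copies from the start of the tract.
    while tract.startswith(motif, position):
        count += 1
        position += len(motif)
    return count


# Hypothetical flanks and motif for illustration.
LEFT, RIGHT, MOTIF = "ACGTTGCA", "TTAGGCAT", "CAG"
spanning_read = LEFT + MOTIF * 5 + RIGHT
truncated_read = LEFT + MOTIF * 3  # ends inside the repeat
print(count_repeat_units(spanning_read, LEFT, RIGHT, MOTIF))
print(count_repeat_units(truncated_read, LEFT, RIGHT, MOTIF))
```

The truncated read returns `None` because it ends inside the repeat, which is the situation a short read faces when the expansion is longer than the read itself.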
Mitsuhashi and Matsumoto also review studies in which researchers made use of the PacBio No-Amp targeted sequencing application to target a region of the genome using CRISPR instead of PCR amplification. In the studies, scientists “found this approach accurate” and obtained high-coverage HiFi reads of the targeted region.
Going forward, the authors suggest a workflow for solving rare disease cases: begin with short-read exome or whole genome sequencing for small variants, and if that does not yield an answer, move on to long-read sequencing for larger variants. “Long-read sequencing is especially highly recommended when repeat diseases or complex chromosomal rearrangements are suspected,” they conclude.
Matsumoto will be presenting his team’s research at our ASHG 2019 workshop on Wednesday, October 16, 2019. Register today to reserve your seat or to get the recording.
Learn more about Variant Detection with SMRT Sequencing.
When MRSA hits your hospital, what do you do?
If you’re located in Europe or other places where infection rates are still relatively low, you can take a seek-and-destroy approach, isolating an affected patient and working out in concentric circles to identify contacts and potential transmissions.
If you’re in New York City, however, the strategy is not so simple. Hospital-associated infections with methicillin-resistant Staphylococcus aureus are endemic in the Big Apple, and this has required a fresh approach to treat and prevent the costly bacterial menace.
At Mount Sinai Hospital, the strategy now involves SMRT Sequencing. Established in 2013, the Mount Sinai Pathogen Surveillance Program has sequenced more than 2,000 genomes, cataloging around 43,000 isolates from 22,000 patients. While its original role was in reactive outbreak investigation, it is now also used as a tool for proactive, continuous infection surveillance for common hospital pathogens.
As previously reported in this blog post and this webinar, adding SMRT Sequencing to routine surveillance of MRSA and C. difficile throughout the hospital has provided a more comprehensive view of drug resistance and revealed new pathogenic strains and unexpected transmission paths.
We recently caught up with Harm van Bakel, Assistant Professor of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, to learn more about how the program has evolved and some of the results published in this paper.
Why did you choose SMRT Sequencing?
MRSA is highly clonal. Traditional molecular strain typing methods, such as pulsed-field gel electrophoresis (PFGE), S. aureus protein A (spa) typing, and multilocus sequence typing (MLST), can facilitate rapid screening, but their resolution is limited, and they are unable to capture genetic changes that lead to alteration or loss of typing elements.
As a result, short-read whole genome sequencing has emerged as the gold standard for studying lineage evolution and nosocomial outbreaks. By comparing sequences of your clone of interest to a reference genome, you are able to profile changes in the core genome. However, you may still be missing crucial information contained in non-conserved ‘accessory’ genome elements, which harbor a lot of virulence and drug resistance determinants, and evolve more rapidly.
These accessory elements, which include endogenous prophages, mobile genetic elements, and plasmids, are very repetitive in nature, so long-read sequencing that is able to resolve these was needed to help us determine what’s going on in both the core and accessory genome elements. It also gives us additional information to tease apart the evolution of an outbreak.
In the paper, we gave an example of a persistent outbreak in the hospital’s neonatal intensive care unit, which was eventually traced to adult hospital wards, with ventilators as a potential vector. Long-read sequencing enabled comparative genome and gene expression analyses of the outbreak clone to hospital background strains, in which we identified genetic and epigenetic changes, including acquisition of accessory genome elements that may have contributed to the persistence of the outbreak clone.
What lessons have you learned from continuous surveillance?
Logistically, we have learned that integration and automation are key. In the beginning, we were a little naive. But we quickly realized that you can’t just analyze the genomes in a vacuum. Just identifying infection relatedness between two patients isn’t enough either. You have to understand how certain strains may have spread between patients. This requires detailed information about the patient’s condition, where they have been in the hospital, which staff and equipment they have come in contact with. We had to develop bioinformatics tools to layer genetic information on top of patient and hospital epidemiologic records to create a single, integrated map. We’ve found we need an entire support system in addition to the sequencing in order to make it work. We’re fortunate that our health system has the centralized testing and medical records systems to help us operate in an efficient manner.
Scientifically, we have learned that there’s a lot more happening under the surface than we were aware of, with regard to strain types, evolution, frequency, virulence and transmission. We continue to see under-the-radar outbreaks that we wouldn’t be aware of without a sequencing-based surveillance program. And we’ve learned that patients can stay colonized with MRSA for a long time, and that it can re-emerge weeks, if not months, later when those patients return to the hospital.
The continuous surveillance has led to a much better understanding of how pathogens circulate throughout the hospital. It allows us to be more proactive. In some cases, it has led to interventions in certain wards. In the case of the NICU outbreak, if we had implemented continuous sequencing before it occurred, we may have been able to intervene much sooner.
We want to continue to improve the program so that it’s fast and cost effective enough to inform infection prevention. Not only does this benefit patient care, but it can help avoid larger outbreaks, ward closures, and other costs associated with investigating and reacting to infections.
We also want to build a more comprehensive view of all pathogens, so we have started to track viral pathogens too, including influenza.
And we have expanded the program beyond just Mount Sinai Hospital, into the entire Mount Sinai Health System, including other hospitals and community care facilities that cover most of the Manhattan area. We hope this will help us understand just how pathogens are moving throughout the system as patients travel from facility to facility, and to detect new pathogens emerging from the community.
We can’t make headway into reducing the MRSA endemic unless we understand it better. We need to better map the route of transmission of the pathogen between people, and in the environment. This is only possible through hospital-wide – and ideally region-wide – surveillance.
The DNA sequencing community lost one of its founding fathers last month with the death of Jo Messing, director of the Waksman Institute at Rutgers University.
Dr. Messing, who died at the age of 73, developed shotgun sequencing and the M13 sequencing vector used for cloning in the 1980s. Because he declined to patent this work, it was freely available and quickly became the foundation for a burgeoning molecular genetics field.
Dr. Messing’s scientific acumen and commitment to innovation in DNA sequencing remained a guiding force for the community throughout his life. In a PNAS paper published just a few days after his death, he and his colleagues presented some truly fascinating discoveries about the evolutionary and immune function of sequence repeats in the genome of Spirodela polyrhiza, an aquatic plant.
His dedication to generating high-quality genome assemblies for plants helped improve our understanding of maize, rice, sorghum, and other challenging genomes. A few years ago, we reported on his analysis of structural variation in maize, in which his team produced a highly accurate representation of copy number variation in the plant’s genome.
The PacBio team will remember Dr. Messing as a kind and generous scientist whose enthusiasm for our long-read sequencing technology was a true gift. He always remained humble no matter how great his accomplishments, and we count ourselves fortunate to have known him.
It was the first multicellular eukaryotic genome sequenced to apparent completion, but it turns out the Caenorhabditis elegans reference that’s been used as a resource for the past 20 years does not exactly correspond with any N2 strain that exists today.
Assembled using sequence data from N2 and CB1392 populations of uncertain lineage grown in at least two different laboratories during the 1980s and 1990s, the C. elegans reference genome is limited in accuracy both by genetic variants and by the technology of the time (clone-based Sanger technology). It is believed the strain may have accumulated up to 1,000 neutral mutations even before it was first frozen in 1969, with substantial genetic differences arising between strains in different laboratories since then.
So a team of researchers from Stanford, Cornell, and the University of Tokyo sought to recomplete the genome by performing long-read assembly of VC2010, a modern and easily available nonmutagenized derivative of N2. Not satisfied with the completeness of earlier assembly attempts, the team decided to use three sequencing technologies: Illumina short reads, as well as PacBio and Nanopore long reads.
As described in their cover-gracing Genome Research study, their VC2010 assembly has 99.98% identity to N2, but with an additional 1.8 Mb, including tandem repeat expansions, genome duplications, and more than 53 newfound genes. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2.
“Although we do not expect this or any assembly to be perfect, the VC2010 assembly provides substantial advantages over its predecessors in both precision and completeness,” the authors wrote.
The team assembled raw PacBio reads with the long-read genome assemblers Canu, FALCON, miniasm, and HINGE, yielding complementary assembly gaps from the same input sequencing data. Merging these PacBio assemblies resulted in an assembly containing only five gaps across the genome. Using Nanopore long reads, they were able to close three more, bringing the total gaps down to two.
The improved assembly “yielded features not visible in the N2 reference assembly, and has several technical and biological implications,” they added.
They also suggested that more of the nematode genetic record may need to be corrected. They note that almost 2% of the putatively gap-free C. elegans genome proved to be missing from the N2 assembly, including long stretches of repetitive DNA, and said “it seems likely that most of the nematode assemblies generated over the last decade are missing some repetitive regions of genomic DNA.”
Such highly tandemly repeated regions may be crucial for understanding fast-evolving gene families relevant to nematode ecology, and for identifying rapidly evolving virulence factors in parasites such as N. brasiliensis, they added.
“With the possible exception of highly reduced genomes such as Pratylenchus coffeae, long-read assembly will probably be needed to detect and resolve these systematically lost genome sequences,” they wrote.
In order to ensure reproducibility of their new assembly in vivo, the team derived a highly clonal strain from VC2010, called PD1074 (available here), and used it to generate most of the genomic sequence data.
“C. elegans researchers who wish to have significantly higher genomic and genetic reproducibility than is possible with N2 are encouraged to adopt PD1074 as a new reference strain for wild-type controls, classical mutagenesis, and genome engineering,” the authors wrote.
Reining in the Wild Strains
In a second Genome Research paper, researchers from Seoul National University generated a de novo assembly of CB4856, which is one of the most genetically divergent strains of C. elegans compared to the N2 reference strain.
Their study sought to determine how substantial genomic changes are generated and tolerated within a species, and to compare the wild strain with the N2 reference, as the two have numerous heritable phenotypic differences, including aggregation behavior, mating, nictation behavior, pathogen response and genetic incompatibility.
Not satisfied that the current, short-read generated CB4856 reference genome accurately represents genomic rearrangements that are longer than the insert length, and concerned that it might be missing insertions and repetitive sequences, the team generated their own PacBio genome assembly to the level of pseudochromosomes containing 76 contigs.
They identified structural variations that affected as many as 2,694 genes, and found that subtelomeric regions contained the most extensive genomic rearrangements, even creating new subtelomeres in some cases.
The high variability of subtelomeres over generations facilitates the emergence of new genes and may help to increase the fitness of organisms, the authors note. However, subtelomeres — hypervariable regions adjacent to the telomere — are highly repetitive by nature, which makes genome assembly at their sites very difficult and has hampered the study of their involvement in chromosome evolution.
The subtelomere structure that the Korean team was able to unravel with PacBio sequencing implies that ancestral telomere damage was repaired by alternative lengthening of telomeres, even in the presence of a functional telomerase gene, and that a new subtelomere was formed by break-induced replication, the authors said.
“Our study demonstrates that substantial genomic changes including structural variations and new subtelomeres can be tolerated within a species, and that these changes may accumulate genetic diversity within a species,” they wrote.
The researchers said they hoped their CB4856 genome will serve as a better reference genome for wild C. elegans strains, and that the numerous SVs between N2 and CB4856 will help to better understand the effect of SVs on traits by association studies using these strains.
The National Human Genome Research Institute has awarded nearly $30 million for new sequencing and bioinformatics initiatives that aim to better represent the full range of human genetic diversity. An entirely new human reference genome — the “pangenome” — will be built from high-quality sequencing of 350 individuals from across the human population. Here at PacBio, we’re excited that highly accurate long-read WGS data from our Sequel II Systems will be an important component of this new program.
“It has grown more and more important to have a high-quality, highly usable human genome reference sequence that represents the diversity of human populations,” said Adam Felsenfeld, NHGRI program director in the Division of Genome Sciences, in a statement announcing the news.
The Human Pangenome Project will be carried out at a sequencing center and a reference center, each having extensive collaborations with scientists at many institutions. The University of California, Santa Cruz (UCSC) will form the sequencing center with teammates at the University of Washington, Washington University in St. Louis, and Rockefeller University. Their goal will be to use cutting-edge sequencing technologies, including SMRT Sequencing, to create higher-quality assemblies than have been produced in the past. A separate reference center will be created by Washington University in St. Louis, UCSC, and the European Bioinformatics Institute. Its aim will be to pull all 350 assemblies together into a useful representation for the genomics community.
“The human pangenome reference will be a key step forward for biomedical research and personalized medicine. Not only will we have 350 genomes representing human diversity, they will be vastly higher quality than previous genome sequences,” said UCSC’s David Haussler in a statement announcing these grants. “We are going to use all of the latest and best sequencing technologies and push their capabilities to get the most complete and accurate sequences possible.”
Evan Eichler, professor of genome science at the University of Washington and a Howard Hughes Medical Institute investigator, said in a statement: “We finally have the technology and methods to go after the parts of the human genome that were beyond our reach 20 years ago. It’s an exciting time for human genetics with implications for improved variant discovery associated with disease.”
For genome sequencing with PacBio technology, scientists will use the Sequel II System’s HiFi reads for highly accurate consensus sequencing of multi-kilobase inserts. In a recently posted preprint about a haploid human genome, scientists noted that HiFi mode successfully resolved segmental duplications (SDs), which are known for being difficult to assemble; in fact, authors stated that the HiFi assembly has “the highest fraction of resolved SDs for any of the published assemblies analyzed thus far.” They concluded, “Our results suggest that HiFi may currently be the most effective stand-alone technology for de novo assembly of human genomes.”
A hearty congratulations to all the scientists who will be involved in this new initiative! We look forward to working closely with these teams as they generate much-needed data for the biomedical research community.
It was the coolest critter Erin Bernberg (@ErinBernberg) had ever worked with – quite literally.
The senior scientist at the University of Delaware Sequencing and Genotyping Center, a PacBio certified service provider, received a shipment of tiny, live ice worms from Washington State University and immediately faced several challenges. How would she get them out of their ice cubes? How would she isolate DNA from the delicate, darkly pigmented creatures? And would she be able to extract enough DNA to sequence?
Thanks to the new PacBio low DNA input protocol, the answer to the last question was yes. In fact, Bernberg was able to create a library with enough yield for up to 17 SMRT Cells 1M from only 500 ng of DNA.
She detailed her process for extracting ice worm DNA and shared some of the results during a recent webinar about the new low DNA input workflow.
“Low-input works,” Bernberg said. “You don’t need as much DNA as everyone is worrying about with PacBio. You can generate good data with it.”
Tiny body, giant genome
The ice worm may be tiny at ~1.5 cm, but its genome is giant compared to other annelids, around 1.5 Gb.
Previous attempts to sequence the genome failed because it was “inconveniently large” and the DNA came from pools of several specimens, which clouded the assembly with individual-specific noise, according to Scott Hotaling (@mtn_science), a postdoctoral researcher in the lab of Joanna Kelley at Washington State University. Hotaling is studying the ice worm as part of his research into species that have adapted to the harsh conditions of extreme environments, such as polar regions.
Hotaling said he is excited to delve into the 160 Gb of data generated by Bernberg’s 10 SMRT Cells. He said the genome they have assembled is around 1,000-fold more contiguous than that of its closest relative, the earthworm.
Why is contiguity important in the ice worm genome? There are so many unknowns when it comes to annelid genetics, Hotaling said. As an example, he cited AMP deaminase, a known regulator of energy metabolism that is suspected to play a part in the ice worm’s unique method of thermal regulation.
“Ice worms do this cool thing where they actually ramp up their energy levels as they get colder, which is essentially the inverse of what a lot of organisms do,” Hotaling said.
A partial sequence of AMP deaminase (540 amino acids long) had been previously generated for ice worms, so Hotaling was able to easily locate it in his new PacBio-generated genome. In doing so, he discovered just how much information had been missing: “tons.”
“Imagine this across 20,000 genes. There’s so much more power to look from gene to gene at variation that might be evolutionarily relevant,” he said.
Evolution at the extremes
Mountain glacier habitats may look desolate, but there’s a lot going on below the surface. Not only are there millions of microbes, but also snow algal blooms and other lifeforms. Among them, ice worms (Mesenchytraeus solifugus) are the largest to spend their life cycle in ice.
The ice worms that Hotaling studies on Mt. Rainier survive extremely harsh conditions. Tens of millions of ice worms can live on a single glacier. Not only do they face constant cold stress, but the highly reflective ice and snow surfaces make the glaciers some of the most UV intensive places in the world.
It’s unclear how they have adapted to such an extreme lifestyle, but such knowledge could provide insight into evolutionary processes, Hotaling said. He wants to learn whether their extreme lifestyle translates to extreme physiology, and how. He will be exploring which genes and pathways have been under selection in ice worm genomes, and whether there is congruence between genes under selection and those differentially expressed in response to stress.
“We want to know things like can they freeze? It seems like a dumb question because they live in ice, but scientific observations from 30 years ago suggest they can only handle temperatures within a few degrees of freezing,” Hotaling said.
Early experiments suggest the worms are living at the absolute lower limit of their thermal tolerance, and they also exhibit a high tolerance for UV. His team is now fleshing out the picture by layering RNA-based annotations on top of the main sequencing data.
“Lineages like the ice worm are pretty far from anything else that’s been sequenced. So just looking at what’s in there is, in its own right, a pretty interesting pursuit. But we’d also like to add some context to look at genome evolution and gene expression,” he said.
He is also keen to taxonomically annotate the genome to capture contamination from other organisms that the ice worms may have hosted.
“We sequenced an entire worm – all of its gut contents, all of its parasites, whatever was on its body. So contamination is almost a certainty,” he said. “I think this is going to be a big challenge for anyone working in this space. How do we develop tools to really deal with this in a high-powered, efficient way?”
For further information on this topic and specifics of the low DNA input workflow, watch the full webinar and read the application note referenced in the presentation.
Great news from the rare disease community: the European research program SOLVE-RD has chosen SMRT Sequencing technology to help reveal the genetic mechanisms responsible for these tough-to-diagnose genetic diseases. As part of this work, scientists will sequence more than 500 whole human genomes with the PacBio Sequel II System to pinpoint disease-causing variants.
The SOLVE-RD research program, a consortium of more than 20 institutions funded with a five-year, €15 million award from the European Union’s Horizon 2020 initiative, aims to improve the diagnosis and treatment of rare diseases by applying novel tools to cases that were not solved with short-read exome sequencing.
In a press release issued today, Alexander Hoischen, Associate Professor for Genomic Technologies and Immuno-Genomics and a member of the SOLVE-RD team at Radboud University Medical Center, stated: “Even with exome sequencing, as many as 50% of rare disease cases remain unsolved. The SOLVE-RD team believes that long-read SMRT Sequencing will be essential for discovering the causal elements that have proven elusive with previous approaches, and we anticipate that this research will ultimately make it easier for doctors to diagnose other patients with these rare diseases in the future.”
The sequencing will be performed at Radboud University Medical Center. Marcel Nelen, Laboratory Specialist in Genome Diagnostics, commented in the press release: “Our team is eager to deploy PacBio’s Sequel II System to generate hundreds of high-quality human genomes for phenotypes very likely to be associated with challenging genomic regions or structural variants including repeat expansions. In our experience, SMRT Sequencing reliably detects far more structural variants — including pathogenic variants — than any other sequencing technology.”
Attendees of this week’s AGBT Precision Health meeting in La Jolla, Calif., can learn more about this project from Hoischen’s presentation on Saturday, Sept. 7th, at 10 am PDT.
Until recently, enriching for certain regions of the genome has been virtually impossible. Repeat expansions, extreme GC regions, and other genomic elements are very difficult to target using traditional enrichment methods. That’s why our new “No-Amp” targeted sequencing application — a streamlined, amplification-free approach based on the CRISPR/Cas9 system — is a valuable addition to the SMRT Sequencing toolbox.
The method was demonstrated in a recent PLoS One publication, and a new webinar delves into technical details of the protocol. Hosted by our own Paul Kotturi and Jenny Ekholm, the presentation offers an overview of uses for which the No-Amp method is beneficial, real-world examples of its results, and advantages it holds compared to traditionally used PCR and Southern blot techniques.
Kotturi kicked off the presentation with a look at the general advantages of SMRT Sequencing, including long reads, high accuracy, single-molecule resolution, simultaneous epigenetic detection, and uniform coverage. He also noted some recent performance metrics from the new Sequel II System: more than half of the data is in reads >190 kb, and each SMRT Cell 8M generates up to 160 Gb of sequence data. With the HiFi sequencing mode that makes use of circular consensus sequencing, the system can achieve Q30 accuracy with just eight passes around a molecule.
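For readers less familiar with quality scores, the Q30 figure above (and the Q20 reads mentioned later in this post) can be read through the standard Phred definition, Q = -10 · log10(p), where p is the probability that a base call is wrong. The short Python sketch below is only an illustration of that conversion; the function names are ours and are not part of any PacBio software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) back to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q20 and Q30 are the two thresholds mentioned in this post.
    for q in (20, 30):
        p = phred_to_error_prob(q)
        print(f"Q{q}: about 1 error in {round(1 / p):,} bases "
              f"({100 * (1 - p):.1f}% accuracy)")
```

In other words, Q20 corresponds to roughly 99% per-base accuracy and Q30 to roughly 99.9%.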
Next, Ekholm stepped in to focus on the No-Amp application. Generating a sequencing library using the No-Amp method is relatively straightforward. The first step is to block the 5’ and 3’ ends of the genomic DNA, followed by CRISPR/Cas9 digestion. To enrich for the region of interest, guide RNAs are designed to flank each end of the targeted region, making those ends available for sequencing adapter ligation after the Cas9 digestion. The sequencing library is then cleaned up before sequencing. The No-Amp method takes two days (with less than four hours of hands-on time) and is compatible with both the Sequel System and the Sequel II System.
Users of the No-Amp method can multiplex target regions, samples, or both to maximize sequencing efficiency and minimize cost. Typical target insert sizes range from 4 kb to 6 kb, though scientists have successfully extracted even longer fragments with this process, Ekholm noted. The expected yield is hundreds of Q20 reads per target and the on-target rate for the No-Amp method is 40-60%, which translates to enrichment factors of 10,000-100,000 fold.
Later in the webinar, Kotturi discussed elements needed for this protocol: high-purity, high molecular weight DNA; 5-10 µg of DNA per SMRT Cell, but only 1-2 µg per sample when multiplexing 5-10 samples per run; guide RNAs; barcoded adapters, if multiplexing samples; and a No-Amp accessory kit with primers and buffers. He also presented information about cost. In a five-sample multiplex workflow, the cost (U.S. list price) comes to $220 per sample. When multiplexing increases to 10 samples, the cost drops to $130 per sample. When multiplexing multiple targets per sample, these costs drop even further per locus. At PacBio, we routinely run 4 targets per sample.
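To make the per-locus point concrete, here is a back-of-the-envelope calculation in Python using only the figures quoted above. It is a rough sketch, and it rests on one simplifying assumption of ours: that the quoted per-sample cost stays the same no matter how many targets are multiplexed, so any extra cost for additional guide RNAs is ignored. The function name is illustrative.

```python
def cost_per_locus(cost_per_sample: float, targets_per_sample: int) -> float:
    """Spread a per-sample cost evenly across the targets captured in that sample.

    Assumption: the per-sample cost does not change with the number of targets.
    """
    if targets_per_sample < 1:
        raise ValueError("targets_per_sample must be at least 1")
    return cost_per_sample / targets_per_sample


# Per-sample U.S. list prices quoted in the webinar, keyed by samples per run.
QUOTED_COST_PER_SAMPLE = {5: 220.0, 10: 130.0}
TARGETS_PER_SAMPLE = 4  # the number of targets PacBio says it routinely runs

if __name__ == "__main__":
    for plex, cost in QUOTED_COST_PER_SAMPLE.items():
        per_locus = cost_per_locus(cost, TARGETS_PER_SAMPLE)
        print(f"{plex}-sample multiplex: ${cost:.2f} per sample, "
              f"${per_locus:.2f} per locus with {TARGETS_PER_SAMPLE} targets")
```

Under that assumption, four targets per sample works out to $55 per locus in a five-sample run and $32.50 per locus in a 10-sample run.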
If your research would benefit from capturing and sequencing regions that are otherwise intractable, this webinar is well worth your time. It also includes valuable information about data analysis and visualization, specific examples of targeting disease-associated repeat expansion regions, and much more.
Watch the complete webinar and visit www.pacb.com/noamp to learn more.
With her distinctive dark eyeshadow, grey lipstick-like markings and delicate disposition, she was a natural film star. And her life certainly provided enough drama for any Hollywood blockbuster, complete with high-speed boat chases in pursuit of black market “cocaine of the sea” cartels. Unfortunately, her ending was not a happy one. But efforts by an international consortium of conservation geneticists are making sure her legacy isn’t lost.
The DNA of one of the last remaining vaquita porpoises in the world has been preserved and decoded, as part of an ambitious project to create chromosomal-level genome assemblies of all extant vertebrate species on Earth, 70,000 in total.
Members of the Vertebrate Genomes Project (VGP), an international consortium of more than 150 scientists from 50 academic, industry and government institutions in 12 countries, and the Earth Biogenome Project (EBP) gathered in New York on August 27 to announce the completion of the first 100 genomes, including several species of critical conservation and scientific interest. The assemblies represent 77 orders sequenced to such completeness for the first time, which, along with 13 from the previous data set, bring the total to 90 of the 260 orders the group is seeking to sequence as Phase 1 of the project.
The most endangered among them is the vaquita. It is estimated that fewer than 20 remain in the world. The female whose sample contributed to the VGP effort died shortly after a rescue attempt by the group Vaquita CPR. It is hoped that her legacy will live on in the information extracted from her DNA; it has already provided insight into breeding patterns of Phocoena sinus, whose habitat is limited to a small area in the northern Gulf of California.
“We hope and trust that useful information will result that may benefit other endangered species of threatened porpoises. And we are saddened to think that one day, these tissue samples may be all that is left of this animal,” said Oliver Ryder, Ph.D., director of Conservation Genetics for San Diego Zoo Global, where the vaquita’s tissue was taken to be stored in its Frozen Zoo repository.
Her plight was featured in a documentary film produced by Hollywood star Leonardo DiCaprio, Sea of Shadows, which follows the efforts of Mexican police forces to crack down on the vaquita’s biggest threat: illegal poaching of the totoaba fish (Totoaba macdonaldi), whose swim bladders, or maws, fetch high prices in Chinese markets for their use in traditional medicine. The tiny vaquita – the smallest member of the cetacean order that also includes whales, dolphins and other porpoises – gets trapped in the gillnets used by poachers.
Another species championed by DiCaprio, the Bolson tortoise (Gopherus flavomarginatus), also made the VGP sequencing list, as did two other critically endangered species (European eel and Smalltooth sawfish); seven endangered species (Blue whale, Grey crowned-crane, Green sea turtle, Atlantic halibut, Ring-tailed lemur, Chimpanzee and Golden arowana); and eight vulnerable species (Sterlet, Thorny skate, Siamese fighting fish, Abyssinian ground hornbill, Atlantic cod, European turtle dove, Marmoset monkey and Red-bellied piranha).
Origin of a Species
Beyond conservation, the genomes of some of the species may also shed light on fundamental processes of evolution. Often referred to as ‘living dinosaurs’, leatherback sea turtles (Dermochelys coriacea) are an ancient lineage that possesses unique physiological adaptations, including those that allow them to survive in cold waters and exploit habitats far beyond the range of many other ectotherms.
“Their populations having declined by greater than 90%, Pacific leatherbacks are one of eight species among the most at risk of extinction in the near future protected by the United States NOAA under the Endangered Species Act,” said Lisa M. Komoroske, Assistant Professor of Conservation Genomics & Ecophysiology at the University of Massachusetts, Amherst.
She will be using the VGP genome to study the remaining genetic diversity in the species and to inform new leatherback conservation initiatives, including translocation across oceans to enable genetic mixing with other populations to avoid excessive inbreeding.
Weird and Wondrous
Among the species studied are a few truly unique – and strange – creatures.
The Great Potoo (Nyctibius grandis) may have a silly name and an equally cartoonish look, with huge eyes and a ginormous mouth, but it is no joke. These birds are generally heard rather than seen. During the day, they remain motionless, mimicking broken tree branches. At night, they make unsettling sounds that haunt the Neotropics, holding their mouths open wide to catch passing insects and occasionally pouncing on other prey in quick sallies.
Ubiquitous but Useful
Other genomes, like the chicken, may seem mundane, but could prove vital to agriculture and biomedical research.
The most commonly studied avian genome in these areas, the chicken genome is getting an important upgrade as part of the project. It is one of 12 genomes that reflect the DNA of both parents. In a process called “trio binning”, the DNA of the parents is used to separate the DNA sequences of the child’s chromosomes, so that two genomes (one each from mother and father) can be assembled from one individual. Based on an assembly approach developed by Sergey Koren and Arang Rhie of the Adam Phillippy Lab at the National Human Genome Research Institute, these trio-based assemblies are 40-60% better than the non-trio based assemblies at separating out parentally-inherited DNA.
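To make the idea of trio binning more concrete, here is a minimal Python sketch. It is a toy illustration and not the VGP pipeline: it assumes that reads from the child are sorted by counting short subsequences (k-mers) that occur in only one parent, and it ignores practical concerns such as sequencing errors, reverse complements, and normalizing for the size of each parent's k-mer set. The function names and the tiny example sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a child read to a parent by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data: the parents differ at a single base (A vs. T).
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], K)

for read in ["CGTACGT", "CGTTCGT", "ACGT"]:
    print(read, bin_read(read, mat_only, pat_only, K))
```

In this sketch, each bin of reads would then be assembled separately, which is how a single individual can yield one genome per parent. Reads that carry no parent-specific k-mers stay unassigned.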
Approximately $600 million will be needed to complete the VGP project. Crowdsourcing among scientists has so far raised $4.8 million of the $6 million needed for Phase 1.
Get more information about the first 100 species.
An ambitious project to sequence 5,000 microbial genomes was jointly initiated by a consortium of 10 institutions across China, including Nankai University, China CDC, Academy of Military Medical Science, Third Institute of Oceanography-Ministry of Natural Resources, South China Sea Institute of Oceanology-CAS, China National Center for Food Safety Risk Assessment, Shandong University, Tianjin University of Science & Technology, East China University of Science and Technology, and Tianjin Biochip Corporation (TBC).
TBC, a PacBio service provider in China, has led the sequencing phase of the project, which is expected to be completed by the end of 2019. We recently sat down with Sun Yamin, general manager of TBC, to learn more about the project.
What’s the difference between the Prokaryotes 5,000 Complete Genomes Project (P5KCGP) and other microbial sequencing projects?
Previous microbial genome projects were scattered and typically based on one researcher’s own interests and directions. As a result, many common microbial species’ genomes have been sequenced repeatedly, while less commonly studied microbial species have still not been sequenced at all.
The current microbial genome database has an obvious species imbalance. Many microbial genomes have only low-quality genomic scaffolds. Our goal is to create a genomic database that covers a much broader array of microbial diversity, including pathogenic microorganisms, food safety microbes, marine microbes, and terrestrial resource microbes.
We expect to add at least 500 new microbial genomes that are currently not found in the NCBI database by the completion of the project. Our goal is to submit a high-quality, closed genome with no gaps for each of the 5,000 microbial genomes included in our project. To achieve this goal, we chose the PacBio Sequel System as our sequencing platform, because SMRT Sequencing technology combines long read lengths, high accuracy, and no GC content bias.
At present, only the Sequel System can meet our project requirements, given the challenges presented by many bacterial genomes. Using the latest version 3.0 reagents, the average read length of 22 kb on the Sequel System is sufficient to span repeats that can be more than a dozen kilobases in length in some bacterial genomes. In addition, we have seen GC content up to 70% in microbial samples we’ve sequenced. Even so, assembly can be accomplished easily with PacBio data.
What is the significance of the P5KCGP project?
While microbial genomes were the first to be sequenced by scientists, the sum of all microbial sequencing data worldwide is less than the amount of data produced by a single laboratory that performs human genome sequencing. Although the genomes of microorganisms are relatively small, their enormous species and functional diversity in nature means that microbial genomics deserves more attention than it has received. For pathogenic and foodborne microorganisms in particular, it is important to have reference-quality genomes.
What challenges has the P5KCGP project encountered?
1) Sample collection. On average, each partner needs to provide 400-500 microorganisms. Because our goal is to include bacterial species that are rare in nature, isolating and growing samples can take a long time.
2) Controlling costs. Generating closed microbial genomes requires more resources than simply coming up with a bunch of draft genomes. To manage sequencing costs, we have succeeded in multiplexing 16 microbial samples on each SMRT Cell 1M by optimizing the library preparation process.
3) Dealing with difficult-to-sequence microbes. The habitat of microorganisms in nature is diverse, and some live in extreme environments requiring quite high GC content in their genomes. Sequencing of such microbial genomes is more difficult.
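The cost point above comes down to simple arithmetic, sketched here in Python. This is a back-of-the-envelope estimate only: it assumes every sample yields a closed genome on its first attempt and that every SMRT Cell is filled with exactly 16 samples, neither of which is stated in the interview. The function name is hypothetical.

```python
import math


def smrt_cells_needed(n_genomes, samples_per_cell=16):
    """Minimum number of SMRT Cells, assuming no repeat runs."""
    return math.ceil(n_genomes / samples_per_cell)


# Whole project: 5,000 genomes at 16 samples per SMRT Cell 1M.
print(smrt_cells_needed(5000))  # 313

# Per partner: 400 to 500 microorganisms each.
print(smrt_cells_needed(400), smrt_cells_needed(500))  # 25 32
```

Without multiplexing, the same estimate would be one SMRT Cell per genome, or 5,000 cells, which shows why optimizing library preparation matters for a project of this size.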
What groundwork does this project lay for future research efforts?
We want to better understand how microbes that are widely distributed in nature have evolved and adapted to diverse environments, using the much more complete survey of microbial genomes made available through this sequencing project. In addition, some rare microorganisms living in extreme environments often have potential industrial value. Two examples sequenced through this project are the extremely acidophilic methanotroph isolate V4, Methylacidiphilum infernorum, and Geobacillus thermodenitrificans.
Patients with myotonic dystrophy type 1 (DM1) want to know their size — the size of the expansion of repeats of the unstable CTG sequences that cause the progressive deterioration of neuromuscular functions that they might face.
Size matters to them, because it has been found to correlate with the severity and onset of symptoms, which can range from severe cardiac and respiratory abnormalities and intellectual impairment in children, to muscle weakness, hypersomnolence or cataracts in adults. The earlier the onset, the more severe the symptoms tend to be. The autosomal disorder, which is the most common form of inherited muscular dystrophy in adults, also tends to get progressively worse with each generation. But the manifestations vary widely between patients, and even within families, making it extremely difficult to predict how it will affect any individual.
Stéphanie Tomé would like to arm genetic counselors with more information to help patients navigate through their difficult diagnoses and prognoses, and to inform their decisions about their own lives and those of their offspring. Ultimately, she would also like to be able to provide them with new options to manage or even alter their diseases.
To do so, she needs to be able to read the repeats, which can be encoded in sections as large as 3,000 triplets. So she has turned to PacBio SMRT Sequencing, which is capable of capturing sequences of long stretches of DNA, including complete regions of repeats found in patients with DM1 and other expansion disorders, such as Huntington’s Disease and Fragile X.
Tomé, an investigator at the Centre de Recherche en Myologie at Sorbonne Université/INSERM in Paris, is the winner of the 2019 Targeted Sequencing SMRT Grant. Along with the 10 other scientists in her research group, led by Geneviève Gourdon, and collaborators from around the world, Tomé will sequence sections of mutated genes in DM1 patients to determine the exact size and pattern of CTG repeats.
“We have some idea of what may be going on at either end of these regions, but we don’t have any information about what is happening in the middle,” Tomé said. “Improving our knowledge of the entire repeat sequence will help us make clearer correlations between the genetic instability and the clinical manifestations of DM1.”
Information generated in the project could also help researchers advance their understanding of some of the mechanisms behind the degenerative disorder.
If the disorder is characterized by long lengths of trinucleotides gone haywire, then it would be advantageous to be able to shrink the repeat regions back down to an asymptomatic size. Researchers have found cases where the regions have naturally contracted, and others where there are interruptions in the repeat codes.
Tomé and colleagues are pursuing this avenue of research, hoping to be able to harness knowledge about contractions and/or interruptions to induce them as a way to prevent and/or treat DM1 and other disorders. Tomé said drug screens on mouse models have already identified some potential compounds that could induce contraction, but they need to be tested and modified for use in humans.
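As a rough illustration of what "size and pattern" can mean once a full repeat region has been read, the Python sketch below counts CTG triplets and lists any interruptions. It is a simplified example and not the method used by Tomé's group: it assumes the input has already been trimmed to the repeat region and starts in frame with the first CTG, and it ignores sequencing errors. The function name and the example sequence are hypothetical. For scale, an expansion of 3,000 triplets corresponds to 9,000 bases.

```python
import re


def summarize_repeat(region):
    """Summarize a CTG repeat region that starts in frame.

    Returns the number of CTG triplets, the longest uninterrupted
    CTG run (in triplets), and a list of (index, triplet) pairs for
    every triplet that is not CTG.
    """
    region = region.upper()
    usable = len(region) - len(region) % 3  # drop a trailing partial triplet
    triplets = [region[i:i + 3] for i in range(0, usable, 3)]
    ctg_count = sum(t == "CTG" for t in triplets)
    interruptions = [(i, t) for i, t in enumerate(triplets) if t != "CTG"]
    longest_pure = max(
        (len(m.group()) // 3 for m in re.finditer(r"(?:CTG)+", region)),
        default=0,
    )
    return {
        "ctg_triplets": ctg_count,
        "longest_pure_run": longest_pure,
        "interruptions": interruptions,
    }


# Toy example: eight CTG triplets with one CCG interruption.
example = "CTG" * 5 + "CCG" + "CTG" * 3
print(summarize_repeat(example))
```

Comparing such summaries across family members, or across tissues from one patient, is one way contractions and interruptions might be spotted, although real data would need error-tolerant matching.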
Putting it in Perspective
Tomé admits that the data gathered from this project will likely not lead to immediate solutions, but it could provide some immediate relief to patients hungry for more insight into their disorder. And she hopes that SMRT Sequencing could become an alternative method of molecular diagnostics that improves the prognostic information and counseling offered to patients.
“Currently, the clinical labs tend to use Triplet Prime PCR. With this technique, we can say whether a patient is going to become sick or not sick, but it’s difficult to provide any sort of prognosis,” Tomé said. “To be able to give the patient more precise information, quickly, is very important, I think. Many patients are anxious, and don’t understand why there is so much variability between their son and daughter. They want to know.”
By collaborating with clinicians and a multidisciplinary group of 10 teams at Centre de Recherche en Myologie, Tomé embraces any opportunity to get different perspectives on the disorder, including the patient perspective.
“It’s very interesting to talk to the patients. By staying in the lab, you can lose sight of the bigger picture. By leaving the lab, you get new ideas, you learn more about what the problems are and what you might be able to do to improve the lives of patients,” Tomé said.
As the behavior of repeat regions appears similar between triplet diseases, Tomé said the project’s findings might also be applicable to 13 other expanded repeat disorders.
“This widens the potential impact of our study considerably,” she said.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor and Certified Service Provider, the McDonnell Genome Institute at Washington University in St. Louis, for supporting the 2019 Targeted Sequencing SMRT Grant Program.
The annual meeting of the European Society of Human Genetics — held last month in the sleek Swedish Exhibition & Congress Center in Gothenburg, Sweden — was a terrific assembly of thousands of scientists who are together pushing the boundaries of what’s possible in genome research. The PacBio team particularly enjoyed seeing so many impressive ESHG presentations with scientific results from SMRT Sequencing pipelines featuring applications such as de novo whole genome sequencing, structural variant detection, the Iso-Seq method, and targeted sequencing.
For example, in a plenary talk, the University of Southern California’s Mark Chaisson (@mjpchaisson) spoke about using long-read PacBio sequencing to analyze structural variation across human genomes. Representing the Human Genome Structural Variation Consortium, he talked about the growing number of available de novo sequenced human genomes, along with the need to characterize their complete universe of structural variants, many of which are missed in short-read assemblies.
Chaisson presented results from trio sequencing projects run by the consortium, showing that this approach allows for reliable and accurate phasing even of large structural variants, thanks to the use of long-read data. He noted that with the PacBio Sequel II System, it is now feasible to fully sequence a human genome in a single run. Chaisson concluded, “We are now in a realm where large scale human genome sequencing studies can be done using a long-read approach.”
Other great presentations came from Jozef Gecz at the University of Adelaide, who spoke about a repeat expansion associated with a heritable form of epilepsy, and Michael Talkowski of the Broad Institute, who presented on structural variant discovery and the use of sequencing systems for genomic medicine. There were also several posters from PacBio users with exciting results, such as a Swedish reference genome, clinical sequencing of the SMA gene, and amplification-free sequencing of a repeat expansion that causes corneal dystrophy.
Our own team presented posters at ESHG as well. Billy Rowell shared “Comprehensive Variant Detection in a Human Genome with Highly Accurate Long Reads” while Jenny Ekholm presented “Sequencing the Previously Unsequenceable Using Amplification-free Targeted Enrichment Powered by CRISPR/Cas9.”
We’d like to thank all of the scientists who checked out our posters or stopped by the PacBio booth to learn more about SMRT Sequencing applications and the new Sequel II System. We appreciate your time and interest!