At Baltimore UGM, New Tools and Research Breakthroughs with SMRT Sequencing
Thursday, August 3, 2017
We were delighted to be back at the University of Maryland this summer for our annual East Coast User Group Meeting. The day-long event, preceded by half-day workshops on sample prep and bioinformatics, exceeded our expectations. From the packed session hall to the terrific science and great discussions, the UGM facilitated the exchange of best practices and new suggestions for optimizing SMRT Sequencing performance for a variety of applications. Below is a recap of the day’s highlights, with several of the presentations available to download.
PacBio scientist Aaron Wenger presented the Structural Variant Calling application that is included in the SMRT Link v5.0 software release. The application utilizes the read aligner NGM-LR, and features both a command-line tool called pbsv and a web interface. Noting that most of the genetic difference between any two people lies in structural variation, he showed that short-read sequencers cannot detect the vast majority of these important variants. Wenger demonstrated that even low-coverage SMRT Sequencing can be used to discover structural variants; in an experiment, 10-fold coverage revealed almost 100% of homozygous variants and nearly 90% of heterozygous variants in a human individual.
Michael Schatz from Johns Hopkins University gave a talk titled “In Pursuit of Perfect Genome Sequencing” in which he walked through three key metrics for evaluating genome quality: correctness (base-pair accuracy), completeness (no gaps in the sequence), and contiguity (sequence ordered as on the physical chromosomes). Schatz compared the leading sequencing technologies available today, and explained that PacBio SMRT Sequencing is the most capable technology for all three metrics.
Continuing the human genome theme, Ricardo Mouro Pinto from Massachusetts General Hospital spoke about using SMRT Sequencing to quantify CAG repeat instability in Huntington’s disease. Huntington’s is caused by a CAG repeat expansion and occurs when a person’s genome harbors 40 or more copies of the repeat. Pinto noted that typically, the longer the repeat, the younger the person is at disease onset. The Huntington’s locus is difficult to enrich because it is resistant to PCR amplification. By using Cas9 digestion to perform amplification-free target enrichment followed by PacBio sequencing, Pinto’s team was able to capture wild-type and disease alleles with no amplification bias. He noted that results are preliminary, and he hopes to expand the number of samples studied to get a better handle on CAG instability.
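For readers who want a feel for the counting step, here is a minimal Python sketch. It is not the pipeline Pinto's team used: the function names are hypothetical, and it assumes a read that already spans the repeat tract and contains no sequencing errors. It finds the longest uninterrupted run of CAG units and compares it with the 40-copy threshold mentioned above.

```python
import re

# Threshold taken from the talk recap: 40 or more CAG copies.
DISEASE_THRESHOLD = 40


def longest_cag_run(read: str) -> int:
    """Return the number of units in the longest uninterrupted CAG run."""
    # Each match is one uninterrupted stretch of back-to-back CAG units.
    runs = re.findall(r"(?:CAG)+", read.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_allele(read: str) -> str:
    """Label a read as 'expanded' or 'below threshold' (illustrative only)."""
    if longest_cag_run(read) >= DISEASE_THRESHOLD:
        return "expanded"
    return "below threshold"


if __name__ == "__main__":
    # Synthetic read: flanking bases around 42 CAG units (made-up sequence).
    example = "TTC" + "CAG" * 42 + "CCGCCA"
    print(longest_cag_run(example), classify_allele(example))
```

Real reads carry errors and interrupted repeats, so an actual analysis would need alignment-aware counting rather than an exact pattern match.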
Representing the plant community, Hamid Ashrafi and Hamed Bostan from North Carolina State University tag-teamed a presentation on the blueberry genome and transcriptome. The fruit plant naturally occurs in diploid, tetraploid, and hexaploid genomes. The scientists generated a high-quality diploid assembly using SMRT Sequencing and noted that long reads were essential to get through the highly repetitive genome. Next, they used Iso-Seq to study several types of tissue from diploid, tetraploid, and hexaploid blueberry plants, finding many transcripts missed by short-read sequence data. Using both genome and transcriptome approaches was particularly important, Ashrafi noted, because SNPs explain only a small portion of natural variation for this plant, and he believes that alternative splicing and structural variants likely contribute a much larger proportion of variation. The team is still analyzing results but said that switching to long reads was “a dream come true.”
On the microbial front, Jethro Johnson from The Jackson Laboratory for Genomic Medicine gave a talk on full-length 16S rRNA sequencing, which is useful for taxa identification. By genotyping or using short-read data, Johnson said, so much of the information in the variable regions of 16S is missed that it is often impossible to accurately classify organisms. So, Johnson turned to SMRT Sequencing and circular consensus sequencing (CCS), which generates highly accurate long reads. Johnson applied CCS to a mock bacterial community of 36 species and found that SMRT Sequencing offered accurate results for identification. In studies of fecal samples, PacBio sequencing was able to provide a unique identification in cases where short-read sequencing generated ambiguous results. The team is now expanding SMRT Sequencing results to include internal transcribed spacer regions.
In a separate presentation, Phillip Tai from the University of Massachusetts Medical School highlighted the use of long-read sequencing for genome population sequencing of adeno-associated viruses. These harmless viruses have gained new interest recently as a vector for gene therapies, so Tai’s lab is interested in analyzing large groups of them to filter out any that would not be ideal vectors. By applying SMRT Sequencing to recombinant AAVs, they generate complete resolution of the vector genome, including the difficult-to-sequence inverted terminal repeats. This accomplishment could have tremendous value in the gene therapy field, he said.
The meeting also included some new tools and protocols from the community. New England Biolabs’ Bo Yan presented SMRT-cappable-seq, a method for characterizing operons across an entire bacterial genome. It involves capping the 5’ end of bacterial primary transcripts and using SMRT Sequencing to produce full-length transcripts. Yan said the protocol increases library prep efficiency and accurately defines and links the transcription start site and transcription termination site (something short reads cannot do). A validation project in E. coli revealed 840 novel operons, extending 40% of annotated operons in RegulonDB. In another talk, Manuel Tardaguila from the University of Florida discussed SQANTI, a new tool to perform quality control for long-read transcripts. The pipeline performs classification, curation, and quantification of transcripts to filter out any artifacts and ensure that scientists analyze only the highest-quality results. SQANTI incorporates PacBio data, a reference genome, and other resources to conduct its rigorous evaluation.
We’d like to thank our hosts for the meeting, the Genomics Resource Center, Institute for Genome Sciences at the University of Maryland, as well as our partners: Advanced Analytical Technologies, Diagenode, and Sage Science. And, of course, thanks to all the scientists who took time out of their busy schedules to make this event a success!