Hundreds of SMRT scientists came together recently in Leiden to learn about the latest updates to PacBio technology and to showcase their data analysis tools. Extremely useful information was shared, and future collaborations were sparked. For those who weren’t able to jet to the Netherlands to attend, we’ve rounded up the top tools and tips presented at the European SMRT Informatics Developers Meeting. For an in-depth report on the event, check out this blog post by PacBio Principal Scientist Elizabeth Tseng.
- SMRT Link – Of course, our own open-source SMRT analysis software suite tops the list. Recent updates bring many improvements: 8x faster time-to-results for CCS generation; 20x faster mapping with minimap2 through our own wrapper, pbmm2; important improvements to CCS to support PacBio’s HiFi data type; detection of more types of structural variants; increased automation; and PDF reports.
- Bioconda – Want to be the first to try out new and improved analysis tools? Many updates to PacBio algorithms, assembly packages, and other tools are available on Bioconda before their official release, including the latest Sequel II System changes.
- pbsv – Our structural variant (SV) calling and analysis tool has also been updated. What’s new? An increase in sensitivity for large insertions and deletions, and calling of duplications and copy number variation… meaning that pbsv now calls all major SV types 20 bp and longer.
- DAZZLER Suite – Need to find all significant local alignments between reads? Or to remove chimeras, adaptamers, and low-quality dropouts? Da’ Gene Myers (@TheGeneMyers) has an app for that. Or several, actually, including DALIGNER, DASCRUBBER and DAMASKER. Myers announced he has updated the suite to better support highly accurate, long HiFi reads.
- PRINCESS – Prolific toolmaker Fritz Sedlazeck (@sedlazeck), creator of the SV caller Sniffles, unveiled his work-in-progress, PRINCESS, a Snakemake pipeline to call and phase SNPs and SVs. Keep an eye on his GitHub site to snag it when it drops.
- TAMA – The all-in-one Transcriptome Annotation by Modular Algorithms tool by Iso-Seq expert Richard Kuo (@GenomeRik) can do many things, including mapping RNA reads to a transcript annotation, merging annotations (for example, combining PacBio data with references such as Ensembl), identifying coding regions, and associating them with known genes.
- SQANTI – This quality control pipeline by Ana Conesa (@anaconesa) categorizes Iso-Seq data against a reference annotation. It lets users see which genes and transcripts are novel or known, and offers detailed annotations of canonical and non-canonical junctions. SQANTI2, by Elizabeth Tseng (@magdoll), is a modified version of SQANTI.
- TAPPAS – A Java-based application, also by Ana Conesa, that creates beautiful visualizations utilizing information at both the transcript and protein level. It can identify differential expression at both the isoform level and the gene level.
- pyPaSWAS – A program for DNA/RNA/protein sequence alignment, read mapping and trimming, by Sven Warris (@swarris).
- WhatsHap – Software from Tobias Marschall (@tobiasmarschal) for read-backed phasing of variants. Jana Ebler discussed an extension to WhatsHap to simultaneously call and phase variants in long reads.
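To make the idea of read-backed phasing concrete, here is a minimal Python sketch. It is a toy illustration only and not how WhatsHap works internally: the function name `phase_pair` and the read representation (a dict mapping a variant position to the observed allele, 0 for reference and 1 for alternate) are assumptions made for this example, and the reads are invented.

```python
from collections import Counter

def phase_pair(reads, pos_a, pos_b):
    """Decide whether the alt alleles at two heterozygous sites sit on the
    same haplotype ("cis") or on opposite haplotypes ("trans").

    reads: list of dicts mapping variant position -> allele (0 = ref, 1 = alt).
    Only reads that cover both sites carry phasing information.
    """
    votes = Counter()
    for read in reads:
        if pos_a in read and pos_b in read:
            # Same allele at both sites supports cis; different alleles support trans.
            votes["cis" if read[pos_a] == read[pos_b] else "trans"] += 1
    if votes["cis"] == votes["trans"]:
        return "unphased"  # no spanning reads, or a tie
    return "cis" if votes["cis"] > votes["trans"] else "trans"

# Hypothetical long reads covering two heterozygous SNPs at positions 100 and 5000.
reads = [
    {100: 1, 5000: 1},
    {100: 0, 5000: 0},
    {100: 1, 5000: 1},
    {100: 1, 5000: 0},  # a read carrying a sequencing error
    {100: 0},           # covers only one site, so it cannot vote
]
print(phase_pair(reads, 100, 5000))
```

Three of the four spanning reads agree, so the sketch reports the two alternate alleles as being in cis. Long reads help here because a single read is more likely to span both sites.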
BONUS:
Variant callers are not all the same – in fact, there are times when their algorithms don’t agree. So, what do you do? Ryan E. Mills (@ryan_e_mills), an assistant professor at the University of Michigan, laid out the problem, and two of his solutions, in a presentation at the Labroots Genetics and Genomics conference:
- VaPoR – A structural variant validator that uses a dotplot of PacBio reads against the reference genome to visualize and automatically score candidates for patterns that suggest deletions, insertions, tandem duplications or inversions.
- PALMER – The Pre-mAsking Long reads for Mobile Element InseRtion tool detects non-reference MEI events (LINE, Alu and SVA) and other insertions, taking indexed, reference-aligned BAM files from long-read sequencing as input. It uses the RepeatMasker track to mask the portions of reads that aligned to these repeats, defines the significant characteristics of MEIs (TSD motifs, 5′ inverted sequence, 3′ transduction sequence, polyA-tail), and reports sequences for each insertion event.
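For readers unfamiliar with dotplots, the idea behind the VaPoR entry above can be sketched in a few lines of Python. This is a simplified k-mer dotplot, not VaPoR's implementation or scoring: the function names `kmer_dotplot` and `diagonal_offsets` and the sequences are hypothetical, chosen only to show that a read matching the reference forms one diagonal, while a deletion in the read makes the matches jump to a parallel diagonal.

```python
def kmer_dotplot(read, ref, k=4):
    """Return (read_pos, ref_pos) pairs where the same k-mer occurs in both."""
    # Index every k-mer position in the reference.
    index = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j + k], []).append(j)
    # Look up each read k-mer and record the matching coordinates.
    dots = []
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], []):
            dots.append((i, j))
    return dots

def diagonal_offsets(dots):
    """Sorted distinct (ref_pos - read_pos) offsets; one offset means one clean diagonal."""
    return sorted({j - i for i, j in dots})

# Hypothetical sequences: the read lacks the 6 bp "GGGGGG" present in the reference.
ref = "ACGTTGCAGGGGGGTCAGATCC"
read = "ACGTTGCATCAGATCC"
print(diagonal_offsets(kmer_dotplot(read, ref)))  # prints [0, 6]
```

The two offsets, 0 and 6, correspond to the matches before and after the missing 6 bp, which is the kind of pattern a dotplot-based validator can look for when assessing a candidate deletion.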