GIAB Expands Variant Call Sets with SMRT Sequencing Results
Wednesday, March 6, 2019
You may have missed last week’s Advances in Genome Biology & Technology conference in sunny Marco Island, Fla., but you definitely shouldn’t miss the two posters presented there by Justin Zook and Justin Wagner from NIST’s Genome in a Bottle (GIAB) consortium.
The GIAB team has made critical progress in generating high-quality human genome reference materials and benchmarks that have helped to improve the accuracy and reproducibility of variant calling across laboratories. The latest results advance that work by expanding the benchmark set to include additional small variants (single-nucleotide variants and indels) and, for the first time, large structural variants.
The poster on structural variants (“A new benchmark for human germline structural variant calls”) describes a benchmark set of 11,869 insertions and deletions ≥50 bp, along with corresponding benchmark regions that span 2.69 Gb (89%) of the human genome. The set is derived from multiple technologies and is validated by manual curation and by consistent inheritance in a mother-father-son trio. Used with comparison tools like Truvari, the benchmark set provides a direct measure of false positives and false negatives in individual variant callsets. That measure should help developers improve structural variant calling software, just as the small variant benchmark did for single-nucleotide variants and indels.
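To make the idea of measuring false positives and false negatives concrete, here is a minimal Python sketch of comparing a callset against a benchmark set. It is not Truvari's algorithm or GIAB's pipeline. The `SV` record, the `is_match` and `compare` functions, and the `max_dist` and `min_size_ratio` thresholds are all hypothetical names and values chosen for illustration, and the sketch assumes both lists have already been restricted to the benchmark regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """A simplified structural variant record (hypothetical, for illustration)."""
    chrom: str
    pos: int
    svtype: str  # "INS" or "DEL"
    size: int    # variant length in bp, assumed > 0


def is_match(call, truth, max_dist=500, min_size_ratio=0.7):
    """Treat two SVs as the same event if they share chromosome and type,
    start within max_dist bp of each other, and have similar sizes.
    Both thresholds are illustrative assumptions, not published defaults."""
    if call.chrom != truth.chrom or call.svtype != truth.svtype:
        return False
    if abs(call.pos - truth.pos) > max_dist:
        return False
    small, large = sorted((call.size, truth.size))
    return small / large >= min_size_ratio


def compare(calls, benchmark, **thresholds):
    """Count true positives, false positives, and false negatives.
    Each benchmark variant can be matched by at most one call."""
    unmatched = list(benchmark)
    tp = fp = 0
    for call in calls:
        hit = next((t for t in unmatched if is_match(call, t, **thresholds)), None)
        if hit is None:
            fp += 1          # called, but not in the benchmark
        else:
            tp += 1          # called and in the benchmark
            unmatched.remove(hit)
    fn = len(unmatched)      # in the benchmark, but never called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy data, not real GIAB variants.
    benchmark = [SV("1", 1000, "DEL", 100), SV("1", 5000, "INS", 200)]
    calls = [SV("1", 1010, "DEL", 95), SV("2", 300, "DEL", 60)]
    print(compare(calls, benchmark))
```

In this toy example, one call matches a benchmark deletion, one call has no benchmark counterpart, and one benchmark insertion is missed. Real comparison tools work from VCF files and handle details this sketch ignores, such as sequence similarity, genotype, and multiple candidate matches.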
In a second poster (“Expanding the Genome in a Bottle benchmark callsets with high-confidence small variant calls from long and linked read sequencing technologies”), the GIAB team discusses expanding and improving their small variant benchmark set by integrating linked- and long-read technologies, including PacBio circular consensus sequencing (CCS) reads. CCS reads — described in a recent preprint — have similar base accuracy to typical NGS reads but are much longer, and thus map unambiguously to repetitive or low-complexity regions of the genome that are not accessible with short NGS reads.
Integrating linked reads and PacBio CCS reads expands the region over which GIAB can confidently call variants by more than 84 Mb (>3%) and adds 156,000 variants to the benchmark (>4%), “mostly in regions difficult to map with short reads,” the authors report. Within a list of medically relevant genes, the new benchmark adds 418 variants, an increase of about 5%.
For more information about GIAB, check out the public workshop being held March 28-29 at Stanford University. As indicated on both posters, the GIAB team welcomes new collaborators interested in the accurate and complete characterization of human genomes.