April 21, 2020

A robust benchmark for germline structural variant detection

Authors: Zook, Justin M. and Hansen, Nancy F. and Olson, Nathan D. and Chapman, Lesley M. and Mullikin, James C. and Xiao, Chunlin and Sherry, Stephen and Koren, Sergey and Phillippy, Adam M. and Boutros, Paul C. and Sahraeian, Sayed Mohammad E. and Huang, Vincent and Rouette, Alexandre and Alexander, Noah and Mason, Christopher E. and Hajirasouliha, Iman and Ricketts, Camir and Lee, Joyce and Tearle, Rick and Fiddes, Ian T. and Barrio, Alvaro Martinez and Wala, Jeremiah and Carroll, Andrew and Ghaffari, Noushin and Rodriguez, Oscar L. and Bashir, Ali and Jackman, Shaun and Farrell, John J and Wenger, Aaron M and Alkan, Can and Soylev, Arda and Schatz, Michael C. and Garg, Shilpa and Church, George and Marschall, Tobias and Chen, Ken and Fan, Xian and English, Adam C. and Rosenfeld, Jeffrey A. and Zhou, Weichen and Mills, Ryan E. and Sage, Jay M. and Davis, Jennifer R. and Kaiser, Michael D. and Oliver, John S. and Catalano, Anthony P. and Chaisson, Mark JP and Spies, Noah and Sedlazeck, Fritz J. and Salit, Marc

New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90 % of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3 %, and genotype concordance with manual curation was >98.7 %. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.

Journal: BioRxiv
DOI: 10.1101/664623
Year: 2019
