Haplotype-resolved diverse human genomes and integrated analysis of structural variation

Authors: Ebert, Peter and Audano, Peter A. and Zhu, Qihui and Rodriguez-Martin, Bernardo and Porubsky, David and Bonder, Marc Jan and Sulovari, Arvis and Ebler, Jana and Zhou, Weichen and Serra Mari, Rebecca and Yilmaz, Feyza and Zhao, Xuefang and Hsieh, PingHsun and Lee, Joyce and Kumar, Sushant and Lin, Jiadong and Rausch, Tobias and Chen, Yu and Ren, Jingwen and Santamarina, Martin and Höps, Wolfram and Ashraf, Hufsah and Chuang, Nelson T. and Yang, Xiaofei and Munson, Katherine M. and Lewis, Alexandra P. and Fairley, Susan and Tallon, Luke J. and Clarke, Wayne E. and Basile, Anna O. and Byrska-Bishop, Marta and Corvelo, André and Evani, Uday S. and Lu, Tsung-Yu and Chaisson, Mark J.P. and Chen, Junjie and Li, Chong and Brand, Harrison and Wenger, Aaron M. and Ghareghani, Maryam and Harvey, William T. and Raeder, Benjamin and Hasenfeld, Patrick and Regier, Allison A. and Abel, Haley J. and Hall, Ira M. and Flicek, Paul and Stegle, Oliver and Gerstein, Mark B. and Tubio, Jose M.C. and Mu, Zepeng and Li, Yang I. and Shi, Xinghua and Hastie, Alex R. and Ye, Kai and Chong, Zechen and Sanders, Ashley D. and Zody, Michael C. and Talkowski, Michael E. and Mills, Ryan E. and Devine, Scott E. and Lee, Charles and Korbel, Jan O. and Marschall, Tobias and Eichler, Evan E.

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent–child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic variation even across complex loci. We identify 107,590 structural variants (SVs), of which 68% are not discovered by short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterize 130 of the most active mobile element source elements and find that 63% of all SVs arise by homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.

Journal: Science
DOI: 10.1126/science.abf7117
Year: 2021

