Novel Workflow Produces Fully Phased Human Genome Assemblies Without Trio Sequencing
Friday, January 3, 2020
A new preprint from lead authors David Porubsky and Peter Ebert, senior authors Evan Eichler and Tobias Marschall (@tobiasmarschal), and collaborators reports a method for generating fully phased, de novo human genome assemblies without parental data. The approach combines PacBio HiFi reads (>99% accuracy, 10-20 kb) with the short-read, single-cell Strand-seq technique. The authors demonstrate proof of principle by assembling the genome of a Puerto Rican female from the 1000 Genomes Project.
The work extends a recent publication from many of the same authors in which HiFi reads were used to produce an accurate and contiguous assembly of the human haploid genome, CHM13. To help assemble a phased diploid genome, the newer work adds Strand-seq, “a single-cell sequencing method able to preserve structural contiguity of individual homologs in every single cell.” The authors used Strand-seq to group HiFi reads by chromosome, order and orient contigs, and phase variants over long genomic distances. “Taken together, these features make Strand-seq the method of choice to be combined with high-accuracy long-read sequencing platforms to physically phase and assemble diploid genomes.”
The team generated 33.4-fold HiFi read coverage of the selected sample using the Sequel II System. They called single nucleotide variants in the HiFi reads with DeepVariant and phased variants using Strand-seq and HiFi reads. That “resulted in chromosome-length haplotypes with >95% … of all these heterozygous variants placed into a single haplotype block,” the scientists report. “With such global and complete haplotypes we assigned ~81% of the original PacBio HiFi reads to either parental haplotype 1 (H1) or haplotype 2 (H2).”
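To make the read-assignment step more concrete, here is a minimal sketch of how a read might be sorted into H1 or H2 once phased heterozygous variants are available. It is an illustration under simplifying assumptions, not the authors' pipeline: the function name `assign_haplotype`, the dictionary inputs, and the `min_margin` threshold are all hypothetical, and the rule used is a plain majority vote over the phased sites a read covers. Reads that cover no informative site, or that tie, stay unassigned, which is one reason a fraction of reads cannot be placed in either haplotype.

```python
from collections import Counter


def assign_haplotype(read_alleles, phased_variants, min_margin=1):
    """Return "H1", "H2", or None for a single read (hypothetical helper).

    read_alleles:    {position: base observed in the read}
    phased_variants: {position: (H1 allele, H2 allele)} for heterozygous sites
    min_margin:      minimum vote difference needed to make a call (assumed)
    """
    votes = Counter()
    for pos, base in read_alleles.items():
        if pos not in phased_variants:
            continue  # site is not a phased heterozygous variant
        h1_allele, h2_allele = phased_variants[pos]
        if base == h1_allele:
            votes["H1"] += 1
        elif base == h2_allele:
            votes["H2"] += 1
        # a base matching neither allele is treated as uninformative

    if abs(votes["H1"] - votes["H2"]) < min_margin:
        return None  # no informative sites, or a tie
    return "H1" if votes["H1"] > votes["H2"] else "H2"


# Toy example with made-up positions and alleles.
phased = {100: ("A", "G"), 250: ("C", "T"), 900: ("G", "A")}
print(assign_haplotype({100: "A", 250: "C"}, phased))  # H1
print(assign_haplotype({500: "A"}, phased))            # None
```

Real haplotype-tagging tools work from alignments and weigh base qualities; the vote above only conveys the idea.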
The team then used two tools, Canu and Peregrine, to assemble the haplotype-separated reads. A small number of chimeric contigs were corrected with Strand-seq data and the SaaRclust algorithm. The final contig N50s of the fully phased assemblies were 25.8 Mb and 28.9 Mb for the two haplotypes. The assemblies were highly accurate, with base-pair quality scores above QV40; nearly all gene-disrupting indels in the sequence were found to be true biological events, not assembly artifacts. By titrating HiFi read coverage, the authors found that around 15-fold coverage of each haplotype is sufficient to produce an accurate, contiguous assembly.
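For readers less familiar with the two metrics quoted above, the following short sketch shows how they are conventionally defined. Contig N50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the total assembly length. QV is a Phred-scaled score, so QV40 corresponds to about one error per 10,000 bases. The function names and toy contig lengths are illustrative only and are not taken from the preprint.

```python
def n50(lengths):
    """Contig N50: walk contigs from longest to shortest and return the
    length at which the running total reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy contig lengths (arbitrary units), not real assembly data.
print(n50([2, 3, 4, 5, 6]))         # 5
print(1 / qv_to_error_rate(40))     # roughly 10000 bases per error
```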
“Our assembly strategies allow us to transition from ‘collapsed’ human assemblies of ~3 Gbp to fully phased assemblies of ~6 Gbp where all genetic variants, including [structural variants], are fully phased at the haplotype level,” the scientists report. Beyond the method's value for assembling individual genomes, the authors note, “Fully phased, reference-free genomes are also the first step in constructing comprehensive human pangenome references that aim to reflect the full range of human genome variation.”