Data Release: Highest-Quality, Most Contiguous Individual Human Genome Assembly to Date
Monday, October 8, 2018
We’re proud to announce the release of the most contiguous diploid human genome assembly of a single individual to date, representing the nearly complete DNA sequence from all 46 chromosomes inherited from both parents. The sample used was derived from a Puerto Rican female who has been included in population genetics studies such as the 1000 Genomes Project. The phased diploid assembly will give unprecedented views of population-specific variation through the long-range resolution of maternal and paternal haplotypes.
This work is part of a larger effort in the field of personalized medicine and human genomics to add ethnic diversity to the available human reference genomes. More than 40 global initiatives are currently underway to apply de novo assembly methods to individuals representing multiple ethnic populations. Notable among these initiatives is the McDonnell Genome Institute at Washington University, which has contributed 11 high-quality PacBio genomes for individuals representing populations from Africa, Asia, Europe, and the Americas.
Our approach to the Puerto Rican genome relied upon the current best practices for de novo assembly while also pushing read lengths ever longer and adding new methods and data types to better tackle the problem of diploid genome assembly.
The Puerto Rican sample was sequenced on the Sequel System with 2.1 chemistry and v5.1 software using a large-insert library aggressively size-selected to 35 kb. The resulting contig assembly, totaling 2.89 Gb, has the highest contiguity to date, with half of the genome contained in gapless contigs longer than 27 Mb (that is, a contig N50 above 27 Mb). These results are even better than the consistently stellar assemblies the McDonnell Genome Institute (MGI) has been producing, which typically have contig N50s of 20-25 Mb.
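For readers unfamiliar with the contiguity metric quoted above, the following is a minimal sketch of how a contig N50 can be computed from a list of contig lengths. The function name n50 and the toy lengths are illustrative assumptions; they are not the actual contig lengths of this assembly, nor the tooling used to evaluate it.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half")


# Toy contig lengths in Mb (hypothetical values, for illustration only).
toy_lengths_mb = [40, 30, 27, 20, 10, 5, 3]
toy_n50 = n50(toy_lengths_mb)
print(f"Toy contig N50: {toy_n50} Mb")
```

In the toy example the total is 135 Mb, so the two longest contigs (40 + 30 = 70 Mb) already cover half of it and the N50 is 30 Mb. Applied to real contig lengths in base pairs, the same calculation gives the kind of figure reported here.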
Like the MGI genomes, the new Puerto Rican genome was assembled using FALCON, but with a newer version of FALCON-Unzip that includes algorithmic improvements to phasing and accuracy. Nearly 85% of the genome was resolved as maternal and paternal haplotypes, with more than 600 Mb of sequence in haplotype blocks longer than 1 Mb. An analysis of variants within phase blocks indicates high accuracy, with 95% of SNPs showing concordant inheritance from a single parent.
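The post does not describe how the 95% concordance figure was computed. As one plausible reading, the sketch below assumes each SNP in a phase block has already been labeled with its parent of origin ("M" or "P") and counts a SNP as concordant when it matches the majority parent of its block. The function name, the labeling scheme, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def phase_concordance(blocks: Sequence[Sequence[str]]) -> float:
    """Fraction of SNPs that agree with the majority parent of their phase block.

    Each block is a sequence of parent-of-origin labels ("M" or "P"),
    one per SNP. This is an assumed definition, not the published method.
    """
    concordant = 0
    total = 0
    for block in blocks:
        if not block:
            continue  # skip blocks with no informative SNPs
        counts = Counter(block)
        concordant += max(counts.values())  # SNPs matching the majority parent
        total += len(block)
    if total == 0:
        raise ValueError("no SNPs to evaluate")
    return concordant / total


# Toy data (hypothetical): one block with a single discordant SNP, one clean block.
toy_blocks = [list("MMMMMMMMMP"), list("PPPPPPPPPP")]
toy_concordance = phase_concordance(toy_blocks)
print(f"Toy concordance: {toy_concordance:.2%}")
```

A low value under this definition would suggest phase switches within blocks; a value close to 1 suggests each block is drawn almost entirely from a single parental haplotype.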
In addition to the improvements in PacBio’s FALCON-Unzip assembler, the Puerto Rican assembly includes the novel use of Hi-C data to extend phasing between haplotype blocks. In collaboration with Phase Genomics, PacBio developed a new method for enhanced phasing that does not rely on family trio data. The new method, called FALCON-Phase, maps ultra-long-range Hi-C reads to the FALCON-Unzip contigs to extend phasing to the contig scale. The Hi-C data was also used to scaffold the phased contigs, after which another round of phasing was performed on the scaffolds.
The resulting assembly consists of 46 chromosome-scale scaffolds, representing the maternal and paternal chromosome sets for the Puerto Rican individual. Each set of 23 scaffolds contains only 511 gaps and totals 2.83 Gb in length. The remainder of each haploid genome is contained in 260 smaller scaffolds totaling 63 Mb.
Genome: https://www.ncbi.nlm.nih.gov/genome/?term=RBJD00000000 (currently not live)