Diploid Assembly of Korean Genome Reveals Population-Specific Variation and Novel Sequence
Wednesday, October 5, 2016
In a paper published today in Nature, scientists from Seoul National University, Macrogen, and other institutions present a de novo genome assembly for a Korean individual. The team used SMRT Sequencing and other technologies to generate the assembly, fully phase all chromosomes, and perform detailed analyses of structural variation and other elements. In the process, they generated novel sequence that helps fill gaps in the human reference genome, continuing the trend of developing important new population-specific resources.
The work, reported in “De novo assembly and phasing of a Korean human genome,” was contributed by lead authors Jeong-Sun Seo, Arang Rhie, Junsoo Kim, and Sangjin Lee, senior author Changhoon Kim, and collaborators. The authors note that standard NGS approaches could not have produced the high-quality genomic resource they required. “Simple alignment of short reads to a reference genome cannot be used to investigate the full range of structural variation and phased diploid architecture, which are important for precision medicine,” they write. “By contrast, the single-molecule real-time (SMRT) sequencing platform produces long reads that can resolve repetitive structures effectively.”
For this effort, the scientists performed genome sequencing with PacBio technology and then integrated data from orthogonal platforms such as BioNano Genomics. SMRT Sequencing alone produced a highly accurate de novo assembly with 3,128 contigs and a contig N50 length of nearly 18 Mb. Combined with BioNano data and polished with Illumina sequence, the final assembly “is characterized by marked contiguity that has not been achieved by non-reference assemblies of the human diploid genome so far, and improves on the previous best N50 length by 18 Mb,” the scientists note. Ninety percent of the genome is covered in the largest 91 scaffolds.
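For readers unfamiliar with the contiguity statistics quoted above, the following is a minimal sketch of how N50 and a "largest scaffolds covering 90 percent" count are conventionally computed from a list of sequence lengths. The function names and the toy lengths are illustrative assumptions; they are not the authors' pipeline or data.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def count_to_cover(lengths, fraction):
    """Return how many of the largest sequences are needed to cover
    the given fraction (0-1) of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count
    return len(ordered)


# Hypothetical toy lengths (arbitrary units), not values from the paper.
toy_lengths = [40, 30, 15, 10, 5]
print(n50(toy_lengths))                  # 30
print(count_to_cover(toy_lengths, 0.9))  # 4
```

In the toy example, the two largest sequences (40 + 30) already hold more than half of the 100 total bases, so the N50 is 30, and four sequences are needed to reach 90 percent of the total.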
When compared with the human reference, GRCh38, the assembly closed 105 of the 190 remaining euchromatic gaps and extended into 72 more, adding about 1 Mb of novel sequence. “These locations, previously intractable using only short reads, commonly contained simple tandem repeats,” the authors report.
The contiguity of the assembly allowed the scientists to delve deeply into structural variation, identifying more than 18,000 variants, nearly 12,000 of which had never been reported before. “Of the new SVs, 86% were highly enriched for clusters of mobile and tandem repeats,” the team writes. An analysis of insertions found that almost half varied significantly in frequency across populations, while nearly 10 percent were specific to people of Asian descent. (This follows the pattern seen with other population-specific assemblies, such as the recently published Chinese genome.)
Finally, the scientists constructed separate assemblies for each haplotype to more accurately represent the diploid genome. To assess the results, they examined the HLA complex, finding that phasing had been successful despite a large amount of structural variation. “Our approach also allowed a clinically important duplication of CYP2D6 to be detected and assigned to one phase,” the scientists report. “This result demonstrates that de novo assembly-based phasing has advantages in resolving challenging hypervariable regions, and could be used further for pharmacogenomics.”
The scientists note that this work produced “the most contiguous diploid human genome assembly so far,” supporting the idea that integrating technologies leads to optimal results for detecting structural variants and other elements that have been impossible to resolve with short reads. They also remind the community that many more population-specific resources will be important for realizing the potential of genomics. “Our findings demonstrate the important genomic differences of Asian ancestral group from the others, and highlight the need for further genomic studies focused on individuals outside of European ancestry to describe the full range of functionally important variations in humans,” they write.