As part of our effort to support the National Institutes of Health and the Genome Reference Consortium (GRC) in creating platinum genomes for the research community and improving the reference genome, in 2014 we generated 54X SMRT® Sequencing coverage of the CHM1 cell line, derived from a human haploid hydatidiform mole, using our P5-C3 chemistry, and made it publicly available through the SRA database at NCBI.
The CHM1 dataset was quickly taken up by researchers eager to use long, unbiased reads to identify regions of the genome prone to structural variation and to fill in sequence gaps in the GRC-maintained human genome reference. Mark Chaisson and Evan Eichler used PacBio® CHM1 data to resolve 26,079 euchromatic structural variants at the base-pair level, 85% of which were novel. Furthermore, they were able to close or extend 55% of the remaining gaps in GRCh37 [Chaisson et al. (2015) Resolving the complexity of the human genome using single molecule sequencing. Nature. 517, 608-611]. At the Advances in Genome Biology and Technology (AGBT) 2015 GRC workshop, Karyn Meltz Steinberg and Tina Graves-Lindsay from the McDonnell Genome Institute at Washington University presented the use of PacBio CHM1 data as part of GRC efforts to fill in gaps in GRCh38. During her talk, Graves-Lindsay presented a high-level comparison of several assemblies of the PacBio CHM1 data using a number of newly developed long-read assembly tools, including MHAP by Adam Phillippy, Dazzler by Gene Myers, and Falcon by Jason Chin.
As PacBio CEO Mike Hunkapiller was listening to the talks, he realized that by upgrading the dataset, he could not only support the community’s effort to create a high-quality haploid human genome assembly and improve the reference, but also foster innovative genome assembly tools. Jason Chin notes, “Right now there are many approaches to whole genome assembly which are similar but have subtle differences. We need to evaluate what methods are the best for moving the field forward. Having a common dataset is useful to compare methods.” Because the developer community seemed to have converged on this haploid cell line as a useful lingua franca for comparing different assembly pipelines, CHM1 data with the improved read length and accuracy of the newest P6-C4 chemistry would give bioinformaticians a new benchmarking opportunity, while advancing the goals of a platinum haploid genome assembly and resolving gaps and errors in the reference assembly.
Following up on Hunkapiller’s promise at AGBT, PacBio released a second CHM1 dataset to NCBI in September with ~60x coverage using P6-C4 chemistry. The dataset was generated with the new 30 kb sample prep protocol, and has a read length N50 of 19 kb. In the intervening months, several bioinformatics groups have been working with the new data and Chin has now uploaded his assembly results to NCBI to share with the community. The new assembly has a contig N50 of 26.9 Mb, with half of the genome contained within 30 contigs (contig L50). Regarding summary statistics, however, Chin emphasizes, “Genome assembly is a complex process, and no single statistic can sufficiently describe the results. Many different aspects of an assembly need to be evaluated to ensure high-quality results, including overall contiguity, completeness, the prevalence of mis-assemblies, and base-level accuracy. Releasing the whole assembly will allow all the experts within the community to fully understand the strengths and weaknesses of different approaches and determine how to move the field forward.”
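For readers less familiar with the contiguity statistics quoted above, here is a minimal Python sketch of how N50 and L50 are conventionally computed from a list of contig (or read) lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not part of Falcon or any PacBio tool, and the numbers are not taken from the CHM1 assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or read lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    bases; L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")

    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example with made-up lengths (hypothetical, not CHM1 data).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50, l50 = n50_l50(toy_contigs)
print(f"N50 = {n50}, L50 = {l50}")
```

In this toy case the three longest contigs (10, 9, and 8) already sum to half of the 54 total bases, so the N50 is 8 and the L50 is 3. As Chin's comment makes clear, these two numbers describe contiguity only and say nothing about mis-assemblies or base-level accuracy.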
Chin’s current assembly was created with Myers’ Daligner and the Falcon assembler developed at PacBio.
Figure 1. Jason Chin’s new CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with the longest contigs measuring 107 Mbp and 109 Mbp, respectively.
A highlight of the CHM1 assembly Chin submitted to NCBI is the near-complete assembly of the q arm of chromosome 6 in a single contig 109 Mbp long. Another contig of 107 Mbp spans more than two-thirds of the chromosome 2 q arm. Using the same publicly available dataset, Phillippy and Sergey Koren, now at NHGRI, are planning to submit their own CHM1 assembly to NCBI in the next month. This assembly will be generated using different assembly tools, namely the MHAP method developed by Konstantin Berlin and co-authors [Berlin, K. et al. (2015) Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Nature Biotech. 33, 623-630], paired with Celera® Assembler. We think it will be very useful for the community to have access to both assemblies to comment on the strengths and weaknesses of the different approaches, or to compare these assemblies to their own efforts. These two submissions can be seen as part of a communal work in progress toward finding the best and most general approaches to large genome assembly. In addition, we hope other researchers will be able to use this dataset to further their own assembler development work.
There are multiple ways to learn more about all the work being done with the updated CHM1 data during ASHG.
- Register to hear our workshop on Wednesday, October 7, from 1:00-2:30 PM EDT either in person or streaming, where Rick Wilson will highlight work he has done at the McDonnell Genome Institute developing high-quality references using both the CHM1 and CHM13 cell lines in a talk entitled “Of Reference Genomes and Precious Metals” (Sheraton Inner Harbor Hotel, Chesapeake Ballroom I/II/III, 3rd Floor).
- Attend the GRC workshop ahead of ASHG on Tuesday, October 6, from 1:00-4:00 PM (Convention Center, Room 349, Level 3).
- Attend the DNAnexus workshop on Thursday, October 8, from 1:00-2:30 PM (Convention Center, Room 345, Level 3), where Tina Graves-Lindsay will share her work combining PacBio and BioNano CHM1 and CHM13 data to generate assemblies with extremely high scaffold N50s.
- See Karyn Meltz Steinberg give a talk during the Platinum Genomes session on Friday, October 9, at 2:15 PM (Convention Center, Room 316, Level 3) entitled “Building a Platinum Assembly From Single Haplotype Human Genomes Generated From Long Molecule Sequencing,” in which she will present work resolving regions of the genome associated with large, repetitive sequences and exhibiting complex allelic diversity.