For Reference-Grade Human Genome Assemblies, SMRT Sequencing Yields Optimal Results
Thursday, June 14, 2018
SMRT Sequencing is a go-to technology for generating reference-grade human genome assemblies, according to speakers in a recent webinar. In their presentations, Tina Graves-Lindsay from Washington University and Adam Ameur from Uppsala University spoke about diploid assemblies, discovering novel sequence, improving diversity of the current human reference genome, and much more. Finally, our own Paul Peluso gave a presentation that included the technology roadmap showing the next several upgrades for the Sequel System.
Graves-Lindsay began with efforts from the Genome Reference Consortium to “represent the full range of genetic diversity in humans,” a task requiring the generation of many population-specific references. She presented data from two haploid and 13 diploid genomes produced so far, and noted that two others are underway. For each reference, the scientists generate ~60-fold WGS coverage with PacBio, then assemble with FALCON. To assist with assembly QC and scaffolding, they merge the resulting sequence contigs with data from orthogonal long-range technologies such as Bionano Genomics or 10x Genomics. The approach has yielded impressive results: three of the 13 reference genomes achieved chromosome-level assembly, and the highest contig N50 reached 26 Mb. To highlight the value of population-specific reference genomes, Graves-Lindsay offered some examples of regions that are not yet represented in the current human reference (GRCh38 build), such as a 65 kb insertion found in a Yoruban assembly. To further resolve the diploid genome assemblies, her team is running FALCON-Unzip to generate haplotype-resolved contigs. These haplotigs represent the maternal and paternal haplotypes of each genome separately, rather than as a single collapsed contig sequence, and will serve as an allele-specific reference for the populations they represent.
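For readers less familiar with the contig N50 metric cited above, here is a minimal sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. The function name `contig_n50` and the toy lengths are illustrative assumptions, not data or code from the webinar.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = contig_n50(toy_lengths)
print(toy_n50)
```

Because the metric depends only on the length distribution, a higher N50 indicates that more of the assembly sits in long contigs; it says nothing by itself about correctness, which is one reason orthogonal data are used for QC.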
Ameur’s talk focused on a project that came out of SweGen, a population sequencing effort that covered 1,000 individuals in Sweden. His team chose two participants, one male and one female, and used SMRT Sequencing to produce reference-grade assemblies for each. They generated 75-fold WGS coverage for each individual, and combined PacBio assembled contigs with Bionano optical maps to produce highly contiguous genomes. By comparing the new assemblies to the initial SweGen short-read results, Ameur found that a large proportion of the 20,000 structural variants detected in each reference assembly were missed by short-read sequencing. The new assemblies also included a total of 24 Mb of novel genome sequence not represented in GRCh38; the vast majority of that sequence came from repetitive regions 5 kb or longer. While about 30% of the novel sequence had no hits in NCBI, the nearly 70% remaining did match existing sequences, leading Ameur to suspect that at least some of those sequences had been mis-annotated because they were not found in the human reference. Now, his team is going back to the original SweGen short-read WGS data and aligning it against the new reference genomes, which is helping to improve variant detection in the Swedish population, resolve false-positive SNPs, and improve alignment in some coding regions.
The webinar’s final presentation came from Peluso, who offered a quick overview of the features of SMRT Sequencing and its growing use for high-quality assemblies. Of the 65 human assemblies most recently submitted to NCBI, 90% of those with a contig N50 greater than 1 Mb were generated with PacBio data. Ongoing population studies and reference genome projects aim to use SMRT Sequencing on more than 2,400 human genomes globally. Peluso also presented data from the recent effort to sequence a Puerto Rican genome, HG00733, which used the latest advances for the Sequel System (v2.1 chemistry and 5.1 software). The SMRTbell Express Template Prep Kit allowed for faster sample prep and better yield, leading to libraries that generated more than 50% of data in reads longer than 33 kb and a contig N50 of 31.4 Mb. Average output per SMRT Cell was 10 Gb. The new assembly compared favorably to the Sanger-assembled GRCh38.p12, with fewer contigs (982 vs. 1536) and a smaller but comparable contig N50 (31.4 Mb vs. 56.4 Mb). Peluso described cost efficiencies using the latest Sequel System improvements for de novo assembly, noting that “the original human reference genome cost $3 billion, and today you can characterize a single human genome with PacBio for around $3,000 (1/1-millionth the cost), and build a reference-quality genome de novo for around $20,000.”
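The figures above lend themselves to a quick back-of-the-envelope check. The sketch below assumes a human genome size of roughly 3.1 Gb (an assumption, not a number from the webinar) and pairs the ~60-fold coverage from Graves-Lindsay's workflow with the 10 Gb average output per SMRT Cell that Peluso reported; the helper name `smrt_cells_needed` is hypothetical, and real projects will vary with library quality and run conditions. It also confirms the cost ratio in Peluso's quote.

```python
import math

# Assumption: approximate human genome size in gigabases.
GENOME_SIZE_GB = 3.1


def smrt_cells_needed(fold_coverage, output_per_cell_gb,
                      genome_size_gb=GENOME_SIZE_GB):
    """Estimate SMRT Cells required to reach a target fold coverage."""
    total_gb = fold_coverage * genome_size_gb
    # Round up: a partial cell still means running a whole cell.
    return math.ceil(total_gb / output_per_cell_gb)


# ~60-fold coverage at the 10 Gb average output cited in the talk.
cells_60x = smrt_cells_needed(60, 10)

# Ratio between the quoted $3 billion and $3,000 price points.
cost_ratio = 3_000_000_000 / 3_000

print(cells_60x, cost_ratio)
```

Under these assumptions, a 60-fold data set works out to about 19 SMRT Cells, and the quoted price points differ by a factor of one million, matching the "1/1-millionth" figure.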
Peluso also announced the availability of FALCON-Phase, an improved phasing assembly tool that incorporates long-range Hi-C data and can be found on GitHub. Looking ahead, he said that simplified library prep is on the roadmap for midyear, with a chemistry update to improve accuracy and yield slated for release in late 2018. Next year, a new SMRT Cell 8M is expected to expand yield and reduce costs significantly.
The event concluded with an audience Q&A covering details about alignment stringency, shared structural variants across the Swedish population, decoy sequences, and more. If you missed the live webinar, watch the recording any time.