With Greater Contiguity, New Gorilla Genome Assembly Offers Insights into Gene Content, SVs, and More
Thursday, March 31, 2016
In a Science paper published today, scientists from the University of Washington, the McDonnell Genome Institute, and other organizations present a new gorilla genome assembly generated with PacBio long-read sequencing, representing a more than 150-fold improvement in contiguity over previous assemblies.
From lead authors David Gordon, John Huddleston, Mark Chaisson, and Christopher Hill, and senior author Evan Eichler, the paper reports that the new assembly recovers nearly all reference exons missing from the previous assembly, and provides an unprecedented look at structural variation, genetic diversity, ancestral evolution, repeat structures, and more.
The project was launched to address shortcomings of the existing gorilla assembly, which was built with short-read and Sanger sequencing data. While short-read sequencing has been instrumental for genomics, the authors write, “assemblies have become increasingly more incomplete and fragmented in large part because the underlying sequence reads are too short (<200 bp) to traverse complex repeat structures. This has led to incomplete gene models, less accurate representation of repeats, and biases in our understanding of genome biology.” The previous gorilla assembly was highly fragmented, with more than 400,000 gaps, and had been assembled using the human genome as a guiding reference.
The team applied SMRT Sequencing to a western lowland gorilla named Susie, then assembled the reads with FALCON and polished the result with Quiver. The resulting assembly is 3.1 Gb in size, with a contig N50 length of 9.6 Mb. It closes 93% of the gaps in the previous assembly, many of which are characterized by GC-rich content, and provides at least 148 Mb of additional euchromatic sequence.
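For readers less familiar with the metric, contig N50 is the contig length at which contigs of that length or longer account for at least half of the assembled bases. The sketch below shows one common way to compute it in Python. The function name n50 and the toy contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases until
    # the running sum reaches half of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units), not real data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: the two longest contigs cover 150 of 300 bases
```

A higher N50 means more of the genome sits in long, uninterrupted contigs, which is why the figure is often quoted as a shorthand for contiguity.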
The scientists incorporated additional genome data from six gorillas, generating a reference genome called Susie3. A gene content analysis determined that nearly 95% of RefSeq exons missing from the original assembly were recovered in this assembly, and that 96% of previously incomplete gorilla genes were represented in at least one isoform. They also looked at structural variation, finding that 86% of the indels and inversion variants detected had never been seen before. “These analyses provide a comprehensive catalog of mobile element differences between human and gorilla (24.1% of all structural variation events),” the authors report.
The assembly also suggests that previous estimates of evolutionary divergence and population sizes were not as accurate as expected. “Although the difference was subtle, we found that human versus gorilla sequence alignments were significantly less divergent with Susie3 (1.60% divergent) when compared to the published gorilla assembly (1.65% divergent),” the scientists write. “We found a strong correlation with the difference in divergence and regions enriched for Alu and G+C content … suggesting that mismapping, collapse or underrepresentation within these regions of the Illumina-based assemblies may be contributing to this excess of divergence.” They also report that previous estimates of the most recent population bottleneck for western lowland gorillas “may have been underestimated by a factor of ~1.5, highlighting the importance of using higher quality assemblies when fitting demographic models.”
The scientists note that SMRT Sequencing has put high-quality de novo mammalian assemblies within reach of individual labs. “Our results demonstrate the utility of long-read sequence technology to generate high-quality working draft genomes of complex vertebrate genomes without guidance from preexisting reference genomes,” they conclude. “The genome assembly that results from using the long-read data provides a more complete picture of gene content, structural variation and repeat biology as well as allows us to refine population genetic and evolutionary inferences.”
This advance was also presented by Christopher Hill at AGBT, and a recording of his presentation is available.