New Diploid Avian Reference Genomes to Establish Quality Standards for G10K, B10K
Tuesday, September 19, 2017
A new publication from scientists at The Rockefeller University and PacBio presents reference-grade, phased diploid genome assemblies for two important avian models for vocal learning, Anna’s hummingbird and zebra finch. Results are expected to help establish genome quality standards for the G10K and B10K sequencing projects, in addition to providing a better foundation for neuroscience studies.
Published in GigaScience, “De Novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads” comes from lead author Jonas Korlach, senior author Erich Jarvis, and collaborators. The team undertook this project to improve the quality of genome assemblies available for these birds, demonstrating that key genes of interest were completely represented in single contigs. Existing assemblies produced with Sanger or short-read sequencing were incomplete and highly fragmented, precluding the comprehensive scientific view required for a deeper understanding of vocal learning.
By incorporating SMRT Sequencing, the team not only raised the bar for assembly quality but also phased the genomes using FALCON-Unzip, a diploid assembly tool. The new zebra finch assembly represented “a 108-fold reduction in the number of contigs and a 150-fold improvement in contiguity compared to the current Sanger-based reference,” the authors write. For hummingbird, the PacBio assembly led to “a 116-fold reduction in the number of contigs and a 201-fold improvement in contiguity over the reference.” Both assemblies had contig N50s greater than 5 Mb. “These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references,” the scientists report, “including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult to sequence regions, complex repeat structure errors, and allelic differences between the two haplotypes.”
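For readers unfamiliar with the contiguity metric quoted above, here is a minimal sketch of how a contig N50 and a fold-change between two assemblies are commonly computed. The function names and the toy contig lengths are illustrative assumptions, not values or code from the paper, and the authors' exact tooling may differ.

```python
def contig_n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def fold_change(before, after):
    """Ratio of two positive values, e.g. old vs. new contig counts."""
    return before / after


# Toy data (hypothetical): a fragmented assembly and a more contiguous one.
fragmented = [2, 2, 2, 3, 3, 4, 8, 8]
contiguous = [16, 16]

n50_old = contig_n50(fragmented)
n50_new = contig_n50(contiguous)
contig_reduction = fold_change(len(fragmented), len(contiguous))
contiguity_gain = fold_change(n50_new, n50_old)
```

In this toy case the contig count drops 4-fold and the N50 doubles; the paper's reported fold changes are the same kind of ratio, taken between the earlier reference and the new PacBio assembly.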
The team assessed gene content of the assemblies with CEGMA and BUSCO comparisons. In both cases, the number of complete or nearly complete genes increased. They also used RNA-seq to evaluate the reference genomes, finding that the PacBio long-read assemblies increased “total transcript read mappings compared to the Sanger-based reference … suggesting more genic regions available for read alignments,” they write.
Finally, the scientists conducted in-depth interrogations of four genes particularly important for vocal learning. EGR1, for instance, had gaps in the previous zebra finch and hummingbird reference genomes. In both SMRT Sequencing assemblies, though, the gene was fully resolved and spanned by a single contig. There were similar improvements for DUSP1, FOXP2, and SLIT1.
“We found that the long-read diploid assemblies resulted in major improvements in genome completeness and contiguity, and completely resolved the problems in all of our genes of interest,” the scientists report. “We now, for the first time, have complete and accurate assembled genes of interest that can be pursued further without the need to individually and arduously clone, sequence, and correct the assemblies one gene at a time.”
For more, check out our recent release of Iso-Seq data for hummingbird and zebra finch.