New Assembly for Complex Bread Wheat Genome: 10 Times Higher Contiguity
Wednesday, July 19, 2017
UPDATE: This preprint is now published! Check it out in the November 2017 issue of GigaScience.
In a new bioRxiv preprint, scientists from Johns Hopkins present a major step forward in accuracy and completeness for the wheat genome. Their new assembly, generated largely from PacBio data, demonstrates the importance of using long, highly accurate reads for resolving extremely complex, repetitive genomes.
“The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum,” comes from lead author Aleksey Zimin, senior author Steven Salzberg, and collaborators. In launching this project, the team aimed to overcome a longstanding challenge for the wheat research community. “Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall size of more than 15 billion bases,” they write. “Multiple past attempts to assemble the genome have failed.”
The first publication for this species, which came out in 2012, assembled only one-third of the genome. In 2014, a short-read assembly captured two-thirds of the genome but was highly fragmented, while a subsequent short-read-based effort delivered more sequence but in millions of contigs.
For this project, scientists adopted SMRT Sequencing to produce reads long enough to span repetitive elements that are a hallmark of the plant’s genome. The result was phenomenal: “Ours is the first assembly that contains essentially the entire length of the genome, with more than 15.3 billion bases, and its contiguity is more than ten times better than the partial assemblies published in the past,” the authors report.
The team took two approaches to creating the assembly: a hybrid Illumina-PacBio version, and an all-PacBio version. Ultimately, they merged both to create the final assembly, which has a contig N50 of 232.6 kb; the longest contig is 4.5 Mb. The PacBio-only version, produced with the FALCON assembler, relied on 36-fold genome coverage to generate an assembly of 12.94 Gb with a contig N50 of 215.3 kb. “The key factor in producing a true draft assembly for this exceptionally repetitive genome was the use of very long reads, averaging just under 10,000 bp each, which were required to span the long, ubiquitous repeats in the wheat genome,” the scientists note.
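For readers less familiar with the contiguity metric quoted above: contig N50 is the contig length at which, counting from the longest contig downward, the running total first reaches half of the assembled bases. The Python sketch below is only an illustration of that definition. The helper name `contig_n50` and the toy contig lengths are made up for this example and are not taken from the paper or its assembly pipeline.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length of the contig at which the cumulative sum,
    taken from the longest contig downward, first reaches at least
    half of the total assembled length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [100, 80, 50, 30, 20, 10, 10]
print(contig_n50(toy_contigs))
```

A higher N50 means more of the assembly sits in long contigs, which is why the authors use it to compare their assembly against the earlier, more fragmented ones.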
Evaluating the assembly’s quality was a tall order given the state of previous assemblies. The scientists compared it to an assembly for a diploid ancestor of bread wheat and found that 99.8% of the smaller genome aligned to the new assembly, offering “strong support for its accuracy” as well as its completeness, they write.
One of the most interesting findings of this effort was the delineation of that ancestral plant’s contributions to the bread wheat genome (known as the wheat D genome). “By aligning this assembly to the draft genome of Aegilops tauschii, the progenitor of the wheat D genome, we were able to cleanly separate the D genome component from the A and B genomes of hexaploid wheat, which is reported here for the first time,” the team explains.
Ultimately, the scientists believe the new wheat assembly offers a significant boost to the wheat community, which has never had the benefit of a well-annotated, high-quality genome for crop improvement efforts. “This represents by far the most complete and contiguous assembly of the wheat genome to date,” the scientists write, “providing a strong foundation for future genetic studies of this important food crop.”