In Study, Continuous Long Reads Outperform Synthetic Long Reads for Resolving Tandem Repeats
Thursday, April 30, 2015
Scientists from Argentina and Brazil published the results of a study comparing long-read approaches to characterize the genome structure of a highly complex region of the Y chromosome in Drosophila melanogaster. They found that Single Molecule, Real-Time (SMRT®) Sequencing outperformed synthetic long reads in accurately representing tandem repeats.
The study aimed to resolve the structure of the Mst77Y region, which had previously been found to contain multiple tandem copies of the autosomal gene Mst77F; the region, however, was known to be grossly misassembled in the reference. The scientists, from Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas and Universidade Federal do Rio de Janeiro, interrogated the genomic region with two approaches: Illumina TruSeq Synthetic Long-Reads data assembled with Celera Assembler, and PacBio® long-read sequence data assembled with MHAP. Results were published in the journal G3: Genes, Genomes, Genetics in a paper entitled “Long-read single molecule sequencing to resolve tandem gene copies: The Mst77Y region on the Drosophila melanogaster Y chromosome.”
Lead author Flavia Krsticevic and collaborators report that the synthetic long reads failed to completely cover the region of interest. The resulting assembly “is incomplete and fragmented,” the scientists write. “The scaffolds are small (all below 15 kb), and hence provide little information on the genomic structure and context of the Mst77Y region.”
The authors note that synthetic long reads can accurately resolve repetitive regions “as long as there is only one copy of a repeat in each 10 kb fragments; i.e., the repeats should be interspersed.” Tandem repeats, on the other hand, pose a major challenge to this approach. “It is worth noting also that several biologically interesting and poorly known regions of the Drosophila genome such as other recently duplicated genes, the histone and rDNA clusters, and the centromeres, have a tandem repeat organization, and in these cases synthetic long reads are predicted to have limited utility,” Krsticevic et al. write.
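To make the authors' condition concrete, here is a minimal sketch that counts how many repeat copies start within any single 10 kb window. It is an illustration only, not the method used in the paper: the function name, the coordinates, and the even spacing of copies are all hypothetical, and counting copy start positions is a simplification of what actually falls inside a sequenced fragment.

```python
FRAGMENT_LEN = 10_000  # approximate fragment size quoted by the authors (10 kb)


def max_copies_in_window(repeat_starts, fragment_len=FRAGMENT_LEN):
    """Return the largest number of repeat copies that start inside any
    window of length `fragment_len` beginning at a repeat start.

    A value of 1 means the copies are interspersed widely enough that no
    window holds more than one of them; a value above 1 means the copies
    are packed closely enough (tandem) to share a window.
    """
    starts = sorted(repeat_starts)
    best = 0
    for s in starts:
        # Count copies whose start lies in the half-open window [s, s + fragment_len)
        in_window = sum(1 for t in starts if s <= t < s + fragment_len)
        best = max(best, in_window)
    return best


# Hypothetical interspersed repeats: copies 50 kb apart
interspersed = [0, 50_000, 100_000, 150_000]

# Hypothetical tandem array: 18 copies spaced evenly (an assumption) about 5.3 kb apart
tandem = [i * 5_333 for i in range(18)]

interspersed_max = max_copies_in_window(interspersed)
tandem_max = max_copies_in_window(tandem)
```

Under these assumed coordinates, the interspersed layout never places more than one copy in a window, while the tandem layout always places at least two, which is the situation the authors describe as problematic for synthetic long reads.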
In contrast, the team found that SMRT Sequencing generated data fully covering the genomic region, which assembled into a single contig using MHAP. The assembly revealed 18 copies of the gene, some of them identical to one another, spanning 96 kb. The team independently validated the findings, demonstrating that six previously detected versions of this gene were likely PCR artifacts and discovering six new versions of the gene that had never been identified before. Their validation found only two single-base errors across the entire span. “Thus, the assembly of this region seems to be essentially perfect,” they write.
The scientists took advantage of the D. melanogaster reference genome that PacBio generated and made publicly available last year; it served as a comparison point to the PacBio sequence they produced. We’re glad to see that the community is finding resources like this to be helpful.