Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive…
Highly repetitive satellite DNA (satDNA) repeats are found in most eukaryotic genomes. SatDNAs are rapidly evolving and have roles in genome stability and chromosome segregation. Their repetitive nature poses a challenge for genome assembly and makes progress on the detailed study of satDNA structure difficult. Here, we use single-molecule sequencing long reads from Pacific Biosciences (PacBio) to determine the detailed structure of all major autosomal complex satDNA loci in Drosophila melanogaster, with a particular focus on the 260-bp and Responder satellites. We determine the optimal de novo assembly methods and parameter combinations required to produce a high-quality assembly of these…
Third-generation sequencing (TGS) platforms that generate long reads, such as PacBio and Oxford Nanopore technologies, have had a dramatic impact on genomics research. However, despite recent improvements, TGS reads suffer from high-error rates and the development of read correction methods is an active field of research. This motivates the need to develop tools that can evaluate the accuracy of noisy long reads correction tools.We introduce LRCstats, a tool that measures the accuracy of long reads correction tools. LRCstats takes advantage of long reads simulators that provide each simulated read with an alignment to the reference genome segment they originate from,…
Sex-chromosomes have formed repeatedly across Diptera from ordinary autosomes, and X-chromosomes mostly conserve their ancestral genes. Y-chromosomes are characterized by abundant gene-loss and an accumulation of repetitive DNA, yet the nature of the gene repertoire of fly Y-chromosomes is largely unknown. Here we trace gene-content evolution of Y-chromosomes across 22 Diptera species, using a subtraction pipeline that infers Y genes from male and female genome, and transcriptome data. Few genes remain on old Y-chromosomes, but the number of inferred Y-genes varies substantially between species. Young Y-chromosomes still show clear evidence of their autosomal origins, but most genes on old Y-chromosomes…
The next generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, the assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology of Illumina produces reads that are too short to bridge many repeats, setting limits on what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are…
Recent technological developments have revolutionized the way we perform genetic analyses. In particular whole-genome sequencing provides access to the entire genetic makeup of an individual, and it is now an affordable approach for many research groups. As a consequence genome sequencing is pervading many fields of biological research. Sequencing technologies are evolving rapidly and so do their applications. Here we provide a first primer on whole-genome sequencing, focusing on two of the most popular applications: (1) de novo genome sequencing, in which the objective is obtaining a high-quality genome assembly that can serve as a reference for a species or…
Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data…