The May issue of Genome Research is a special edition focusing on advances in sequencing technologies and genome assembly techniques. The research papers selected for this special issue cover reference-grade genome assemblies, structural variant detection, diploid assemblies, and other features enabled by new high-quality sequencing tools.
The issue kicks off with a perspective from NHGRI’s Adam Phillippy, who reflects on the history of sequencing and assembly. Dusting off publications from as early as 1979, he illustrates the remarkable pace of advances in this field over the past four decades. Phillippy has worked with just about every kind of sequence data, so his view of the current landscape is particularly instructive. “The biggest gains in contig lengths have come from single-molecule sequencing,” he writes. “Critically, 10-kb reads are longer than the most common repeats in both microbial and vertebrate genomes and can therefore generate highly continuous assemblies. In fact, the complete reconstruction of bacterial genomes—a process that used to require teams of people—is now automated and routine.” Phillippy also notes that long-read sequencing assemblies have spurred “a renewed interest in repetitive sequences, which can be properly analyzed for the first time” and are “even revealing new variation in the human genome.”
We are very pleased that more than half of the papers in this special issue feature our sequencing data and the genome assemblies derived from it, underscoring PacBio’s leading role in long-read sequencing and de novo assembly. We congratulate all the authors on their exciting contributions to this special issue and encourage you to review these excellent publications:
- Discovery and genotyping of structural variation from long-read haploid genome sequence data: Scientists used SMRT Sequencing to scan human genomes for structural variants, finding that more than 89% of the variants detected had been missed in the 1000 Genomes Project.
- Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly: An exploration of the latest human reference assembly, which expands the number of alternate loci and for the first time includes sequence coverage of centromeres.
Plant and Animal Genomes
- Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data: This project used SMRT Sequencing data to generate genomes of three relatives of the model plant Arabidopsis thaliana, assembling all three genomes into only a few hundred contigs. Integration of optical mapping and chromosome conformation capture techniques yielded chromosome-scale assemblies of these repetitive plant genomes. The scaffolds even revealed some of the heterochromatic regions that are not present in gold-standard reference sequences.
- Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster: Long-read PacBio sequencing allowed scientists to characterize complex satellite DNA regions, which have been challenging to resolve due to their repetitive nature.
- Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications: This analysis of Eurasian crow genomes found that assembling two high-quality genome references using SMRT Sequencing, combined with optical mapping, made it possible to recover missing regions and correct errors in a previous short-read-only assembly.
- An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations: Scientists used SMRT Sequencing of full-length cDNAs to annotate a new wheat genome assembly, identifying protein-coding genes and noncoding RNA genes with high confidence.
New Tools for Long-Read Data
- Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm: Scientists present a new hybrid assembly algorithm to combine short-read and long-read data for optimal accuracy and contiguity.
- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation: Based on Celera Assembler, Canu was designed for long-read data and significantly reduces computational time for genome assembly.
- HINGE: long-read assembly achieves optimal repeat resolution: This assembler focuses on resolving challenging repeats.
- Fast and accurate de novo genome assembly from long uncorrected reads: For long-read assembly, scientists pair Racon with miniasm to rapidly generate high-quality consensus sequences without an error-correction step.
- HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies: This tool performs fast, high-resolution haplotype assembly from data produced by long-read sequencing, short-read sequencing, and other genome analysis technologies.
- HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies: This method calls structural variants from human genomes using short-read and long-read sequence data; tests showed it improved detection rates for several types of variants.