Several new 3rd-generation long-range DNA sequencing and mapping technologies have recently become available and are driving a resurgence in genome sequence quality. Unlike their 2nd-generation, short-read counterparts, which can resolve only a few hundred or a few thousand base pairs, the new technologies can routinely sequence 10,000 bp reads or map across 100,000 bp molecules. These substantially greater lengths are being used to address a number of important problems in genomics and medicine, including de novo genome assembly, structural variation detection, and haplotype phasing. Here we discuss the capabilities of the latest technologies and show how they will improve the “3Cs of Genome Assembly”: contiguity, completeness, and correctness. We derive this analysis from (1) a meta-analysis of the currently available 3rd-generation genome assemblies, (2) a retrospective analysis of the evolution of the human reference genome, and (3) extensive simulations with dozens of species across the tree of life. We also propose a model based on support vector regression (SVR) that predicts genome assembly performance from four features: read length (L) and coverage (C), which can be used to evaluate potential technologies, together with genome size (G) and repeat content (R), which capture species-specific characteristics. The proposed model significantly improves the prediction of genome assembly performance by adopting a data-driven approach and addressing limitations of the previous hypothesis-driven methodology. Overall, we anticipate that these technologies will unlock the genomic “dark matter” and provide many new insights into evolution, agriculture, and human diseases.
Organization: Cold Spring Harbor Laboratory
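
To make the four-feature SVR model described in the abstract more concrete, the following is a minimal sketch of how such a predictor could be set up, assuming scikit-learn and NumPy are available. It is not the authors' implementation: the training data are synthetic, the target formula is an invented placeholder standing in for a measured assembly metric, and the log-scaling of features, the RBF kernel, and the hyperparameter values are all illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200  # number of hypothetical (simulated) assemblies

# Hypothetical feature table; ranges are illustrative, not from the paper.
read_len = rng.uniform(1e2, 3e4, n)     # L: mean read length (bp)
coverage = rng.uniform(5, 100, n)       # C: fold coverage
genome_size = rng.uniform(1e6, 3e9, n)  # G: genome size (bp)
repeat_frac = rng.uniform(0.0, 0.7, n)  # R: repeat content, assumed here to be a fraction

# Placeholder target, invented only so the example runs. In practice this
# would be a measured assembly metric (for example, a contiguity score).
y = (np.log10(read_len) + 0.5 * np.log10(coverage)
     - 2.0 * repeat_frac + rng.normal(0.0, 0.1, n))

# Features spanning orders of magnitude are log-scaled (an assumption).
X = np.column_stack([
    np.log10(read_len),
    np.log10(coverage),
    np.log10(genome_size),
    repeat_frac,
])

# Standardize features, then fit an RBF-kernel SVR.
# Note: SVR's `C` is the regularization strength, unrelated to coverage (C).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X[:150], y[:150])

pred = model.predict(X[150:])        # predictions for 50 held-out rows
r2 = model.score(X[150:], y[150:])   # coefficient of determination on held-out rows
print(f"held-out R^2 on synthetic data: {r2:.3f}")
```

With real data, each row would correspond to one assembly, and the held-out score would indicate how well L, C, G, and R together explain the chosen performance metric.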