Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.© 2017 Huddleston et al.; Published by Cold Spring Harbor Laboratory Press.
The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome–78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content show this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicate misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly. © 2014 Steinberg et al.; Published by Cold Spring Harbor Laboratory Press.
The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.
The intractability of homogeneous a-satellite arrays has impeded understanding of human centromeres. Artificial centromeres are produced from higher-order repeats (HORs) present at centromere edges, although the exact sequences and chromatin conformations of centromere cores remain unknown. We use high-resolution chromatin immunoprecipitation (ChIP) of centromere components followed by clustering of sequence data as an unbiased approach to identify functional centromere sequences. We find that specific dimeric a-satellite units shared by multiple individuals dominate functional human centromeres. We identify two recently homogenized a-satellite dimers that are occupied by precisely positioned CENP-A (cenH3) nucleosomes with two ~100-base pair (bp) DNA wraps in tandem separated by a CENP-B/CENP-C-containing linker, whereas pericentromeric HORs show diffuse positioning. Precise positioning is largely maintained, whereas abundance decreases exponentially with divergence, which suggests that young a-satellite dimers with paired ~100-bp particles mediate evolution of functional human centromeres. Our unbiased strategy for identifying functional centromeric sequences should be generally applicable to tandem repeat arrays that dominate the centromeres of most eukaryotes.
Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent ‘third-generation’ sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates.We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: email@example.com.
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.
The human reference genome assembly plays a central role in nearly all aspects of today’s basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health. © 2017 Schneider et al.; Published by Cold Spring Harbor Laboratory Press.
Complex chromosomal rearrangements are structural genomic alterations involving multiple instances of deletions, duplications, inversions, or translocations that co-occur either on the same chromosome or represent different overlapping events on homologous chromosomes. We present SVelter, an algorithm that identifies regions of the genome suspected to harbor a complex event and then resolves the structure by iteratively rearranging the local genome structure, in a randomized fashion, with each structure scored against characteristics of the observed sequencing data. SVelter is able to accurately reconstruct complex chromosomal rearrangements when compared to well-characterized genomes that have been deeply sequenced with both short and long reads.
Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Unlike reads shorter than the underlying repeat structure that rely on indirect inference methods, e.g. assembly, long reads allow direct inference of satellite higher order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), that takes advantage of Pacific Bioscience long-reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly.We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere high order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a method to generate a sequence repeat resolution that could be readily performed using consensus sequences available for other satellite families in genomes without high-quality reference assemblies.Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI CONTACT: firstname.lastname@example.orgSupplementary information: Supplementary data are available at Bioinformatics online.© The Author 2016. Published by Oxford University Press.