Explore how long-read sequencing helps solve rare and Mendelian diseases.
Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we sequenced a diversity panel of nine human genomes to high depth using long-read, single-molecule, real-time sequencing. Systematically identifying and merging SVs ≥50 bp in length for these nine genomes and one public genome yielded 83,909 sequence-resolved insertions, deletions, and inversions. Among these, 2,839 (2.0 Mbp) are shared among all discovery genomes, with an additional 13,349 (6.9 Mbp) present in the majority of humans, indicating minor alleles or errors in the reference, which is partially explained by an enrichment for GC content and repetitive DNA. Genotyping 83% of these in 290 additional genomes confirms that at least one allele of the most common SVs in unique euchromatin is now sequence-resolved. We observe a 9-fold increase in SVs within 5 Mbp of chromosome telomeric ends and a correlation with de novo single-nucleotide variant mutations, showing that such variation is nonrandomly distributed and defining potential hotspots of mutation. We identify SVs affecting coding and noncoding regulatory loci, improving annotation and interpretation of functional variation. To illustrate the utility of sequence-resolved SVs in resequencing experiments, we mapped 30 diverse high-coverage Illumina-sequenced samples to GRCh38 with and without contigs containing SV insertions as alternate sequences, and we found that these additional sequences recover 6.4% of unmapped reads. For reads mapped within the SV insertion, 25.7% have a better mapping quality, and 18.7% improved by 1,000-fold or more.
We reveal 72,964 occurrences of 15,814 unique variants that were not discoverable with the reference sequence alone, and we note that 7% of the insertions contain an SV in at least one sample, indicating that there are additional alleles in the population that remain to be discovered. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. We present a summary of our findings and discuss ideas for revealing variation that was once difficult to ascertain.
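The tallies above (SVs shared by all discovery genomes versus those present in a majority) can be illustrated with a minimal sketch. It assumes merged calls are available as simple tuples of an identifier, a length in bp, and the list of genomes carrying the variant. The function names, the tuple layout, and the strict-majority cutoff are hypothetical choices made for illustration and are not taken from the study's pipeline.

```python
from collections import Counter

MIN_SV_LEN = 50  # size cutoff in bp, matching the >=50 bp SV definition


def classify_sv(carriers, n_genomes):
    """Label a merged SV by how many discovery genomes carry it.

    The strict-majority rule used here is an assumption for illustration.
    """
    n = len(set(carriers))
    if n == n_genomes:
        return "shared"
    if n > n_genomes / 2:
        return "majority"
    return "minority"


def summarize(calls, n_genomes):
    """Count SVs and total affected bases per class.

    calls: iterable of (sv_id, length_bp, carriers) tuples (hypothetical layout).
    Calls shorter than MIN_SV_LEN are skipped.
    """
    counts = Counter()
    bases = Counter()
    for sv_id, length_bp, carriers in calls:
        if length_bp < MIN_SV_LEN:
            continue
        label = classify_sv(carriers, n_genomes)
        counts[label] += 1
        bases[label] += length_bp
    return counts, bases
```

Calling `summarize(calls, n_genomes=10)` on such a list returns one counter with the number of SVs per class and another with the total base pairs per class, which mirrors how the count and Mbp figures are reported together in the abstract.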
Genomics studies have shown that the insertions, deletions, duplications, translocations, inversions, and tandem repeat expansions in the structural variant (SV) size range (>50 bp) contribute to the evolution of traits and often have significant associations with agronomically important phenotypes. However, most SVs are too small to detect with array comparative genomic hybridization and too large to reliably discover with short-read DNA sequencing. While de novo assembly is the most comprehensive way to identify variants in a genome, recent studies in human genomes show that PacBio SMRT Sequencing sensitively detects structural variants at low coverage. Here we present SV characterization in the major crop species Oryza sativa subsp. indica (rice) with low-fold coverage of long reads. In addition, we provide recommendations for sequencing and analysis for the application of this workflow to other important agricultural species.
Structural variants (SVs, differences >50 base pairs) account for most of the base pairs that differ between two human genomes, and are known to cause over 1,000 genetic disorders including…
In this presentation, Fritz Sedlazeck describes his latest work on obtaining comprehensive genomes by leveraging long-read sequencing and linked reads.
In this PacBio User Group Meeting presentation, Jonas Korlach and Roberto Lleras share the latest updates to the structural variation application and analysis tools.