Structural variants (SVs), which include deletions, insertions, duplications, inversions and chromosomal rearrangements, have been shown to affect organism phenotypes, including changing gene expression, increasing disease risk, and playing an important role in cancer development. Still, it remains challenging to detect all types of SVs from high-throughput sequencing data, and it is even harder to detect more complex SVs such as a duplication nested within an inversion. To overcome these challenges, we developed algorithms for SV analysis using longer third-generation sequencing reads. The increased read lengths allow us to span more complex SVs and accurately assess SVs in repetitive regions, two of the major limitations when using short Illumina data. Our enhanced open-source analysis method Sniffles accurately detects structural variants based on split-read mapping and assessment of the alignments. Sniffles uses a self-balancing interval tree in combination with a plane sweep algorithm to manage and assess the identified SVs. Central to its high accuracy is its advanced scoring model that can distinguish erroneous alignments from true breakpoints flanking SVs. In experiments with simulated and real genomes (e.g., human breast cancer), we find that Sniffles outperforms all other SV analysis approaches in both the sensitivity of finding events and the specificity of those events. Sniffles is available at: https://github.com/fritzsedlazeck/Sniffles
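The idea of sweeping over candidate SVs and grouping those supported by several reads can be illustrated with a minimal Python sketch. This is not the Sniffles implementation: the function name `cluster_candidates`, the parameters `max_dist` and `min_support`, and the toy coordinates are assumptions made for illustration, and a plain list of active clusters stands in for the self-balancing interval tree mentioned above.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Cluster:
    """Candidate SVs of one type that appear to describe the same event."""
    sv_type: str
    starts: list
    ends: list


def cluster_candidates(candidates, max_dist=50, min_support=2):
    """Group per-read SV candidates with a left-to-right sweep.

    candidates: iterable of (sv_type, start, end) tuples, one per read.
    max_dist and min_support are hypothetical tuning parameters.
    Returns a list of (sv_type, start, end, support) sorted by start.
    """
    active = []    # clusters the sweep line can still extend
    finished = []  # clusters the sweep line has moved past
    for sv_type, start, end in sorted(candidates, key=lambda c: c[1]):
        # Retire clusters whose latest start is too far left of the sweep line.
        still_active = []
        for cl in active:
            if start - cl.starts[-1] > max_dist:
                finished.append(cl)
            else:
                still_active.append(cl)
        active = still_active
        # Join the first compatible cluster, or open a new one.
        for cl in active:
            if cl.sv_type == sv_type and abs(end - median(cl.ends)) <= max_dist:
                cl.starts.append(start)
                cl.ends.append(end)
                break
        else:
            active.append(Cluster(sv_type, [start], [end]))
    finished.extend(active)
    calls = [
        (cl.sv_type, round(median(cl.starts)), round(median(cl.ends)), len(cl.starts))
        for cl in finished
        if len(cl.starts) >= min_support
    ]
    return sorted(calls, key=lambda c: c[1])


# Toy input: three reads agree on one deletion; two candidates are unsupported.
reads = [
    ("DEL", 1000, 2000),
    ("DEL", 1010, 1990),
    ("DEL", 1005, 2005),
    ("DEL", 1020, 5000),
    ("INV", 5000, 9000),
]
calls = cluster_candidates(reads)
```

In this sketch, requiring a minimum number of supporting reads is a crude stand-in for a scoring model: isolated candidates, which are more likely to stem from erroneous alignments, are dropped.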
Characterizing haplotype diversity at the immunoglobulin heavy chain locus across human populations using novel long-read sequencing and assembly approaches
The human immunoglobulin heavy chain locus (IGH) remains among the most understudied regions of the human genome. Recent efforts have shown that haplotype diversity within IGH is elevated and exhibits population-specific patterns; for example, our re-sequencing of the locus from only a single chromosome uncovered >100 Kb of novel sequence, including descriptions of six novel alleles and four previously unmapped genes. Historically, this complex locus architecture has hindered the characterization of IGH germline single-nucleotide, copy-number, and structural variants (SNVs; CNVs; SVs), and as a result, little is known about the role of IGH polymorphisms in inter-individual antibody repertoire variability and disease. To remedy this, we are taking a multi-faceted approach to improving existing genomic resources in the human IGH region. First, from whole-genome and fosmid-based datasets, we are building the largest and most ethnically diverse set of IGH reference assemblies to date, by employing PacBio long-read sequencing combined with novel algorithms for phased haplotype assembly. In total, our effort will result in the characterization of >15 phased haplotypes from individuals of Asian, African, and European descent, to be used as a representative reference set by the genomics and immunogenetics community. Second, we are utilizing this more comprehensive sequence catalogue to inform the design and analysis of novel targeted IGH genotyping assays. Standard targeted DNA enrichment methods (e.g., exome capture) are currently optimized for the capture of only very short (hundreds of bp) DNA segments. Our platform uses a modified bench protocol to pair existing capture-array technologies with the enrichment of longer fragments of DNA, enabling PacBio sequencing of DNA segments up to 7 Kb.
This substantial increase in contiguity disambiguates many of the complex repeated structures inherent to the locus, while yielding the base pair fidelity required to call SNVs. Together these resources will establish a stronger framework for further characterizing IGH genetic diversity and facilitate IGH genomic profiling in the clinical and research settings, which will be key to fully understanding the role of IGH germline variation in antibody repertoire development and disease.
Tremendous flexibility is maintained in the human proteome via alternative splicing, and cancer genomes often subvert this flexibility to promote survival. Identification and annotation of cancer-specific mRNA isoforms is critical…
A 12-kb structural variation in progressive myoclonic epilepsy was newly identified by long-read whole-genome sequencing.
We report a family with progressive myoclonic epilepsy who underwent whole-exome sequencing but was negative for pathogenic variants. Similar clinical courses of a devastating neurodegenerative phenotype of two affected siblings were highly suggestive of a genetic etiology, which indicates that the survey of genetic variation by whole-exome sequencing was not comprehensive. To investigate the presence of a variant that remained unrecognized by standard genetic testing, PacBio long-read sequencing was performed. Structural variant (SV) detection using low-coverage (6×) whole-genome sequencing called 17,165 SVs (7,216 deletions and 9,949 insertions). Our SV selection narrowed down potential candidates to only five SVs (two deletions and three insertions) on the genes tagged with autosomal recessive phenotypes. Among them, a 12.4-kb deletion involving the CLN6 gene was the top candidate because its homozygous abnormalities cause neuronal ceroid lipofuscinosis. This deletion included the initiation codon and was found in a GC-rich region containing multiple repetitive elements. These results indicate the presence of a causal variant in a difficult-to-sequence region and suggest that such variants that remain enigmatic after the application of current whole-exome sequencing technology could be uncovered by unbiased application of long-read whole-genome sequencing.
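The candidate-selection step described above, which keeps only SVs that fall on genes associated with autosomal recessive phenotypes, can be sketched in Python as an interval-overlap filter. This is an illustrative assumption about how such a filter might look, not the authors' pipeline: the function names, the record layout, the gene name `GENE_A`, and all coordinates are hypothetical placeholders.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def select_candidates(svs, recessive_genes):
    """Return (sv_id, gene_name) pairs for SVs that overlap a listed gene.

    svs and recessive_genes are lists of dicts using a hypothetical layout:
    "chrom", "start", "end", plus "id" (SVs) or "name" (genes).
    """
    hits = []
    for sv in svs:
        for gene in recessive_genes:
            same_chrom = sv["chrom"] == gene["chrom"]
            if same_chrom and overlaps(sv["start"], sv["end"], gene["start"], gene["end"]):
                hits.append((sv["id"], gene["name"]))
    return hits


# Placeholder data: coordinates are invented and carry no biological meaning.
svs = [
    {"id": "sv1", "chrom": "chr1", "type": "DEL", "start": 100, "end": 500},
    {"id": "sv2", "chrom": "chr1", "type": "INS", "start": 900, "end": 901},
    {"id": "sv3", "chrom": "chr2", "type": "DEL", "start": 100, "end": 500},
]
recessive_genes = [
    {"name": "GENE_A", "chrom": "chr1", "start": 400, "end": 800},
]
candidates = select_candidates(svs, recessive_genes)
```

Insertions are modelled here as one-base intervals at the insertion site so that the same overlap test covers both SV types; a real analysis would likely work from VCF records and a curated gene annotation.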
Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes.
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture, making their DNA sequence our only clue to their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, metagenome assemblies contain errors regardless of the assembly algorithm and sequencing method. Recent developments in assembly validation tools have played a pivotal role in improving metagenomic assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation. © The Author 2017. Published by Oxford University Press.