July 7, 2019  |  

Smooth q-Gram, and its applications to detection of overlaps among long, error-prone sequencing reads

We propose smooth q-gram, the first variant of q-gram that captures q-gram pairs within a small edit distance. We apply smooth q-gram to the problem of detecting overlapping pairs of error-prone reads produced by single-molecule real-time (SMRT) sequencing, which is the first and most critical step of the de novo fragment assembly of SMRT reads. We have implemented and tested our algorithm on a set of real-world benchmarks. Our empirical results demonstrate the significant superiority of our algorithm over existing q-gram-based algorithms in accuracy.
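For context, the plain q-gram baseline that the abstract compares against can be sketched as below. This is not the smooth q-gram method itself, whose details the abstract does not give; the function names `qgrams` and `shared_qgrams` are hypothetical and chosen only for illustration. The sketch shows why exact q-gram matching is fragile on error-prone reads: a single substitution removes most of the shared q-grams.

```python
def qgrams(seq, q):
    """Return the set of all length-q substrings of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def shared_qgrams(a, b, q):
    """Count distinct q-grams that occur in both sequences (exact match only)."""
    return len(qgrams(a, q) & qgrams(b, q))


# Two reads that overlap without errors share many 3-grams.
clean = shared_qgrams("ACGTAC", "CGTACG", 3)

# One substitution (G -> C at index 2) leaves only a single shared 3-gram.
noisy = shared_qgrams("ACGTAC", "ACCTAC", 3)

print(clean, noisy)
```

An overlap detector built on exact q-grams would typically report a candidate pair when the shared count exceeds a threshold, which is the step that tolerance to small edit distances is meant to make more robust.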


July 7, 2019  |  

Tigmint: correcting assembly errors using linked reads from large molecules.

Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap. To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected.
Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.


July 7, 2019  |  

FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods.

Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE .


July 7, 2019  |  

ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.

The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets.
Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
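The Jaccard index mentioned above can be illustrated with a minimal sketch. It assumes that each contig region is summarized by the set of linked-read barcodes mapped to it; the barcode labels and the function name `jaccard` are hypothetical, and this is not the ARKS implementation or its distance model. The intuition is that regions close together on the genome tend to share more barcodes, so a higher index suggests a smaller gap.

```python
def jaccard(barcodes_a, barcodes_b):
    """Jaccard index |A & B| / |A | B| of two barcode collections."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    if not union:
        # Two empty sets: define the index as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical barcode sets observed at two contig ends.
end_of_contig_1 = {"BX1", "BX2", "BX3"}
start_of_contig_2 = {"BX2", "BX3", "BX4"}

print(jaccard(end_of_contig_1, start_of_contig_2))  # 2 shared / 4 total = 0.5
```

A distance estimator in this spirit would first measure the index for region pairs at known distances within contigs, then use that relationship to infer the unknown distance between contigs.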


July 7, 2019  |  

GtTR: Bayesian estimation of absolute tandem repeat copy number using sequence capture and high throughput sequencing.

Tandem repeats comprise a significant proportion of the human genome, including coding and regulatory regions. They are highly prone to repeat number variation and nucleotide mutation due to their repetitive and unstable nature, making them a major source of genomic variation between individuals. Despite recent advances in high throughput sequencing, analysis of tandem repeats in the context of complex diseases is still hindered by technical limitations. We report a novel targeted sequencing approach, which allows simultaneous analysis of hundreds of repeats. We developed a Bayesian algorithm, GtTR, which combines information from a reference long-read dataset with a short read counting approach to genotype tandem repeats at population scale. PCR sizing analysis was used for validation. We used a PacBio long-read sequenced sample to generate a reference tandem repeat genotype dataset with on average 13% absolute deviation from PCR sizing results. Using this reference dataset, GtTR generated estimates of VNTR copy number with accuracy within 95% high posterior density (HPD) intervals of 68 and 83% for capture sequence data and 200X WGS data respectively, improving to 87 and 94% with use of a PCR reference. We show that the genotype resolution increases as a function of depth, such that the median 95% HPD interval lies within 25, 14, 12 and 8% of its midpoint copy number value for 30X, 200X WGS, 395X and 800X capture sequence data respectively. We validated nine targets by PCR sizing analysis, and genotype estimates from sequencing results correlated well with PCR results. The novel genotyping approach described here presents a new cost-effective method to explore a previously unrecognized class of repeat variation in GWAS studies of complex diseases at the population level. Further improvements in accuracy can be obtained by improving the accuracy of the reference dataset.


July 7, 2019  |  

Meeting report: mobile genetic elements and genome plasticity 2018

The Mobile Genetic Elements and Genome Plasticity conference was hosted by Keystone Symposia in Santa Fe, NM, USA, February 11–15, 2018. The organizers were Marlene Belfort, Evan Eichler, Henry Levin and Lynn Maquat. The goal of this conference was to bring together scientists from around the world to discuss the function of transposable elements and their impact on host species. Central themes of the meeting included recent innovations in genome analysis and the role of mobile DNA in disease and evolution. The conference included 200 scientists who participated in poster presentations, short talks selected from abstracts, and invited talks. A total of 58 talks were organized into eight sessions and two workshops. The topics varied from mechanisms of mobilization to the structure of genomes and their defense strategies to protect against transposable elements.


July 7, 2019  |  

Speeding up DNA sequence alignment by optical correlator

In electronic computers, the extensive computation required to search biological sequences in big databases leads to a vast amount of energy consumption for electrical processing and cooling. On the other hand, optical processing is much faster than its electrical counterpart, due to its parallel processing capability, at a fraction of the energy consumption and cost. In this regard, this paper proposes a correlation-based optical algorithm using metamaterials, taking advantage of optical parallel processing, to efficiently locate edits as a means of DNA sequence comparison. Specifically, the proposed algorithm partitions the read DNA sequence into multiple overlapping intervals, referred to as windows, and then extracts the peaks resulting from their cross-correlation with the reference sequence in parallel. Finally, to locate the edits, a simple algorithm utilizing the number and location of the peaks is introduced to analyze the correlation outputs obtained from window-based DNA sequence comparison. As a novel implementation approach, we adopt multiple metamaterial-based optical correlators to optically implement the proposed parallel architecture, named the Window-based Optical Correlator (WOC). This wave-based computing architecture fully controls wave transmission and phase using dielectric and plasmonic materials. Design limitations and challenges of the proposed architecture are also discussed in detail. The simulation results, comparing WOC with the well-known BLAST algorithm, demonstrate a speed-up of up to 60%, as well as high accuracy even in the presence of a large number of edits. The WOC method also considerably reduces power consumption as a result of implementing a metamaterial-based optical computing structure.
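The window-based comparison described above can be mimicked in software with a rough sketch. It assumes that the correlation between a window and the reference is approximated by counting matching bases at each offset, and that a window whose best peak falls short of the window length contains an edit. All function names, the window size, and the step are hypothetical; this is a serial digital analogue for illustration, not the optical WOC architecture or the authors' peak-analysis algorithm.

```python
def match_profile(window, reference):
    """Number of matching bases between window and reference at every offset."""
    w = len(window)
    return [
        sum(1 for a, b in zip(window, reference[i:i + w]) if a == b)
        for i in range(len(reference) - w + 1)
    ]


def window_peaks(read, reference, size, step):
    """For each read window, return (window start, best offset, peak height)."""
    peaks = []
    for start in range(0, len(read) - size + 1, step):
        profile = match_profile(read[start:start + size], reference)
        # First offset with the highest match count.
        best = max(range(len(profile)), key=profile.__getitem__)
        peaks.append((start, best, profile[best]))
    return peaks


def windows_with_edits(peaks, size):
    """Window starts whose peak is below a perfect match, i.e. likely edited."""
    return [start for start, _, score in peaks if score < size]


reference = "ACGTTGCAAGCT"
read = "ACGTTGCAAGGT"  # one substitution at index 10 (C -> G)

# A step smaller than the window size makes the windows overlap.
peaks = window_peaks(read, reference, size=6, step=3)
print(peaks)
print(windows_with_edits(peaks, size=6))
```

In this toy case only the last window has a peak below the window length, which narrows the edit down to positions 6 through 11 of the read.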


July 7, 2019  |  

Spalter: A meta machine learning approach to distinguish true DNA variants from sequencing artefacts

Being able to distinguish between true DNA variants and technical sequencing artefacts is a fundamental task in whole genome, exome or targeted gene analysis. Variant calling tools provide diagnostic parameters, such as strand bias or an aggregated overall quality for each called variant, to help users make an informed choice about which variants to accept or discard. Having several such quality indicators poses a problem for the users of variant callers because they need to set or adjust thresholds for each such indicator. Alternatively, machine learning methods can be used to train a classifier based on these indicators. This approach needs large sets of labeled training data, which are not easily available. The new approach presented here relies on the idea that a true DNA variant exists independently of technical features of the read in which it appears (e.g. base quality, strand, position in the read). Therefore the nucleotide separability classification problem – predicting the nucleotide state of each read in a given pileup based on technical features only – should be nearly impossible to solve for true variants. Nucleotide separability, i.e. achievable classification accuracy, can either be used to distinguish between true variants and technical artefacts directly, using a thresholding approach, or it can be used as a meta-feature to train a separability-based classifier. This article explores both possibilities with promising results, showing accuracies around 90%.


July 7, 2019  |  

STRetch: detecting and discovering pathogenic short tandem repeat expansions.

Short tandem repeat (STR) expansions have been identified as the causal DNA mutation in dozens of Mendelian diseases. Most existing tools for detecting STR variation with short reads do so within the read length and so are unable to detect the majority of pathogenic expansions. Here we present STRetch, a new genome-wide method to scan for STR expansions at all loci across the human genome. We demonstrate the use of STRetch for detecting STR expansions using short-read whole-genome sequencing data at known pathogenic loci as well as novel STR loci. STRetch is open source software, available from github.com/Oshlack/STRetch .


July 7, 2019  |  

Picky comprehensively detects high-resolution structural variants in nanopore long reads.

Acquired genomic structural variants (SVs) are major hallmarks of cancer genomes, but they are challenging to reconstruct from short-read sequencing data. Here we exploited the long reads of the nanopore platform using our customized pipeline, Picky ( https://github.com/TheJacksonLaboratory/Picky ), to reveal SVs of diverse architecture in a breast cancer model. We identified the full spectrum of SVs with superior specificity and sensitivity relative to short-read analyses, and uncovered repetitive DNA as the major source of variation. Examination of genome-wide breakpoints at nucleotide resolution uncovered micro-insertions as the common structural features associated with SVs. Breakpoint density across the genome is associated with the propensity for interchromosomal connectivity and was found to be enriched in promoters and transcribed regions of the genome. Furthermore, we observed an over-representation of reciprocal translocations from chromosomal double-crossovers through phased SVs. We demonstrate that Picky analysis is an effective tool for comprehensive detection of SVs in cancer genomes from long-read data.


July 7, 2019  |  

iMGEins: detecting novel mobile genetic elements inserted in individual genomes.

Recent advances in sequencing technology have allowed us to investigate personal genomes to find structural variations, which have been studied extensively to identify their association with the physiology of diseases such as cancer. In particular, mobile genetic elements (MGEs) are one of the major constituents of the human genomes, and cause genome instability by insertion, mutation, and rearrangement. We have developed a new program, iMGEins, to identify such novel MGEs by using sequencing reads of individual genomes, and to explore the breakpoints with the supporting reads and MGEs detected. iMGEins is the first MGE detection program that integrates three algorithmic components: discordant read-pair mapping, split-read mapping, and insertion sequence assembly. Our evaluation results showed its outstanding performance in detecting novel MGEs from simulated genomes, as well as real personal genomes. In detail, the average recall and precision rates of iMGEins are 96.67 and 100%, respectively, which are the highest among the programs compared. In the testing with real human genomes of the NA12878 sample, iMGEins shows the highest accuracy in detecting MGEs within 20 bp proximity of the breakpoints annotated. In order to study the dynamics of MGEs in individual genomes, iMGEins was developed to accurately detect breakpoints and report inserted MGEs. Compared with other programs, iMGEins has valuable features of identifying novel MGEs and assembling the MGEs inserted.


July 7, 2019  |  

Bridging gaps in transposable element research with single-molecule and single-cell technologies

More than half of the genomic landscape in humans and many other organisms is composed of repetitive DNA, which mostly derives from transposable elements (TEs) and viruses. Recent technological advances permit improved assessment of the repetitive content across genomes and newly developed molecular assays have revealed important roles of TEs and viruses in host genome evolution and organization. To update on our current understanding of TE biology and to promote new interdisciplinary strategies for the TE research community, leading experts gathered for the 2nd Uppsala Transposon Symposium on October 4–5, 2018 in Uppsala, Sweden. Using cutting-edge single-molecule and single-cell approaches, research on TEs and other repeats has entered a new era in biological and biomedical research.


January 23, 2017  |  

Tutorial: HGAP4 de novo assembly application

This tutorial provides an overview of the Hierarchical Genome Assembly Process (HGAP4) de novo assembly analysis application. HGAP4 generates accurate de novo assemblies using only PacBio data. HGAP4 is suitable…

