Menu
July 7, 2019

FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods.

Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE .


July 7, 2019

ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.

The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50?=?4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50?=?14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~?10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n?=?13).ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.


July 7, 2019

GtTR: Bayesian estimation of absolute tandem repeat copy number using sequence capture and high throughput sequencing.

Tandem repeats comprise significant proportion of the human genome including coding and regulatory regions. They are highly prone to repeat number variation and nucleotide mutation due to their repetitive and unstable nature, making them a major source of genomic variation between individuals. Despite recent advances in high throughput sequencing, analysis of tandem repeats in the context of complex diseases is still hindered by technical limitations. We report a novel targeted sequencing approach, which allows simultaneous analysis of hundreds of repeats. We developed a Bayesian algorithm, namely – GtTR – which combines information from a reference long-read dataset with a short read counting approach to genotype tandem repeats at population scale. PCR sizing analysis was used for validation.We used a PacBio long-read sequenced sample to generate a reference tandem repeat genotype dataset with on average 13% absolute deviation from PCR sizing results. Using this reference dataset GtTR generated estimates of VNTR copy number with accuracy within 95% high posterior density (HPD) intervals of 68 and 83% for capture sequence data and 200X WGS data respectively, improving to 87 and 94% with use of a PCR reference. We show that the genotype resolution increases as a function of depth, such that the median 95% HPD interval lies within 25, 14, 12 and 8% of the its midpoint copy number value for 30X, 200X WGS, 395X and 800X capture sequence data respectively. We validated nine targets by PCR sizing analysis and genotype estimates from sequencing results correlated well with PCR results.The novel genotyping approach described here presents a new cost-effective method to explore previously unrecognized class of repeat variation in GWAS studies of complex diseases at the population level. Further improvements in accuracy can be obtained by improving accuracy of the reference dataset.


July 7, 2019

Meeting report: mobile genetic elements and genome plasticity 2018

The Mobile Genetic Elements and Genome Plasticity conference was hosted by Keystone Symposia in Santa Fe, NM USA, February 11–15, 2018. The organizers were Marlene Belfort, Evan Eichler, Henry Levin and Lynn Maquat. The goal of this conference was to bring together scientists from around the world to discuss the function of transposable elements and their impact on host species. Central themes of the meeting included recent innovations in genome analysis and the role of mobile DNA in disease and evolution. The conference included 200 scientists who participated in poster presentations, short talks selected from abstracts, and invited talks. A total of 58 talks were organized into eight sessions and two workshops. The topics varied from mechanisms of mobilization, to the structure of genomes and their defense strategies to protect against transposable elements.


July 7, 2019

iMGEins: detecting novel mobile genetic elements inserted in individual genomes.

Recent advances in sequencing technology have allowed us to investigate personal genomes to find structural variations, which have been studied extensively to identify their association with the physiology of diseases such as cancer. In particular, mobile genetic elements (MGEs) are one of the major constituents of the human genomes, and cause genome instability by insertion, mutation, and rearrangement.We have developed a new program, iMGEins, to identify such novel MGEs by using sequencing reads of individual genomes, and to explore the breakpoints with the supporting reads and MGEs detected. iMGEins is the first MGE detection program that integrates three algorithmic components: discordant read-pair mapping, split-read mapping, and insertion sequence assembly. Our evaluation results showed its outstanding performance in detecting novel MGEs from simulated genomes, as well as real personal genomes. In detail, the average recall and precision rates of iMGEins are 96.67 and 100%, respectively, which are the highest among the programs compared. In the testing with real human genomes of the NA12878 sample, iMGEins shows the highest accuracy in detecting MGEs within 20?bp proximity of the breakpoints annotated.In order to study the dynamics of MGEs in individual genomes, iMGEins was developed to accurately detect breakpoints and report inserted MGEs. Compared with other programs, iMGEins has valuable features of identifying novel MGEs and assembling the MGEs inserted.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.