July 7, 2019  |  

Privacy-preserving read mapping using locality sensitive hashing and secure kmer voting

The recent explosion in the amount of available genome sequencing data imposes high computational demands on the tools designed to analyze it. Low-cost cloud computing has the potential to alleviate this burden. However, moving personal genome data analysis to the cloud raises serious privacy concerns. Read alignment is a critical and computationally intensive first step of most genomic data analysis pipelines. While significant effort has been dedicated to optimizing the sensitivity and runtime efficiency of this step, few approaches have addressed outsourcing this computation securely to an untrusted party. The few secure solutions that have been proposed either do not scale to whole genome sequencing datasets or are not competitive with the state of the art in read mapping. In this paper, we present BALAUR, a privacy-preserving read mapping algorithm based on locality sensitive hashing and secure kmer voting. BALAUR securely outsources a significant portion of the computation to the public cloud by formulating the alignment task as a voting scheme between encrypted read and reference kmers. Our approach can easily handle typical genome-scale datasets and is highly competitive with non-cryptographic state-of-the-art read aligners in both accuracy and runtime performance on simulated and real read data. Moreover, our approach is significantly faster than state-of-the-art read aligners in long read mapping.
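
The idea of casting alignment as a vote between hashed read and reference kmers can be illustrated with a small sketch. This is not BALAUR's implementation: the salted SHA-256 digest is only a stand-in for its encryption scheme, the locality sensitive hashing stage is omitted, and the function names, the salt, and the toy sequences are all assumptions made for illustration.

```python
import hashlib
from collections import Counter, defaultdict

def hashed_kmers(seq, k, salt):
    """Yield (offset, digest) for every kmer; only the digest would leave the client."""
    for i in range(len(seq) - k + 1):
        yield i, hashlib.sha256(salt + seq[i:i + k].encode()).hexdigest()

def build_index(reference, k, salt):
    """Map each hashed reference kmer to the positions where it occurs."""
    index = defaultdict(list)
    for pos, digest in hashed_kmers(reference, k, salt):
        index[digest].append(pos)
    return index

def vote_for_position(read, index, k, salt):
    """Each matching kmer votes for the read start position it implies."""
    votes = Counter()
    for offset, digest in hashed_kmers(read, k, salt):
        for pos in index.get(digest, ()):
            votes[pos - offset] += 1
    return votes.most_common(1)[0] if votes else None

# Toy data (hypothetical): the read is reference[10:28] with one substitution.
reference = "ACGTTGCATGCCGATAGCTAGGCTTACGATCGATCGGA"
read = reference[10:19] + "T" + reference[20:28]
salt = b"hypothetical-secret"

index = build_index(reference, 5, salt)
best = vote_for_position(read, index, 5, salt)  # (start position, vote count)
```

In this toy setting the kmers that overlap the substitution find no match, while the intact kmers all vote for the same start position, so the mapping tolerates the mismatch without the matching party ever seeing a raw kmer.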


July 7, 2019  |  

Next-generation sequencing: a diagnostic one-stop shop for Hepatitis C?

Before starting chronic hepatitis C treatment, the viral genotype/subtype has to be accurately determined and potentially coupled with drug resistance testing. Due to the high genetic variability of the hepatitis C virus, this can be a demanding task that can potentially be streamlined by viral whole-genome sequencing using next-generation sequencing as demonstrated by an article in this issue of the Journal of Clinical Microbiology by E. Thomson, C. L. C. Ip, A. Badhan, M. T. Christiansen, W. Adamson, et al. (J Clin Microbiol. 54:2455-2469, 2016, http://dx.doi.org/10.1128/JCM.00330-16). Copyright © 2016, American Society for Microbiology. All Rights Reserved.


July 7, 2019  |  

Silicon content of individual cells of Synechococcus from the North Atlantic Ocean

The widely distributed marine cyanobacterium Synechococcus is thought to exert an influence on the marine silicon (Si) cycle through its high cellular Si relative to organic content. There are few measurements of Si in natural populations of Synechococcus, however, and the degree to which Synechococcus from various oligotrophic field sites and depths accumulate the element is unknown. We used synchrotron x-ray fluorescence to measure Si quotas in individual Synechococcus cells collected during three cruises in the western North Atlantic Ocean in the summer and fall, focusing on cells from the surface mixed layer (SML; <10 m) and the deep chlorophyll maximum (DCM). Individual cell quotas varied widely, from 1 to 4700 amol Si cell⁻¹, though the middle 50% of quotas ranged between 17 and 119 amol Si cell⁻¹. Mean station-specific quotas exhibited an even narrower range of 31–72 amol Si cell⁻¹. No significant differences in Si quotas were observed across cruises or among stations, and no effect of ambient silicic acid concentration on quotas was observed within the narrow range of silicic acid concentrations encountered (0.6–1.3 µM). Despite this small range in ambient silicic acid, cells collected from the SML had an average of two-fold more Si than cells collected from the DCM. Differences in Si content with depth may be related to observed differences in the dominant Synechococcus clades between the SML and DCM habitats, determined by petB gene sequencing.


July 7, 2019  |  

Probabilistic viral quasispecies assembly

Viruses are pathogens that cause infectious diseases. The swarm of virions is subject to the host’s immune pressure and possibly antiviral therapy. It may escape this selective pressure and gain selective advantage by acquiring one or more of the following genomic alterations: single-nucleotide variants (SNVs), loss or gain of one or more amino acids, large deletions, for example, due to alternative splicing, or recombination of different strains. Genotypic antiretroviral drug resistance testing is performed via sequencing. Next-generation sequencing (NGS) technologies revolutionized assessing viral genetic diversity experimentally. In viral quasispecies analysis, there are two main goals: the identification of low-frequency variants and haplotype assembly on a whole-genome scale. PacBio performs single-molecule sequencing. This chapter elaborates human haplotyping and its relationship to probabilistic viral haplotype reconstruction methods. Viral quasispecies assembly has the potential to replace the current de facto diversity estimation by SNV calling. With advances in library preparation, increasing sensitivity of sequencing platforms, and more sophisticated models, it might be possible to detect all or most viral strains in a single individual.


July 7, 2019  |  

MICADo – Looking for mutations in targeted PacBio cancer data: an alignment-free method.

Targeted sequencing is commonly used in clinical application of NGS technology since it enables generation of sufficient sequencing depth in the targeted genes of interest and thus ensures the best possible downstream analysis. This notwithstanding, the accurate discovery and annotation of disease-causing mutations remains a challenging problem even in such a favorable context. The difficulty is particularly salient in the case of third-generation sequencing technology, such as PacBio. We present MICADo, a de Bruijn graph-based method, implemented in Python, that makes it possible to distinguish between patient-specific mutations and other alterations in targeted sequencing of a cohort of patients. MICADo analyses NGS reads for each sample within the context of the data of the whole cohort in order to capture the differences between specificities of the sample with respect to the cohort. MICADo is particularly suitable for sequencing data from highly heterogeneous samples, especially when it involves high rates of non-uniform sequencing errors. It was validated on PacBio sequencing datasets from several cohorts of patients. The comparison with two widely used available tools, namely VarScan and GATK, shows that MICADo is more accurate, especially when true mutations have frequencies close to background noise. The source code is available at http://github.com/cbib/MICADo.
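
As a rough illustration of comparing one sample against its cohort in kmer space, the sketch below counts kmers, keeps those seen in the sample but never in the cohort, and links them into a small de Bruijn graph. It is not MICADo's algorithm: the function names, the `min_count` threshold, and the toy reads are assumptions, and real PacBio data would need proper handling of sequencing errors.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every kmer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def sample_specific_kmers(sample_reads, cohort_reads, k, min_count=2):
    """Kmers seen at least min_count times in the sample and never in the cohort."""
    sample = kmer_counts(sample_reads, k)
    cohort = kmer_counts(cohort_reads, k)
    return {km for km, c in sample.items() if c >= min_count and km not in cohort}

def de_bruijn_edges(kmers):
    """Build edges from each kmer's (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph

# Toy data (hypothetical): the sample carries an A>T substitution at index 4.
cohort_reads = ["ACGTACGGA"] * 3
sample_reads = ["ACGTTCGGA"] * 2

specific = sample_specific_kmers(sample_reads, cohort_reads, k=4)
graph = de_bruijn_edges(specific)
```

The sample-specific kmers form a single path through the graph that spans the substitution, which is the kind of signal a cohort-aware method can then inspect as a candidate alteration.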


July 7, 2019  |  

Improve homology search sensitivity of PacBio data by correcting frameshifts.

Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. The long read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and identify gene isoforms with higher accuracy in transcriptomic sequencing. However, PacBio data has a high sequencing error rate, and most of the errors are insertion or deletion errors. During alignment-based homology search, insertion or deletion errors in genes will cause frameshifts and may only lead to marginal alignment scores and short alignments. As a result, it is hard to distinguish true alignments from random alignments, and the ambiguity will incur errors in structural and functional annotation. Existing frameshift correction tools are designed for data with much lower error rates and are not optimized for PacBio data. As an increasing number of groups are using SMRT, there is an urgent need for dedicated homology search tools for PacBio data. In this work, we introduce Frame-Pro, a profile homology search tool for PacBio reads. Our tool corrects sequencing errors and also outputs the profile alignments of the corrected sequences against characterized protein families. We applied our tool to both simulated and real PacBio data. The results showed that our method enables more sensitive homology search, especially for PacBio data sets of low sequencing coverage. In addition, we can correct more errors when compared with a popular error correction tool that does not rely on hybrid sequencing. The source code is freely available at https://sourceforge.net/projects/frame-pro/. Contact: yannisun@msu.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019  |  

Epigenetic mechanisms in microbial members of the human microbiota: current knowledge and perspectives.

The human microbiota and epigenetic processes have both been shown to play a crucial role in health and disease. However, there is extremely scarce information on epigenetic modulation of microbiota members except for a few pathogens. Mainly DNA adenine methylation has been described extensively in modulating the virulence of pathogenic bacteria in particular. It would thus appear likely that such mechanisms are widespread for most bacterial members of the microbiota. This review will present briefly the current knowledge on epigenetic processes in bacteria, give examples of known methylation processes in microbial members of the human microbiota and summarize the knowledge on regulation of host epigenetic processes by the human microbiota.


July 7, 2019  |  

Microbial metagenomics mock scenario-based sample simulation (M3S3).

Shotgun sequencing is increasingly applied in clinical microbiology for unbiased culture-independent diagnosis. While software solutions for metagenomics proliferate, integration of metagenomics in clinical care requires method standardisation and validation. Virtual metagenomics samples could underpin validation by substituting real samples and thus we sought to develop a novel solution for simulation of metagenomics samples based on user-defined clinical scenarios. We designed the Microbial Metagenomics Mock Scenario-based Sample Simulation (M3S3) workflow, which allows users to generate virtual samples from raw reads or assemblies. The M3S3 output is a mock sample in FASTQ or FASTA format. M3S3 was tested by generating virtual samples for ten challenging infectious disease scenarios, involving a background matrix ‘spiked’ in silico with pathogens including mixtures. Replicate samples (seven per scenario) were used to represent different compositional ratios. Virtual samples were analysed using Taxonomer and Kraken db. The ten challenge scenarios were successfully applied, generating 80 samples. For all tested scenarios, the virtual samples showed sequence compositions as predicted from the user input. Spiked pathogen sequences were identified with the majority of the replicates and most exhibited acceptable abundance (deviation between expected and observed abundance of spiked pathogens), with slight differences observed between software tools. Despite demonstrated proof-of-concept, integration of clinical metagenomics in routine microbiology remains a substantial challenge. M3S3 is capable of producing virtual samples on-demand, simulating a spectrum of clinical diagnostic scenarios of varying complexity. The M3S3 tool can therefore support the development and validation of standardised metagenomics applications. Copyright © 2017. Published by Elsevier Ltd.
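
Spiking a background matrix in silico at a chosen compositional ratio can be sketched in a few lines. This is a minimal illustration, not the M3S3 workflow: the function name, its parameters, and the toy read identifiers are assumptions, and reads are plain `(id, sequence)` tuples rather than FASTQ records.

```python
import random

def spike_sample(background, pathogen, total_reads, pathogen_fraction, seed=0):
    """Draw reads with replacement so the pathogen makes up the requested fraction."""
    rng = random.Random(seed)  # fixed seed keeps the mock sample reproducible
    n_pathogen = round(total_reads * pathogen_fraction)
    n_background = total_reads - n_pathogen
    mock = rng.choices(background, k=n_background) + rng.choices(pathogen, k=n_pathogen)
    rng.shuffle(mock)  # interleave so read order carries no information
    return mock

# Toy reads (hypothetical identifiers and sequences).
background = [("bg_%d" % i, "ACGTACGTAC") for i in range(50)]
pathogen = [("path_%d" % i, "TTGACCGGTA") for i in range(5)]

mock = spike_sample(background, pathogen, total_reads=200, pathogen_fraction=0.05)
n_spiked = sum(1 for read_id, _ in mock if read_id.startswith("path_"))
```

Repeating the call with different `pathogen_fraction` values gives replicate samples at different compositional ratios, and comparing `n_spiked / len(mock)` against a classifier's reported abundance gives the expected-versus-observed deviation described above.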


July 7, 2019  |  

RIFRAF: a frame-resolving consensus algorithm.

Protein coding genes can be studied using long-read next generation sequencing. However, high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus of multiple independent sequence reads retains indel errors. To solve this problem, we introduce Reference-Informed Frame-Resolving multiple-Alignment Free template inference algorithm (RIFRAF), a sequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers an accurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model: the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference. The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to the per-base quality scores, giving greater weight to more accurate bases. The reference-to-template component of the model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the best consensus sequence for both objectives. Using Pacific Biosciences SMRT sequences from an HIV-1 env clone, NL4-3, we compare our approach to other consensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is more accurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over 80% of consensus sequences from as few as three reads, whereas the best alternative required twice as many. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly related reference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indels while removing erroneous ones. RIFRAF is implemented in Julia, and source code is publicly available at https://github.com/MurrellGroup/Rifraf.jl. Supplementary data are available at Bioinformatics online.


July 7, 2019  |  

Immunoglobulin gene analysis as a tool for investigating human immune responses.

The human immunoglobulin repertoire is a hugely diverse set of sequences that are formed by processes of gene rearrangement, heavy and light chain gene assortment, class switching and somatic hypermutation. Early B cell development produces diverse IgM and IgD B cell receptors on the B cell surface, resulting in a repertoire that can bind many foreign antigens but which has had self-reactive B cells removed. Later antigen-dependent development processes adjust the antigen affinity of the receptor by somatic hypermutation. The effector mechanism of the antibody is also adjusted, by switching the class of the antibody from IgM to one of seven other classes depending on the required function. There are many instances in human biology where positive and negative selection forces can act to shape the immunoglobulin repertoire and therefore repertoire analysis can provide useful information on infection control, vaccination efficacy, autoimmune diseases, and cancer. It can also be used to identify antigen-specific sequences that may be of use in therapeutics. The juxtaposition of lymphocyte development and numerical evaluation of immune repertoires has resulted in the growth of a new sub-speciality in immunology where immunologists and computer scientists/physicists collaborate to assess immune repertoires and develop models of immune action. © 2018 The Authors. Immunological Reviews Published by John Wiley & Sons Ltd.

