Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or for gaining an understanding of specific data sets. Several computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.
FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.
High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step. FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate.
Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.
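The two-step idea described above (characterize a dataset, then generate in silico reads from the measured profile) can be illustrated with a minimal Python sketch. This is not FASTQSim's code or interface: the function names `parse_fastq`, `characterize` and `simulate_reads` are hypothetical, qualities are assumed to be Phred+33 encoded, and the error model is reduced to a single uniform substitution rate, with no indels.

```python
import random
from collections import Counter


def parse_fastq(lines):
    """Yield (sequence, quality string) pairs from 4-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def characterize(records):
    """Return (read-length histogram, mean Phred quality).

    Assumes Phred+33 quality encoding (an assumption of this sketch).
    """
    lengths = Counter()
    qual_sum = 0
    qual_n = 0
    for seq, qual in records:
        lengths[len(seq)] += 1
        qual_sum += sum(ord(c) - 33 for c in qual)
        qual_n += len(qual)
    mean_q = qual_sum / qual_n if qual_n else 0.0
    return lengths, mean_q


def simulate_reads(reference, lengths, n_reads, sub_rate, rng):
    """Sample reads from `reference`.

    Read lengths are drawn from the empirical histogram `lengths`;
    each base is substituted with probability `sub_rate` (simplified,
    position-independent error model; indels are not modelled).
    """
    sizes = list(lengths.keys())
    weights = list(lengths.values())
    reads = []
    for _ in range(n_reads):
        length = min(rng.choices(sizes, weights=weights)[0], len(reference))
        start = rng.randrange(len(reference) - length + 1)
        bases = list(reference[start:start + length])
        for i, base in enumerate(bases):
            if rng.random() < sub_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads
```

For example, `characterize(parse_fastq(open_file))` yields a length histogram that can be passed straight to `simulate_reads` together with a target sequence and a seeded `random.Random` instance, so that simulated datasets are reproducible.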
LongISLND is a software package designed to simulate sequencing data according to the characteristics of third generation, single-molecule sequencing technologies. The general software architecture is easily extendable, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. We demonstrate its utility by downstream processing with consensus building and variant calling. LongISLND is implemented in Java and available at http://bioinform.github.io/longislnd. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.
Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.
Shotgun sequencing is increasingly applied in clinical microbiology for unbiased culture-independent diagnosis. While software solutions for metagenomics proliferate, integration of metagenomics in clinical care requires method standardisation and validation. Virtual metagenomics samples could underpin validation by substituting real samples and thus we sought to develop a novel solution for simulation of metagenomics samples based on user-defined clinical scenarios. We designed the Microbial Metagenomics Mock Scenario-based Sample Simulation (M3S3) workflow, which allows users to generate virtual samples from raw reads or assemblies. The M3S3 output is a mock sample in FASTQ or FASTA format. M3S3 was tested by generating virtual samples for ten challenging infectious disease scenarios, involving a background matrix ‘spiked’ in silico with pathogens including mixtures. Replicate samples (seven per scenario) were used to represent different compositional ratios. Virtual samples were analysed using Taxonomer and Kraken db. The ten challenge scenarios were successfully applied, generating 80 samples. For all tested scenarios, the virtual samples showed sequence compositions as predicted from the user input. Spiked pathogen sequences were identified in the majority of the replicates and most exhibited acceptable abundance (deviation between expected and observed abundance of spiked pathogens), with slight differences observed between software tools. Despite demonstrated proof-of-concept, integration of clinical metagenomics in routine microbiology remains a substantial challenge. M3S3 is capable of producing virtual samples on-demand, simulating a spectrum of clinical diagnostic scenarios of varying complexity. The M3S3 tool can therefore support the development and validation of standardised metagenomics applications. Copyright © 2017. Published by Elsevier Ltd.
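The notion of spiking a background matrix in silico at a chosen compositional ratio, and then comparing expected with observed abundance, can be sketched in a few lines of Python. This is an illustration only and does not reproduce the M3S3 workflow: `spike_sample` and `observed_fraction` are hypothetical names, reads are sampled with replacement, and each read carries a ground-truth label that a real classifier would of course not see.

```python
import random


def spike_sample(background, pathogen, total_reads, pathogen_fraction, rng):
    """Build a labelled mock sample with a fixed pathogen read fraction.

    Reads are drawn with replacement from each pool (an assumption of
    this sketch), tagged with their true origin, and then shuffled.
    """
    n_pathogen = round(total_reads * pathogen_fraction)
    n_background = total_reads - n_pathogen
    sample = [("pathogen", r) for r in rng.choices(pathogen, k=n_pathogen)]
    sample += [("background", r) for r in rng.choices(background, k=n_background)]
    rng.shuffle(sample)
    return sample


def observed_fraction(sample):
    """Fraction of reads in the mock sample labelled as pathogen."""
    return sum(1 for label, _ in sample if label == "pathogen") / len(sample)
```

Calling `spike_sample` repeatedly with different values of `pathogen_fraction` gives a series of replicates at different compositional ratios; the deviation between `pathogen_fraction` and the abundance reported by a taxonomic classifier can then be tabulated per replicate.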