Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing, and analyzing the large amount of NGS data. Compression tools can reduce both the physical storage needed to hold large amounts of genomic data and the bandwidth used to transfer it. Recently, DNA sequence compression has gained much attention among researchers.

In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of properties unique to DNA sequences to improve the compression rate, and usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that, in terms of compression ratio, MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.

Currently, MZPAQ's strengths are its higher compression ratio and its compatibility with all major sequencing platforms. MZPAQ is more suitable when the size of the compressed data is crucial, such as for long-term storage and data transfer. More effort will be made in the future to target other aspects such as compression speed and memory utilization.
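
As a rough illustration of the kind of DNA-specific property that such techniques can exploit, the sketch below packs a nucleotide string into 2 bits per base, which is possible because the alphabet has only four symbols. This is not MZPAQ's method: the names `pack`, `unpack`, and `compression_ratio` are hypothetical, and the ratio is assumed here to mean original size divided by compressed size. Real NGS files also contain other symbols such as N, read identifiers, and quality scores, all of which this sketch ignores.

```python
# Hypothetical 2-bit code for the four nucleotides (illustration only,
# not the encoding used by MZPAQ or any specific tool).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack reads from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from data produced by pack."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed definition: original size divided by compressed size."""
    return original_size / compressed_size


if __name__ == "__main__":
    seq = "ACGTACGTAC"            # 10 bases, 10 bytes as plain ASCII text
    packed = pack(seq)            # 3 bytes after packing
    assert unpack(packed, len(seq)) == seq
    print(len(seq), len(packed), compression_ratio(len(seq), len(packed)))
```

The decoder needs the original length `n` because the last byte may be padded. Fixed 2-bit packing caps the ratio at about 4 against one-byte-per-base text, so it only illustrates why a small alphabet helps; it says nothing about the ratios reported for MZPAQ.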
Journal: Source Code for Biology and Medicine