First- and second-generation sequencing technologies have led the way in revolutionizing the field of genomics and beyond, motivating an astonishing number of scientific advances, including enabling a more complete understanding of whole genome sequences and the information encoded therein, a more complete characterization of the methylome and transcriptome and a better understanding of interactions between proteins and DNA. Nevertheless, there are sequencing applications and aspects of genome biology that are presently beyond the reach of current sequencing technologies, leaving fertile ground for additional innovation in this space. In this review, we describe a new generation of single-molecule sequencing technologies (third-generation sequencing) that is emerging to fill this space, with the potential for dramatically longer read lengths, shorter time to result and lower overall cost.
Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or for gaining an understanding of specific data sets. Several computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.
It may be hard to believe, but the idea of sequence assembly is around 40 years old. Consider this pair of quotes from Rodger Staden (Staden 1979): “With modern fast sequencing techniques and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps.” “If the 5′ end of the sequence from one gel reading is the same as the 3′ end of the sequence from another the data is said to overlap. If the overlap is of sufficient length to distinguish it from being a repeat in the sequence the two sequences must be contiguous. The data from the two gel readings can then be joined to form one longer continuous sequence.” Replace “gel reading” with “read” and these sentences would go unnoticed in the introduction of any paper today. Here you can also see the birth of jargon that now pervades the field: overlaps between reads form contigs (contiguous sequences). Just a few months later, Gingeras et al. (1979) described “Computer programs for the assembly of DNA sequences.” It all sounds so modern, until the discussion mentions FORTRAN code stored on magnetic tapes. How, then, can we fill an entire special issue of Genome Research with “new advances” so many years later? To me, this reflects the beauty of the problem—simple enough to be stated in a single paragraph, yet complex enough to sustain a field of research for decades. This dichotomy is common to many famous computational problems; indeed, mathematical formulations of sequence assembly fall into a class of problems known as “NP-hard” that do not admit an easy solution (Medvedev et al. 2007). There is another reason for continued advances in sequence assembly—advances in sequencing technology. As evident from the Staden quotes above, the first assembly methods were …
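Staden's joining rule translates almost directly into code. A toy sketch (ignoring sequencing errors, reverse complements, and the repeat ambiguity the quote warns about; `min_len` plays the role of the "sufficient length" threshold):

```python
def longest_overlap(a: str, b: str, min_len: int = 20) -> int:
    """Length of the longest suffix of `a` matching a prefix of `b`,
    provided it is at least `min_len` long (0 otherwise)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def join(a: str, b: str, min_len: int = 20):
    """Merge two reads into one contig if they overlap sufficiently,
    else return None -- the core of Staden's 1979 joining rule."""
    k = longest_overlap(a, b, min_len)
    return a + b[k:] if k else None
```

For example, with a threshold of 4, the reads `AACCGGTT` and `GGTTACGT` share the 4-base overlap `GGTT` and join into the contig `AACCGGTTACGT`.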
Survey on the use of whole-genome sequencing for infectious diseases surveillance: Rapid expansion of European national capacities, 2015-2016.
Whole-genome sequencing (WGS) has become an essential tool for public health surveillance and molecular epidemiology of infectious diseases and antimicrobial drug resistance. It provides precise geographical delineation of spread and enables incidence monitoring of pathogens at genotype level. Coupled with epidemiological and environmental investigations, it delivers ultimate resolution for tracing sources of epidemic infections. To ascertain the level of implementation of WGS-based typing for national public health surveillance and investigation of prioritized diseases in the European Union (EU)/European Economic Area (EEA), two surveys were conducted in 2015 and 2016. The surveys were designed to determine the national public health reference laboratories’ access to WGS and operational WGS-based typing capacity for national surveillance of selected foodborne pathogens, antimicrobial-resistant pathogens, and vaccine-preventable diseases identified as priorities for European genomic surveillance. Twenty-eight and twenty-nine out of the 30 EU/EEA countries participated in the survey in 2015 and 2016, respectively. National public health reference laboratories in 22 and 25 countries had access to WGS-based typing for public health applications in 2015 and 2016, respectively. Reported reasons for limited or no access were lack of funding, staff, and expertise. Illumina technology was the most frequently used, followed by Ion Torrent technology. Access to bioinformatics expertise and competence for routine WGS data analysis was limited. By mid-2016, half of the EU/EEA countries were using WGS analysis as either a first- or second-line typing method for surveillance of the pathogens and antibiotic resistance issues identified as EU priorities. The sampling frame as well as the bioinformatics analysis varied by pathogen/resistance issue and country.
Core genome multilocus allelic profiling, also called cgMLST, was the most frequently used annotation approach for typing bacterial genomes, suggesting potential bioinformatics pipeline compatibility. Further capacity development for WGS-based typing is ongoing in many countries; once methods are consolidated and harmonized, this should enable pan-EU data exchange for genomic surveillance in the medium term, subject to the development of suitable data-management systems and appropriate data-sharing agreements.
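As an illustration of what cgMLST-based comparison involves: isolates are typed by their allele numbers at a fixed set of core loci, and a common pairwise distance is the count of loci with differing alleles, ignoring loci that failed to be called in either profile. A simplified sketch (the locus names and missing-data convention here are illustrative, not those of any specific pipeline):

```python
def cgmlst_distance(profile_a: dict, profile_b: dict) -> int:
    """Allelic distance between two cgMLST profiles.

    Each profile maps locus name -> allele number, with None marking a
    locus that could not be called. Loci missing or uncalled in either
    profile are excluded from the comparison, as is common practice.
    """
    shared = [
        locus for locus in profile_a
        if locus in profile_b
        and profile_a[locus] is not None
        and profile_b[locus] is not None
    ]
    return sum(profile_a[locus] != profile_b[locus] for locus in shared)
```

Clustering isolates whose pairwise distance falls below a pathogen-specific threshold is then the basic operation behind cgMLST-based outbreak detection.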
DNA sequencing technologies have changed the face of biological research over the last 20 years. From reference genomes to population level resequencing studies, these technologies have made significant contributions to our understanding of plant biology and evolution. As the technologies have increased in power, the breadth and complexity of the questions that can be asked has increased. Along with this, the challenges of managing unprecedented quantities of sequence data are mounting. This chapter describes a few aspects of the journey so far and looks forward to what may lie ahead.
The use of next generation sequencing has the power to expedite the identification of drug resistance determinants and biomarkers and was applied successfully to drug resistance studies in Leishmania. This allowed the identification of modulation in gene expression, gene dosage alterations, changes in chromosome copy numbers and single nucleotide polymorphisms that correlated with resistance in Leishmania strains derived from the laboratory and from the field. An impressive heterogeneity at the population level was also observed, individual clones within populations often differing in both genotypes and phenotypes, hence complicating the elucidation of resistance mechanisms. This review summarizes the most recent highlights that whole genome sequencing brought to our understanding of Leishmania drug resistance and likely new directions.
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, while the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to ‘big science’ survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. Copyright © 2015 Elsevier Ltd. All rights reserved.
Current overview on the study of bacteria in the rhizosphere by modern molecular techniques: a mini–review
The rhizosphere (soil zone influenced by roots) is a complex environment that harbors diverse bacterial populations, which have an important role in biogeochemical cycling of organic matter and mineral nutrients. Nevertheless, our knowledge of the ecology and role of these bacteria in the rhizosphere is very limited, particularly regarding how indigenous bacteria are able to communicate, colonize root environments, and compete along the rhizosphere microsites. In recent decades, the development and improvement of molecular techniques have provided more accurate knowledge of bacteria in their natural environment, refining microbial ecology and generating new questions about the roles and functions of bacteria in the rhizosphere. Recently, advances in soil post-genomic techniques (metagenomics, metaproteomics and metatranscriptomics) are being applied to improve our understanding of the microbial communities at a higher resolution. Moreover, the advantages and limitations of classical and post-genomic techniques must be considered when studying bacteria in the rhizosphere. This review provides an overview of the current knowledge on the study of bacterial communities in the rhizosphere using modern molecular techniques, describing the biases of classical molecular techniques, next generation sequencing platforms, and post-genomic techniques.
FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.
High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms, whereas in silico samples provide exact ground truth. To be of greatest value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim can convert target sequences into in silico reads with the specific error profiles obtained in the characterization step. FASTQSim enables users to assess the quality of NGS datasets: the tool reports read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim also allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with FASTQSim hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate.
Such datasets are valuable in a number of applications, including training assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.
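The characterize-then-simulate workflow can be illustrated with a minimal read generator: sample read positions uniformly from a target sequence and inject substitutions at a fixed per-base rate. This is a toy stand-in for learned profiles of the kind FASTQSim builds, which also cover read-length, quality-score, and indel distributions; the function and parameter names are illustrative:

```python
import random

def simulate_reads(reference: str, n_reads: int, read_len: int,
                   sub_rate: float, seed: int = 0):
    """Draw uniformly positioned reads from `reference` and apply random
    substitutions at `sub_rate` per base. A toy error model: real
    simulators also draw read lengths, indels, and quality scores from
    empirically characterized platform profiles."""
    rng = random.Random(seed)
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, b in enumerate(read):
            if rng.random() < sub_rate:
                # substitute with a different base, never the original
                read[i] = rng.choice([x for x in bases if x != b])
        reads.append("".join(read))
    return reads
```

With `sub_rate=0.0` every simulated read is an exact substring of the reference, which makes the ground truth trivially checkable, exactly the property that makes in silico data useful for benchmarking.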
In recent years, the increasing awareness that somatic mutations and other genetic aberrations drive human malignancies has led us within reach of personalized cancer medicine (PCM). The implementation of PCM is based on the following premises: genetic aberrations exist in human malignancies; a subset of these aberrations drive oncogenesis and tumor biology; these aberrations are actionable (defined as having the potential to affect management recommendations based on diagnostic, prognostic, and/or predictive implications); and there are highly specific anticancer agents available that effectively modulate these targets. This article highlights the technology underlying cancer genomics and examines the early results of genome sequencing and the challenges met in the discovery of new genetic aberrations. Finally, drawing from experiences gained in a feasibility study of somatic mutation genotyping and targeted exome sequencing led by Princess Margaret Hospital-University Health Network and the Ontario Institute for Cancer Research, the processes, challenges, and issues involved in the translation of cancer genomics to the clinic are discussed.
The dawn of next generation sequencing technologies has opened up exciting possibilities for whole genome sequencing of a plethora of organisms. The 2nd and 3rd generation sequencing technologies, based on cloning-free, massively parallel sequencing, have enabled the generation of a deluge of genomic sequences of both prokaryotic and eukaryotic origin in the last seven years. However, whole genome sequencing of bacterial viruses has not kept pace with this revolution, despite the fact that their genomes are orders of magnitude smaller than those of bacteria and other organisms. Sequencing phage genomes poses several challenges: (1) obtaining pure phage genomic material, (2) PCR amplification biases, and (3) the complex nature of their genetic material due to features such as methylated bases and repeats that are inherently difficult to sequence and assemble. Here we describe conclusions drawn from our efforts in sequencing hundreds of bacteriophage genomes from a variety of Gram-positive and Gram-negative bacteria using Sanger, 454, Illumina and PacBio technologies. Based on our experience we propose several general considerations regarding sample quality, the choice of technology and a “blended approach” for generating reliable whole genome sequences of phages.
Molecular approaches for high throughput detection and quantification of genetically modified crops: A review.
As genetically modified (GM) crops gain attention globally, their proper approval and commercialization require accurate and reliable diagnostic methods for transgenic content. These diagnostic techniques fall into two major groups: identification of (1) transgenic DNA and (2) transgenic proteins from GMOs and their products. Conventional methods such as the polymerase chain reaction (PCR) and the enzyme-linked immunosorbent assay (ELISA) have routinely been employed for DNA- and protein-based quantification, respectively. Although these techniques (PCR and ELISA) are convenient and productive, more advanced technologies are needed to allow high-throughput detection and quantification of GM events, as production of ever more complex GMOs increases. Recent approaches such as microarrays, capillary gel electrophoresis, digital PCR, and next generation sequencing are therefore more promising due to their accuracy and precise detection of transgenic content. The present article is a brief comparative study of these detection techniques on the basis of their advent, feasibility, accuracy, and cost effectiveness. Nevertheless, detection of a specific event, contamination by different events, and determination of fusion as well as stacked-gene proteins remain critical issues for these emerging technologies to address in the future.
Long read technologies have revolutionized de novo genome assembly by generating contigs orders of magnitude longer than those of short read assemblies. Although assembly contiguity has increased, assemblies usually do not reconstruct a full chromosome or chromosome arm, leaving the assembly unfinished at the chromosome level. To increase the contiguity of an assembly to the chromosome level, different strategies are used that exploit long range contact information within the genome. We developed a scalable and computationally efficient scaffolding method that can boost assembly contiguity to a large extent using genome-wide chromatin interaction data such as Hi-C. We demonstrate an algorithm that uses Hi-C data for longer-range scaffolding of de novo long read genome assemblies. We tested our methods on the human and goat genome assemblies, and compared our scaffolds with those generated by LACHESIS on various metrics. Our new algorithm, SALSA, produces more accurate scaffolds than the existing state-of-the-art method LACHESIS.
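The ordering step at the heart of Hi-C scaffolding can be caricatured as a greedy matching problem: contig pairs sharing many (length-normalized) Hi-C links are likely adjacent on a chromosome, so join the strongest pairs first while never letting a contig participate in more than two joins or closing a cycle. A simplified sketch, not SALSA's actual algorithm, which also orients contigs and detects mis-joins:

```python
def greedy_scaffold(contigs, links):
    """Greedily chain contigs by Hi-C link strength.

    `contigs`: list of contig names. `links`: dict mapping frozenset
    name pairs to a (length-normalized) Hi-C contact count. Returns the
    list of accepted joins, strongest first, with each contig used in
    at most two joins and no cycles -- a toy ordering step.
    """
    degree = {c: 0 for c in contigs}    # joins used per contig (max 2)
    parent = {c: c for c in contigs}    # chain label (union-find style)

    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c

    joins = []
    for pair, _count in sorted(links.items(), key=lambda kv: -kv[1]):
        a, b = tuple(pair)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            degree[a] += 1
            degree[b] += 1
            parent[find(a)] = find(b)   # merge the two chains
            joins.append((a, b))
    return joins
```

For three contigs where c1–c2 and c2–c3 carry strong link counts and c1–c3 only a weak one, the sketch accepts the two strong joins and rejects the weak one as a cycle, yielding the chain c1–c2–c3.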
Over the last decade, a technological paradigm shift has slashed the cost of DNA sequencing by over five orders of magnitude. Today, the cost of sequencing a human genome is a few thousand dollars, and it continues to fall. Here, we review the most cost-effective platforms for whole-genome sequencing (WGS) as well as emerging technologies that may displace or complement these. We also discuss the practical challenges of generating and analyzing WGS data, and how WGS has unlocked new strategies for discovering genes and variants underlying both rare and common human diseases.
Challenges, solutions, and quality metrics of personal genome assembly in advancing precision medicine.
Even though each of us shares more than 99% of the DNA sequence in our genomes, there are millions of small regions whose sequence or structure differs between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived from the DNA of multiple individuals. As a result, the reference genome is incomplete and may misrepresent the sequence variants of the general population. A more reliable solution is to compare sequences of diseased tissue against the individual's own genome sequence derived from tissue in a normal state. As the price of sequencing a human genome has dropped dramatically to around $1000, documenting the personal genome of every individual has a promising future. However, de novo assembly of individual genomes at an affordable cost remains challenging, and to date only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging “third generation sequencing” technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructure. We recommend that a hybrid assembly combining long and short reads is a promising way to generate good quality human genome assemblies, and we specify parameters for the quality assessment of assembly outcomes. We provide a perspective on the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome.
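One of the standard contiguity metrics used in such quality assessments is N50: the length L such that contigs of length at least L together contain at least half of the total assembled bases. A minimal sketch of its computation:

```python
def n50(contig_lengths):
    """N50 of an assembly: the contig length L such that contigs of
    length >= L cover at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly
```

For contig lengths 40, 30, 20, and 10 (total 100), the running sum reaches the halfway point of 50 at the 30 bp contig, so the N50 is 30. N50 rewards contiguity only; it says nothing about correctness, which is why mis-join detection and reference-based metrics are reported alongside it.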
Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy, and detecting tumors. We believe precision medicine will benefit greatly from bioinformatics solutions, particularly for personal genome assembly.