The problem of ‘missing heritability’ affects both common and rare diseases hindering: discovery, diagnosis, and patient care. The ‘missing heritability’ concept has been mainly associated with common and complex diseases where promising modern technological advances, like genome-wide association studies (GWAS), were unable to uncover the complete genetic mechanism of the disease/trait. Although rare diseases (RDs) have low prevalence individually, collectively they are common. Furthermore, multi-level genetic and phenotypic complexity when combined with the individual rarity of these conditions poses an important challenge in the quest to identify causative genetic changes in RD patients. In recent years, high throughput sequencing has accelerated discovery and diagnosis in RDs. However, despite the several-fold increase (from ~10% using traditional to ~40% using genome-wide genetic testing) in finding genetic causes of these diseases in RD patients, as is the case in common diseases-the majority of RDs are also facing the ‘missing heritability’ problem. This review outlines the key role of high throughput sequencing in uncovering genetics behind RDs, with a particular focus on genome sequencing. We review current advances and challenges of sequencing technologies, bioinformatics approaches, and resources.
A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates.Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA compute requirements and sequencing costs.The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers. © The Author(s) 2019. Published by Oxford University Press.
Personal transcriptomes in which all of an individual’s genetic variants (e.g., single nucleotide variants) and transcript isoforms (transcription start sites, splice sites, and polyA sites) are defined and quantified for full-length transcripts are expected to be important for understanding individual biology and disease, but have not been described previously. To obtain such transcriptomes, we sequenced the lymphoblastoid transcriptomes of three family members (GM12878 and the parents GM12891 and GM12892) by using a Pacific Biosciences long-read approach complemented with Illumina 101-bp sequencing and made the following observations. First, we found that reads representing all splice sites of a transcript are evident for most sufficiently expressed genes =3 kb and often for genes longer than that. Second, we added and quantified previously unidentified splicing isoforms to an existing annotation, thus creating the first personalized annotation to our knowledge. Third, we determined SNVs in a de novo manner and connected them to RNA haplotypes, including HLA haplotypes, thereby assigning single full-length RNA molecules to their transcribed allele, and demonstrated Mendelian inheritance of RNA molecules. Fourth, we show how RNA molecules can be linked to personal variants on a one-by-one basis, which allows us to assess differential allelic expression (DAE) and differential allelic isoforms (DAI) from the phased full-length isoform reads. The DAI method is largely independent of the distance between exon and SNV–in contrast to fragmentation-based methods. Overall, in addition to improving eukaryotic transcriptome annotation, these results describe, to our knowledge, the first large-scale and full-length personal transcriptome.
Forty years ago the advent of Sanger sequencing was revolutionary as it allowed complete genome sequences to be deciphered for the first time. A second revolution came when next-generation sequencing (NGS) technologies appeared, which made genome sequencing much cheaper and faster. However, NGS methods have several drawbacks and pitfalls, most notably their short reads. Recently, third-generation/long-read methods appeared, which can produce genome assemblies of unprecedented quality. Moreover, these technologies can directly detect epigenetic modifications on native DNA and allow whole-transcript sequencing without the need for assembly. This marks the third revolution in sequencing technology. Here we review and compare the various long-read methods. We discuss their applications and their respective strengths and weaknesses and provide future perspectives. Copyright © 2018 Elsevier Ltd. All rights reserved.
Over the past decade, the field of genomics has seen such drastic improvements in sequencing chemistries that high-throughput sequencing, or next-generation sequencing (NGS), is being applied to generate data across many disciplines. NGS instruments are becoming less expensive, faster, and smaller, and therefore are being adopted in an increasing number of laboratories, including clinical laboratories. Thus far, clinical use of NGS has been mostly focused on the human genome, for purposes such as characterizing the molecular basis of cancer or for diagnosing and understanding the basis of rare genetic disorders. There are, however, an increasing number of examples whereby NGS is employed to discover novel pathogens, and these cases provide precedent for the use of NGS in microbial diagnostics. NGS has many advantages over traditional microbial diagnostic methods, such as unbiased rather than pathogen-specific protocols, ability to detect fastidious or non-culturable organisms, and ability to detect co-infections. One of the most impressive advantages of NGS is that it requires little or no prior knowledge of the pathogen, unlike many other diagnostic assays; therefore for pathogen discovery, NGS is very valuable. However, despite these advantages, there are challenges involved in implementing NGS for routine clinical microbiological diagnosis. We discuss these advantages and challenges in the context of recently described research studies.
A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling.
In the last 5 years, the rapid pace of innovations and improvements in sequencing technologies has completely changed the landscape of metagenomic and metagenetic experiments. Therefore, it is critical to benchmark the various methodologies for interrogating the composition of microbial communities, so that we can assess their strengths and limitations. The most common phylogenetic marker for microbial community diversity studies is the 16S ribosomal RNA gene and in the last 10 years the field has moved from sequencing a small number of amplicons and samples to more complex studies where thousands of samples and multiple different gene regions are interrogated.We assembled 2 synthetic communities with an even (EM) and uneven (UM) distribution of archaeal and bacterial strains and species, as metagenomic control material, to assess performance of different experimental strategies. The 2 synthetic communities were used in this study, to highlight the limitations and the advantages of the leading sequencing platforms: MiSeq (Illumina), The Pacific Biosciences RSII, 454 GS-FLX/+ (Roche), and IonTorrent (Life Technologies). We describe an extensive survey based on synthetic communities using 3 experimental designs (fusion primers, universal tailed tag, ligated adaptors) across the 9 hypervariable 16S rDNA regions. We demonstrate that library preparation methodology can affect data interpretation due to different error and chimera rates generated during the procedure. The observed community composition was always biased, to a degree that depended on the platform, sequenced region and primer choice. However, crucially, our analysis suggests that 16S rRNA sequencing is still quantitative, in that relative changes in abundance of taxa between samples can be recovered, despite these biases.We have assessed a range of experimental conditions across several next generation sequencing platforms using the most up-to-date configurations. We propose that the choice of sequencing platform and experimental design needs to be taken into consideration in the early stage of a project by running a small trial consisting of several hypervariable regions to quantify the discriminatory power of each region. We also suggest that the use of a synthetic community as a positive control would be beneficial to identify the potential biases and procedural drawbacks that may lead to data misinterpretation. The results of this study will serve as a guideline for making decisions on which experimental condition and sequencing platform to consider to achieve the best microbial profiling.
Impressive progress has been made in the field of Next Generation Sequencing (NGS). Through advancements in the fields of molecular biology and technical engineering, parallelization of the sequencing reaction has profoundly increased the total number of produced sequence reads per run. Current sequencing platforms allow for a previously unprecedented view into complex mixtures of RNA and DNA samples. NGS is currently evolving into a molecular microscope finding its way into virtually every fields of biomedical research. In this chapter we review the technical background of the different commercially available NGS platforms with respect to template generation and the sequencing reaction and take a small step towards what the upcoming NGS technologies will bring. We close with an overview of different implementations of NGS into biomedical research. This article is part of a Special Issue entitled: From Genome to Function. Copyright © 2014 Elsevier B.V. All rights reserved.
Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study.
High-throughput RNA sequencing (RNA-seq) greatly expands the potential for genomics discoveries, but the wide variety of platforms, protocols and performance capabilitites has created the need for comprehensive reference data. Here we describe the Association of Biomolecular Resource Facilities next-generation sequencing (ABRF-NGS) study on RNA-seq. We carried out replicate experiments across 15 laboratory sites using reference RNA standards to test four protocols (poly-A-selected, ribo-depleted, size-selected and degraded) on five sequencing platforms (Illumina HiSeq, Life Technologies PGM and Proton, Pacific Biosciences RS and Roche 454). The results show high intraplatform (Spearman rank R > 0.86) and inter-platform (R > 0.83) concordance for expression measures across the deep-count platforms, but highly variable efficiency and cost for splice junction and variant detection between all platforms. For intact RNA, gene expression profiles from rRNA-depletion and poly-A enrichment are similar. In addition, rRNA depletion enables effective analysis of degraded RNA samples. This study provides a broad foundation for cross-platform standardization, evaluation and improvement of RNA-seq.
Different next generation sequencing platforms produce different microbial profiles and diversity in cystic fibrosis sputum.
Cystic fibrosis (CF) is an autosomal recessive disease characterized by recurrent lung infections. Studies of the lung microbiome have shown an association between decreasing diversity and progressive disease. 454 pyrosequencing has frequently been used to study the lung microbiome in CF, but will no longer be supported. We sought to identify the benefits and drawbacks of using two state-of-the-art next generation sequencing (NGS) platforms, MiSeq and PacBio RSII, to characterize the CF lung microbiome. Each has its advantages and limitations.Twelve samples of extracted bacterial DNA were sequenced on both MiSeq and PacBio NGS platforms. DNA was amplified for the V4 region of the 16S rRNA gene and libraries were sequenced on the MiSeq sequencing platform, while the full 16S rRNA gene was sequenced on the PacBio RSII sequencing platform. Raw FASTQ files generated by the MiSeq and PacBio platforms were processed in mothur v1.35.1.There was extreme discordance in alpha-diversity of the CF lung microbiome when using the two platforms. Because of its depth of coverage, sequencing of the 16S rRNA V4 gene region using MiSeq allowed for the observation of many more operational taxonomic units (OTUs) and higher Chao1 and Shannon indices than the PacBio RSII. Interestingly, several patients in our cohort had Escherichia, an unusual pathogen in CF. Also, likely because of its coverage of the complete 16S rRNA gene, only PacBio RSII was able to identify Burkholderia, an important CF pathogen.When comparing microbiome diversity in clinical samples from CF patients using 16S sequences, MiSeq and PacBio NGS platforms may generate different results in microbial community composition and structure. It may be necessary to use different platforms when trying to correctly identify dominant pathogens versus measuring alpha-diversity estimates, and it would be important to use the same platform for comparisons to minimize errors in interpretation. Copyright © 2016 Elsevier B.V. All rights reserved.