In 2012, NIST convened the Genome in a Bottle Consortium to develop the metrology infrastructure needed to enable confidence in human whole genome variant calls.
The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to support this effort.
Purpose: Clinical laboratories, research laboratories, and technology developers all need DNA samples with reliably known genotypes in order to help validate and improve their methods. The Genome in a Bottle Consortium (genomeinabottle.org) has been developing Reference Materials with high-accuracy whole-genome sequences to support these efforts.

Methodology: Our pilot reference material is based on Coriell sample NA12878 and was released in May 2015 as NIST RM 8398 (tinyurl.com/giabpilot). To minimize bias and improve accuracy, 11 whole-genome and 3 exome data sets produced using 5 different technologies were integrated using a systematic arbitration method. The Genome in a Bottle Analysis Group is adapting these methods and developing new methods to characterize 2 families from the Personal Genome Project, one Asian and one Ashkenazi Jewish, which are consented for public release of sequencing and phenotype data. We have generated a larger and even more diverse data set on these samples, including high-depth Illumina paired-end and mate-pair, Complete Genomics, and Ion Torrent short-read data, as well as Moleculo, 10X, Oxford Nanopore, PacBio, and BioNano Genomics long-read data. We are analyzing these data to provide an accurate assessment of not only small variants but also large structural variants (SVs), in both “easy” regions of the genome and in some “hard” repetitive regions. We have also made all of the input data sources publicly available for download, analysis, and publication.

Results: Our arbitration method produced a reference data set of 2,787,291 single nucleotide variants (SNVs), 365,135 indels, 2,744 SVs, and 2.2 billion homozygous reference calls for our pilot genome. We found that our call set is highly sensitive and specific in comparison to independent reference data sets. We have also generated preliminary assemblies and structural variant calls for the next 2 trios from long-read data and are currently integrating and validating these.

Discussion: We combined the strengths of each of our input datasets to develop a comprehensive and accurate benchmark call set. In the short time it has been available, over 20 published or submitted papers have used our data. Many challenges exist in comparing to our benchmark calls, so we have worked with the Global Alliance for Genomics and Health to develop standardized methods, performance metrics, and software to assist in its use. Zook et al., Nat Biotechnol. 2014.
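The arbitration method itself is elaborate; as a highly simplified illustration of the underlying idea (integrating multiple callsets and keeping variants with concordant support), a majority-vote sketch might look like the following. The function name and the toy callsets are invented for illustration; the real GIAB integration also arbitrates using genotypes and dataset-specific evidence.

```python
from collections import Counter

def consensus_calls(callsets, min_support=2):
    """Keep variants seen in at least `min_support` input callsets.
    A variant is a (chrom, pos, ref, alt) tuple; this is a toy
    stand-in for GIAB's arbitration, which also weighs genotypes
    and dataset-specific evidence."""
    counts = Counter(v for calls in callsets for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}

# Three toy callsets over the same sample (invented coordinates).
a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
b = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")}
c = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}

consensus = consensus_calls([a, b, c], min_support=2)
```

Raising `min_support` trades sensitivity for specificity, which is the central tension any such integration must arbitrate.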
Structural variant calling combining Illumina and low-coverage PacBio

Detection of large genomic variation (structural variants) has proven challenging using short-read methods. Long-read approaches that can span these large events promise to dramatically expand the ability to accurately call structural variants. Although sequencing with Pacific Biosciences (PacBio) long-read technology has become increasingly high throughput, generating high coverage with the technology can still be limiting, and investigators often would like to know what PacBio coverages are adequate to call structural variants. Here, we present a method to identify a substantially higher fraction of structural variants in the human genome using low-coverage PacBio data through multiple strategies for ensembling data types and algorithms. Algorithmically, we combine three structural variant callers: PBHoney by Adam English, Sniffles by Fritz Sedlazeck, and Parliament by Adam English (which we have modified to improve speed). Parliament itself uses a combination of PacBio and Illumina data with a number of short-read callers (BreakDancer, Pindel, CREST, CNVnator, Delly, and Lumpy). We show that the outputs of these three programs are largely complementary, with each able to uniquely access different sets of structural variants at different coverages. Combining them can more than double the recall of true structural variants from a truth set relative to sequencing with Illumina alone, with substantial improvements even at low PacBio coverages (3x–7x). This allows us to present, for the first time, cost-benefit tradeoffs to investigators about how much PacBio sequencing will yield what improvements in SV calling. This work also builds upon the foundational work of Genome in a Bottle, led by Justin Zook, in establishing a truth set for structural variants in the recently released Ashkenazi Jewish trio data. This work demonstrates the power of this benchmark set, one of the first of its kind for structural variation data, to help understand and refine the accuracy of calling structural variants with a number of approaches.
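As a rough illustration of the kind of ensembling described above, SV calls from different callers are commonly merged by allowing a breakpoint tolerance. The tolerance value, tuple layout, and function names below are assumptions for the sketch, not the actual PBHoney/Sniffles/Parliament merging logic.

```python
def same_sv(x, y, tol=500):
    """Treat two SV calls as the same event if they share chromosome
    and type and both breakpoints agree within `tol` bp. Calls are
    (chrom, start, end, svtype) tuples; 500 bp is an arbitrary
    illustrative tolerance."""
    return (x[0] == y[0] and x[3] == y[3]
            and abs(x[1] - y[1]) <= tol
            and abs(x[2] - y[2]) <= tol)

def union_calls(callsets, tol=500):
    """Union of SV calls across callers, de-duplicating matches."""
    merged = []
    for calls in callsets:
        for sv in calls:
            if not any(same_sv(sv, m, tol) for m in merged):
                merged.append(sv)
    return merged

# Toy outputs from three callers (coordinates invented): the two
# DEL calls on chr1 are merged into one event.
caller_a = [("chr1", 1000, 3000, "DEL")]
caller_b = [("chr1", 1150, 3100, "DEL"), ("chr2", 500, 900, "INS")]
caller_c = [("chr1", 5000, 5600, "DUP")]

merged = union_calls([caller_a, caller_b, caller_c])
```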
Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvement in library construction enables average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single molecule reads. The resulting reads have the accuracy comparable to short read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG0002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the CANU assembly algorithm, generating draft assemblies of very high contiguity equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clone 6 and Clone 47 to identify single nucleotide polymorphisms (SNPs) and structural variants (SVs) that are specific to each of the three samples.
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12,745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and contain 9,641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
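The Mendelian genotype error rate cited above can be computed directly from trio genotypes. The sketch below, with invented helper names and allele-pair genotypes, shows the consistency check in its simplest form.

```python
def mendelian_consistent(child, father, mother):
    """True if the child's diploid genotype can be explained by one
    allele inherited from each parent. Genotypes are allele pairs,
    e.g. (0, 1) for a heterozygous variant."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def mendelian_error_rate(trio_genotypes):
    """Fraction of sites whose (child, father, mother) genotypes
    violate Mendelian inheritance."""
    errors = sum(1 for c, f, m in trio_genotypes
                 if not mendelian_consistent(c, f, m))
    return errors / len(trio_genotypes)

# Two toy sites: the first is consistent, the second is not
# (a 1/1 child cannot come from a 0/0 father).
sites = [((0, 1), (0, 0), (1, 1)),
         ((1, 1), (0, 0), (0, 1))]
```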
Evaluation of the performance of copy number variant prediction tools for the detection of deletions from whole genome sequencing data.
Whole genome sequencing (WGS) has increased in popularity and decreased in cost over the past decade, rendering this approach a viable and sensitive method for variant detection. In addition to its utility for single nucleotide variant detection, WGS data have the potential to detect Copy Number Variants (CNVs) at fine resolution. Many CNV detection software packages have been developed, exploiting four main types of data: read pair, split read, read depth, and assembly based methods. The aim of this study was to evaluate the efficiency of each of these main approaches in detecting germline deletions.

WGS data and high-confidence deletion calls for the individual NA12878 from the Genome in a Bottle consortium were the benchmark dataset. The performance of BreakDancer, CNVnator, Delly, FermiKit, and Pindel was assessed by comparing the accuracy and sensitivity of each software package in detecting deletions exceeding 1 kb.

There was considerable variability in the outputs of the different WGS CNV detection programs. The best performance was seen from BreakDancer and Delly, with 92.6% and 96.7% sensitivity, respectively, and 34.5% and 68.5% false discovery rate (FDR), respectively. In comparison, Pindel, CNVnator, and FermiKit were less effective, with sensitivities of 69.1%, 66.0%, and 15.8%, respectively, and FDRs of 91.3%, 69.0%, and 31.7%, respectively. Concordance across software packages was poor, with only 27 of the total 612 benchmark deletions identified by all five methodologies.

The WGS-based CNV detection tools evaluated show disparate performance in identifying deletions ≥1 kb, particularly those utilising different input data characteristics. Software that exploits read pair based data had the highest sensitivity, namely BreakDancer and Delly. BreakDancer also had the second lowest false discovery rate. Therefore, in this analysis, read pair methods (BreakDancer in particular) were the best performing approaches for the identification of deletions ≥1 kb, balancing accuracy and sensitivity. There is potential for improvement in the detection algorithms, particularly for reducing FDR. This analysis has validated the utility of WGS-based CNV detection software to reliably identify deletions, and these findings will be of use when choosing appropriate software for deletion detection, in both research and diagnostic medicine.
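Sensitivity and FDR of the kind reported above are typically computed by matching predicted deletions to benchmark deletions. A minimal sketch, assuming a 50% reciprocal-overlap matching rule (a common convention, not necessarily the criterion used in this study):

```python
def reciprocal_overlap(a, b, frac=0.5):
    """True if intervals a and b, as (chrom, start, end), overlap by
    at least `frac` of the length of each."""
    (ca, sa, ea), (cb, sb, eb) = a, b
    if ca != cb:
        return False
    ov = min(ea, eb) - max(sa, sb)
    return ov > 0 and ov >= frac * (ea - sa) and ov >= frac * (eb - sb)

def evaluate(truth, calls, frac=0.5):
    """Sensitivity and false discovery rate of `calls` vs `truth`."""
    tp = sum(any(reciprocal_overlap(c, t, frac) for t in truth)
             for c in calls)
    found = sum(any(reciprocal_overlap(t, c, frac) for c in calls)
                for t in truth)
    sensitivity = found / len(truth)
    fdr = (len(calls) - tp) / len(calls)
    return sensitivity, fdr

# One benchmark deletion, one matching call, one spurious call
# (coordinates invented for illustration).
truth = [("1", 100, 2100)]
calls = [("1", 150, 2000), ("1", 9000, 9500)]
```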
Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The dataset was generated using the Pacific Biosciences Sequel System. The son and each parent were sequenced to average coverages of 60× and 30×, respectively, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both GRCh37 and GRCh38 are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/). The GRCh38-aligned read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.
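The N50 subread length quoted above is defined as the length L such that subreads of length ≥ L account for at least half of all sequenced bases; a minimal computation (toy lengths, not real subread data):

```python
def n50(lengths):
    """N50: the length L such that reads of length >= L account
    for at least half of the total sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy read lengths: the two longest reads (8 + 8 = 16) already
# cover half of the 32 total bases, so N50 is 8.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```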
The 19th Annual Advances in Genome Biology and Technology (AGBT) meeting returned to Marco Island, Florida, and was held in the renovated venue from 27 February to 2 March 2019. The meeting showcased a variety of new technologies, both in the wet lab and in bioinformatics. This year’s themes included single-cell technology and applications, spatially resolved gene expression measurements, new sequencing platforms, genome assembly and variation, and long and linked reads.
Reproducible integration of multiple sequencing datasets to form high-confidence SNP, indel, and reference calls for five human genome reference materials
Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent, with few restrictions on availability of samples and data, is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and to elucidate strengths and weaknesses of a method.
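Stratifying performance metrics by variant type, as the GA4GH benchmarking tools do, can be illustrated with a toy sketch. Exact-match comparison and the `classify` heuristic below are simplifications invented for illustration; real tools such as hap.py normalize variant representations and compare genotypes.

```python
def classify(ref, alt):
    """Crude variant-type heuristic: single-base ref and alt -> SNP."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"

def stratified_metrics(truth, query):
    """Exact-match precision/recall per variant type. Variants are
    (chrom, pos, ref, alt) tuples; None means undefined."""
    metrics = {}
    for vtype in ("SNP", "INDEL"):
        t = {v for v in truth if classify(v[2], v[3]) == vtype}
        q = {v for v in query if classify(v[2], v[3]) == vtype}
        tp = len(t & q)
        metrics[vtype] = {"recall": tp / len(t) if t else None,
                          "precision": tp / len(q) if q else None}
    return metrics

# Toy callsets: the SNP is found, the deletion is missed, and one
# extra SNP is a false positive.
truth = {("1", 100, "A", "G"), ("1", 200, "AT", "A")}
query = {("1", 100, "A", "G"), ("1", 300, "C", "T")}
```

Stratifying this way makes a method's weaknesses visible: a callset can have excellent SNP recall while missing most indels.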
There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery – including DNA-sequencing technologies and analysis algorithms – need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized that will allow its targeting with much greater therapeutic precision.
DNA sequencing continues to evolve quickly even after more than 30 years. Many new platforms have appeared suddenly, and formerly established systems have vanished in almost the same manner. Since the establishment of next-generation sequencing devices, this progress has gained momentum due to the continually growing demand for higher throughput, lower costs, and better data quality. As a consequence of this rapid development, standardized procedures and data formats as well as comprehensive quality management considerations are still scarce. Here, we list and summarize current standardization efforts and quality management initiatives from companies, organizations, and societies in the form of published studies and ongoing projects. These comprise, on the one hand, quality documentation such as technical notes, accreditation checklists, and guidelines for validation of sequencing workflows. On the other hand, general standard proposals and quality metrics are being developed and applied to the sequencing workflow steps, with the main focus on upstream processes. Finally, certain standard developments for downstream pipeline data handling, processing, and storage are discussed in brief. These standardization approaches represent a first basis for continuing work aimed at prospectively implementing next-generation sequencing in important areas such as clinical diagnostics, where reliable results and fast processing are crucial. Additionally, these efforts will exert a decisive influence on the traceability and reproducibility of sequence data.
Several recent benchmarking efforts provide reference datasets and samples to improve genome sequencing and calling of germline and somatic mutations.
Visualizing read alignments is the most effective way to validate candidate structural variants (SVs) with existing data. We present svviz, a sequencing read visualizer for SVs that sorts and displays only reads relevant to a candidate SV. svviz works by searching the input BAM(s) for potentially relevant reads, realigning them against the inferred sequence of the putative variant allele as well as the reference allele, and identifying reads that match one allele better than the other. Separate views of the two alleles are then displayed in a scrollable web-browser view, enabling a more intuitive visualization of each allele compared with the single reference-genome-based view common to most current read browsers. The browser view facilitates examining the evidence for or against a putative variant, estimating zygosity, visualizing affected genomic annotations, and manually refining breakpoints. svviz supports data from most modern sequencing platforms. svviz is implemented in Python and freely available from http://svviz.github.io/.
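The core idea of svviz, assigning each read to the allele it matches better, can be sketched with a crude similarity score in place of true realignment; `best_allele`, the margin threshold, and the toy allele sequences are illustrative inventions, not svviz's actual scoring.

```python
from difflib import SequenceMatcher

def best_allele(read, ref_allele, alt_allele, min_margin=0.05):
    """Assign a read to whichever allele sequence it resembles more;
    reads scoring within `min_margin` of both are ambiguous. svviz
    uses true realignment rather than this similarity ratio."""
    r = SequenceMatcher(None, read, ref_allele).ratio()
    a = SequenceMatcher(None, read, alt_allele).ratio()
    if abs(r - a) < min_margin:
        return "ambiguous"
    return "ref" if r > a else "alt"

# Toy 8 bp deletion: the alt allele lacks the "CCCCGGGG" segment.
ref_allele = "AAAACCCCGGGGTTTT"
alt_allele = "AAAATTTT"
```

A read spanning the deletion junction (`"AAAATTTT"`) scores higher against the alt allele and so supports the variant.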
The impact of structural variants (SVs) on a variety of organisms and diseases such as cancer has become increasingly evident. Methods for SV detection when studying genomic differences across cells, individuals, or populations are being actively developed. Currently, just a few methods are available to compare different SV callsets, and no specialized methods are available to annotate SVs in a way that accounts for the unique characteristics of these variant types. Here, we introduce SURVIVOR_ant, a tool that compares types and breakpoints for candidate SVs from different callsets and enables fast comparison of SVs to genomic features such as genes and repetitive regions, as well as to previously established SV datasets such as those from the 1000 Genomes Project. As proof of concept, we compared 16 SV callsets generated by different SV calling methods on a single genome, the Genome in a Bottle sample HG002 (Ashkenazi son), and annotated the SVs with gene annotations, 1000 Genomes Project SV calls, and four different types of repetitive regions. Computation time to annotate 134,528 SVs with 33,954 annotations was 22 seconds on a laptop.
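At its core, the annotation task described above is interval overlap between SVs and genomic features. A naive sketch (the tuples and names are invented; SURVIVOR_ant itself uses far faster indexed lookups to reach seconds-scale runtimes):

```python
def annotate(svs, features):
    """Label each SV with every feature interval it overlaps
    (half-open coordinates). Naive O(n*m) scan over all pairs."""
    out = {}
    for chrom, start, end, name in svs:
        out[name] = [fname for fchrom, fstart, fend, fname in features
                     if fchrom == chrom and start < fend and fstart < end]
    return out

# Invented SVs and features for illustration.
svs = [("chr1", 100, 500, "del1"), ("chr2", 50, 80, "ins1")]
features = [("chr1", 400, 900, "GENE_A"),
            ("chr1", 1000, 2000, "ALU"),
            ("chr2", 10, 60, "GENE_B")]
```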