NIST Archives - PacBio

June 1, 2021 |

Genome in a Bottle: So you’ve sequenced a genome, how well did you do?

In 2012, NIST convened the Genome in a Bottle Consortium to develop the metrology infrastructure needed to enable confidence in human whole genome variant calls.

June 1, 2021 |

PBHoney: Detecting SVs with long-read sequencing

2015 SMRT Informatics Developers Conference Presentation Slides: Adam English, from the Human Genome Sequencing Center at Baylor College of Medicine, presents on the structural variation tools being developed at Baylor.

June 1, 2021 |

Introduction to SMRT informatics developers conference

2015 SMRT Informatics Developers Conference Presentation Slides: Kevin Corcoran of PacBio provided a brief review of community involvement in the development of analysis tools and showed a preview of upcoming sample preparation, chemistry and informatics improvements.

June 1, 2021 |

Reference materials for clinical applications of human genome sequencing

The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data n

June 1, 2021 |

Genome in a Bottle: You’ve sequenced. How well did you do?

Purpose: Clinical laboratories, research laboratories and technology developers all need DNA samples with reliably known genotypes in order to help validate and improve their methods. The Genome in a Bottle Consortium (genomeinabottle.org) has been developing Reference Materials with high-accuracy whole genome sequences to support these efforts.

Methodology: Our pilot reference material is based on Coriell sample NA12878 and was released in May 2015 as NIST RM 8398 (tinyurl.com/giabpilot). To minimize bias and improve accuracy, 11 whole-genome and 3 exome data sets produced using 5 different technologies were integrated using a systematic arbitration method [1]. The Genome in a Bottle Analysis Group is adapting these methods and developing new methods to characterize 2 families, one Asian and one Ashkenazi Jewish from the Personal Genome Project, which are consented for public release of sequencing and phenotype data. We have generated a larger and even more diverse data set on these samples, including high-depth Illumina paired-end and mate-pair, Complete Genomics, and Ion Torrent short-read data, as well as Moleculo, 10X, Oxford Nanopore, PacBio, and BioNano Genomics long-read data. We are analyzing these data to provide an accurate assessment of not just small variants but also large structural variants (SVs) in both “easy” regions of the genome and in some “hard” repetitive regions. We have also made all of the input data sources publicly available for download, analysis, and publication.

Results: Our arbitration method produced a reference data set of 2,787,291 single nucleotide variants (SNVs), 365,135 indels, 2,744 SVs, and 2.2 billion homozygous reference calls for our pilot genome. We found that our call set is highly sensitive and specific in comparison to independent reference data sets. We have also generated preliminary assemblies and structural variant calls for the next 2 trios from long read data and are currently integrating and validating these.

Discussion: We combined the strengths of each of our input datasets to develop a comprehensive and accurate benchmark call set. In the short time it has been available, over 20 published or submitted papers have used our data. Many challenges exist in comparing to our benchmark calls, and thus we have worked with the Global Alliance for Genomics and Health to develop standardized methods, performance metrics, and software to assist in its use.

[1] Zook et al., Nat Biotech. 2014.
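To make "sensitive and specific" concrete, the following is a minimal sketch of how a query call set might be scored against a benchmark call set. It assumes variants can be compared by exact match on chromosome, position, and alleles, which is a simplification: the abstract notes that comparing to the benchmark is challenging, and real comparisons must also handle differing variant representations and restrict scoring to confidently characterized regions. The function name, the `Variant` tuple layout, and the toy variants are hypothetical and are not taken from the Genome in a Bottle software.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Performance(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def compare_to_benchmark(query: Set[Variant], benchmark: Set[Variant]) -> Performance:
    """Score a query call set against a benchmark by exact variant match."""
    tp = len(query & benchmark)   # called and present in the benchmark
    fp = len(query - benchmark)   # called but absent from the benchmark
    fn = len(benchmark - query)   # in the benchmark but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return Performance(tp, fp, fn, precision, recall)


# Toy data, invented purely for illustration.
benchmark = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 90, "T", "C"),
}
query = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr3", 7, "A", "T"),
}

result = compare_to_benchmark(query, benchmark)
print(result)
```

Recall here corresponds to sensitivity. Precision is reported rather than specificity because the number of true negatives in a genome-wide comparison is dominated by homozygous reference positions.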

June 1, 2021 |

Phased human genome assemblies with Single Molecule, Real-Time Sequencing

In recent years, human genomic research has focused on comparing short-read data sets to a single human reference genome. However, it is becoming increasingly clear that significant structural variations present in individual human genomes are missed or ignored by this approach. Additionally, remapping short-read data limits the phasing of variation among individual chromosomes. This reduces the newly sequenced genome to a table of single nucleotide polymorphisms (SNPs) with little to no information as to the co-linearity (phasing) of these variants, resulting in a “mosaic” reference representing neither of the parental chromosomes. The variation between the homologous chromosomes is lost in this representation, including allelic variations, structural variations, or even genes present in only one chromosome, leading to lost information regarding allele-specific gene expression and function. To address these limitations, we have made significant progress integrating haplotype information directly into the genome assembly process with long reads. The FALCON-Unzip algorithm leverages a string graph assembly approach to facilitate identification and separation of heterozygosity during the assembly process to produce a highly contiguous assembly with phased haplotypes representing the genome in its diploid state. The outputs of the assembler are pairs of sequences (haplotigs) containing the allelic differences, including SNPs and structural variations, present in the two sets of chromosomes. Our de novo diploid assembler was developed, tested, and carefully validated using inbred reference model organisms and their F1 progeny, which allowed us to ascertain the accuracy and concordance of haplotigs relative to the two inbred parental assemblies. Examination of the results confirmed that our haplotype-resolved assemblies are “Gold Level” reference genomes having a quality similar to that of Sanger-sequencing, BAC-based assembly approaches.
We further sequenced and assembled two well-characterized human samples into their respective phased diploid genomes with gap-free contig N50 sizes greater than 23 Mb and haplotig N50 sizes greater than 380 kb. Results of these assemblies and a comparison between the haplotype sets are presented.
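For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the standard definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and the example lengths are illustrative and do not come from the assemblies described.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0


# Invented contig lengths, in arbitrary units.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # prints 70
```

In this example the total is 290; the two longest contigs sum to 150, which is the first point where at least half of the total is covered, so the N50 is 70.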

June 1, 2021 |

Single molecule high-fidelity (HiFi) Sequencing with >10 kb libraries

Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvements in library construction enable average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single molecule reads. The resulting reads have accuracy comparable to short-read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the Canu assembly algorithm, generating draft assemblies of very high contiguity, equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clone 6 and Clone 47 to identify single nucleotide polymorphisms (SNPs) and structural variants (SVs) that are specific to each of the three samples.
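Read accuracy figures such as "greater than 99%" are often restated on the Phred scale, where Q = -10 · log10(error rate). The sketch below shows that conversion, assuming that sequence identity can be treated as one minus the per-base error rate. The function name is hypothetical and is not part of any PacBio tool.

```python
import math


def identity_to_phred(identity: float) -> float:
    """Convert a fractional sequence identity (e.g. 0.99) to a Phred-scaled quality."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in the range [0, 1)")
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)


# 99% identity corresponds to roughly Q20, 99.9% to roughly Q30.
for identity in (0.99, 0.999):
    print(f"{identity:.3f} -> Q{identity_to_phred(identity):.1f}")
```

Under this assumption, each additional "nine" of identity adds about 10 to the Phred score.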

February 5, 2021 |

Podcast: Going beyond the $1,000 genome with Mark Gerstein

Mark Gerstein is the co-director of the Yale Computational Biology and Bioinformatics program where he focuses on better annotation of the human genome and better ways to mine big genomics…

February 5, 2021 |

Podcast: Marc Salit discusses creating the foundation of genomics

Marc Salit is the leader of the Genome Scale Measurement Group at the National Institute of Standards and Technology or NIST. In this Mendelspod podcast, he explains how NIST played…

February 5, 2021 |

Podcast: The 9 billion people problem – Rod Wing on plant genomics

By 2050, there will be 9 billion people on the planet. What will they eat? This is the question that led Rod Wing, Director of the Arizona Genomics Institute, into…

February 5, 2021 |

Podcast: The goal is de novo assembly in the clinic, says Jim Lupski, Baylor

Jim Lupski is a professor at Baylor College of Medicine where he’s on the frontline of incorporating genomic research into everyday clinical practice. The story begins with Jim’s own genome,…

February 5, 2021 |

Podcast: Long-read sequencing dramatically improves blood matching – Steven Marsh

One of the popular questions on the Mendelspod program is how those doing sequencing decide between the quality of PacBio’s long reads and the cheaper short read technology, such as…

February 5, 2021 |

Webinar: Sequence with Confidence – Introducing the Sequel II System

In this webinar, Jonas Korlach, Chief Scientific Officer, PacBio, provides an overview of the features and the advantages of the new Sequel II System. Kiran Garimella, Senior Computational Scientist, Broad…

February 5, 2021 |

Webinar: Increasing solve rates for rare and Mendelian diseases with long-read sequencing

Dr. Wenger gives attendees an update on PacBio’s long-read sequencing and variant detection capabilities on the Sequel II System and shares recommendations on how to design your own study using…

April 21, 2020 |

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
