Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all 55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g. to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mis- tracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license athttps://github.com/marbl/mash
The sensitivity, speed, and reduced cost associated with Next-Generation Sequencing (NGS) technologies have made them indispensable for the molecular profiling of cancer samples. For effective use, it is critical that the NGS methods used are not only robust but can also accurately detect low frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standard from Horizon Diagnostics, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. The Horizon Diagnostics reference sample contains precise allelic frequencies from 1% to 24.5% for major oncology targets verified using digital PCR. It recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot Master Plus assay that targets 252 amplicons (121-254 bp) from 26 relevant cancer genes, which includes all 11 variants in the control sample. The amplicons were sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high accuracy CCS reads make it possible to accurately detect low frequency somatic variants.
Highly sensitive and cost-effective detection of somatic cancer variants using single-molecule, real-time sequencing
Next-Generation Sequencing (NGS) technologies allow for molecular profiling of cancer samples with high sensitivity and speed at reduced cost. For efficient profiling of cancer samples, it is important that the NGS methods used are not only robust, but capable of accurately detecting low-frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays for studying cancer variants. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standards from Horizon Discovery, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. First, we sequenced a reference standard containing precise allelic frequencies from 1% to 24.5% for major oncology targets verified using digital PCR. This reference material recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot MASTR Plus assay that targets 252 amplicons (121-254 bp) from 26 relevant cancer genes, which includes all 11 variants in the control sample. Next, we sequenced control samples prepared by SeraCare Life Sciences, which contained a defined mutation at allelic frequencies from 10% down to 0.1%. The wild type and mutant amplicons were serially diluted, sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high-accuracy CCS reads make it possible to accurately detect low-frequency somatic variants.
Plant and animal whole genome sequencing has proven to be challenging, particularly due to genome size, high density of repetitive elements and heterozygosity. The Sequel System delivers long reads, high consensus accuracy and uniform coverage, enabling more complete, accurate, and contiguous assemblies of these large complex genomes. The latest Sequel chemistry increases yield up to 8 Gb per SMRT Cell for long insert libraries >20 kb and up to 10 Gb per SMRT Cell for libraries >40 kb. In addition, the recently released SMRTbell Express Template Prep Kit reduces the time (~3 hours) and DNA input (~3 µg), making the workflow easy to use for multi- SMRT Cell projects. Here, we recommend the best practices for whole genome sequencing and de novo assembly of complex plant and animal genomes. Guidelines for constructing large-insert SMRTbell libraries (>30 kb) to generate optimal read lengths and yields using the latest Sequel chemistry are presented. We also describe ways to maximize library yield per preparation from as littles as 3 µg of sheared genomic DNA. The combination of these advances makes plant and animal whole genome sequencing a practical application of the Sequel System.
An important need in analyzing complex genomes is the ability to separate and phase haplotypes. While whole genome assembly can deliver this information, it cannot reveal whether there is allele-specific gene or isoform expression. The PacBio Iso-Seq method, which can produce high-quality transcript sequences of 10 kb and longer, has been used to annotate many important plant and animal genomes. We present an algorithm called IsoPhase that post-processes Iso-Seq data for transcript-based haplotyping. We applied IsoPhase to a maize Iso-Seq dataset consisting of two homozygous parents and two F1 cross hybrids. We validated the majority of the SNPs called with IsoPhase against matching short read data and identified cases of allele-specific, gene-level and isoform-level expression.
Library prep and bioinformatics improvements for full-length transcript sequencing on the PacBio Sequel System
The PacBio Iso-Seq method produces high-quality, full-length transcripts of up to 10 kb and longer and has been used to annotate many important plant and animal genomes. Here we describe an improved, simplified library workflow and analysis pipeline that reduces library preparation time, RNA input, and cost. The Iso-Seq V2 Express workflow is a one day protocol that requires only ~300 ng of total RNA input while also reducing the number of reverse transcription and amplification steps down to single reactions. Compared with the previous workflow, the Iso-Seq V2 Express workflow increases the percentage of full-length (FL) reads while achieving a higher average transcript length. At the same time, the Iso-Seq 3 analysis recently released in the SMRT Link 6.0 software is a major improvement over previous versions. Iso-Seq 3 is highly accurate at detecting and removing library artifacts (TSO and RT artifacts) as well as differentiating barcodes on multiplexed samples. Iso-Seq 3 achieves the same output performance in high-quality transcript sequences compared to previous versions while reducing the runtime and memory usage dramatically.
A workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
PacBio HiFi reads (minimum 99% accuracy, 15-25 kb read length) have emerged as a powerful data type for comprehensive variant detection in human genomes. The HiFi read length extends confident mapping and variant calling to repetitive regions of the genome that are not accessible with short reads. Read length also improves detection of structural variants (SVs), with recall exceeding that of short reads by over 30%. High read quality allows for accurate single nucleotide variant and small indel detection, with precision and recall matching that of short reads. While many tools have been developed to take advantage of these qualities of HiFi reads, there is no end-to-end workflow for the filtering and prioritization of variants uniquely detected with long reads for rare and undiagnosed disease research. We have developed a flexible, modular workflow and web portal for variant analysis from HiFi reads and applied it to a set of rare disease cases unsolved by short-read whole genome sequencing. We expect that broad application of long-read variant detection workflows will solve many more rare disease cases. We have made these tools available at https://github.com/williamrowell/pbRUGD-workflow, and we hope they serve a starting point for developing a robust analysis framework for long read variant detection for rare diseases.
User Group Meeting: From long reads to transcript function: Bioinformatics tools for Iso-transcriptomics analysis
In this PacBio User Group Meeting presentation, Ana Conesa Cegarra from the University of Florida spoke about Iso-Seq analysis tools developed by her group, which created the popular SQANTI tools…
In this webinar, Jonas Korlach, Chief Scientific Officer, PacBio provides an overview of the features and the advantages of the new Sequel II System. Kiran Garimella, Senior Computational Scientist, Broad…
In this webinar we present Single Molecule, Real-Time (SMRT) Sequencing and the Iso-Seq method, which allow you to generate full-length cDNA sequences — no assembly required — to characterize transcript…
ASHG CoLab: PacBio HiFi reads for comprehensive characterization of genomes and single-cell isoform expression
In this ASHG 2020 CoLab presentation hear Principal Scientists, Aaron Wenger and Elizabeth Tseng share how highly accurate long reads (HiFi reads) provide comprehensive variant detection for both genomes and…
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Characterization of Reference Materials for Genetic Testing of CYP2D6 Alleles: A GeT-RM Collaborative Project.
Pharmacogenetic testing increasingly is available from clinical and research laboratories. However, only a limited number of quality control and other reference materials currently are available for the complex rearrangements and rare variants that occur in the CYP2D6 gene. To address this need, the Division of Laboratory Systems, CDC-based Genetic Testing Reference Material Coordination Program, in collaboration with members of the pharmacogenetic testing and research communities and the Coriell Cell Repositories (Camden, NJ), has characterized 179 DNA samples derived from Coriell cell lines. Testing included the recharacterization of 137 genomic DNAs that were genotyped in previous Genetic Testing Reference Material Coordination Program studies and 42 additional samples that had not been characterized previously. DNA samples were distributed to volunteer testing laboratories for genotyping using a variety of commercially available and laboratory-developed tests. These publicly available samples will support the quality-assurance and quality-control programs of clinical laboratories performing CYP2D6 testing.Published by Elsevier Inc.
Chlorella vulgaris genome assembly and annotation reveals the molecular basis for metabolic acclimation to high light conditions.
Chlorella vulgaris is a fast-growing fresh-water microalga cultivated at the industrial scale for applications ranging from food to biofuel production. To advance our understanding of its biology and to establish genetics tools for biotechnological manipulation, we sequenced the nuclear and organelle genomes of Chlorella vulgaris 211/11P by combining next generation sequencing and optical mapping of isolated DNA molecules. This hybrid approach allowed to assemble the nuclear genome in 14 pseudo-molecules with an N50 of 2.8 Mb and 98.9% of scaffolded genome. The integration of RNA-seq data obtained at two different irradiances of growth (high light-HL versus low light -LL) enabled to identify 10,724 nuclear genes, coding for 11,082 transcripts. Moreover 121 and 48 genes were respectively found in the chloroplast and mitochondrial genome. Functional annotation and expression analysis of nuclear, chloroplast and mitochondrial genome sequences revealed peculiar features of Chlorella vulgaris. Evidence of horizontal gene transfers from chloroplast to mitochondrial genome was observed. Furthermore, comparative transcriptomic analyses of LL vs HL provide insights into the molecular basis for metabolic rearrangement in HL vs. LL conditions leading to enhanced de novo fatty acid biosynthesis and triacylglycerol accumulation. The occurrence of a cytosolic fatty acid biosynthetic pathway can be predicted and its upregulation upon HL exposure is observed, consistent with increased lipid amount under HL. These data provide a rich genetic resource for future genome editing studies, and potential targets for biotechnological manipulation of Chlorella vulgaris or other microalgae species to improve biomass and lipid productivity.This article is protected by copyright. All rights reserved.