AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technolgy.
Third generation single molecule sequencing technology from Pacific Biosciences, Moleculo, Oxford Nanopore, and other companies are revolutionizing genomics by enabling the sequencing of long, individual molecules of DNA and RNA. One major advantage of these technologies over current short read sequencing is the ability to sequence much longer molecules, thousands or tens of thousands of nucleotides instead of mere hundreds. This capacity gives researchers substantially greater power to probe into microbial, plant, and animal genomes, but it remains unknown on how to best use these data. To answer this, we systematically evaluated the human genome and 25 other important genomes across the tree of life ranging in size from 1Mbp to 3Gbp in an attempt to answer how long the reads need to be and how much coverage is necessary to completely assemble their chromosomes with single molecule sequencing. We also present a novel error correction and assembly algorithm using a combination of PacBio and pre-assembled Illumina sequencing. This new algorithm greatly outperforms other published hybrid algorithms.
From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome
Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates and assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the assembly surpasses reference assemblies, de novo, with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly suggesting better quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and minimal manual effort.
Motivation: Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
More than 3,000 species of octocorals (Cnidaria, Anthozoa) inhabit an expansive range of environments, from shallow tropical seas to the deep-ocean floor. They are important foundation species that create coral “forests,” which provide unique niches and 3-dimensional living space for other organisms. The octocoral genus Renilla inhabits sandy, continental shelves in the subtropical and tropical Atlantic and eastern Pacific Oceans. Renilla is especially interesting because it produces secondary metabolites for defense, exhibits bioluminescence, and produces a luciferase that is widely used in dual-reporter assays in molecular biology. Although several anthozoan genomes are currently available, the majority of these are hexacorals. Here, we present a de novo assembly of an azooxanthellate shallow-water octocoral, Renilla muelleri.We generated a hybrid de novo assembly using MaSuRCA v.3.2.6. The final assembly included 4,825 scaffolds and a haploid genome size of 172 megabases (Mb). A BUSCO assessment found 88% of metazoan orthologs present in the genome. An Augustus ab initio gene prediction found 23,660 genes, of which 66% (15,635) had detectable similarity to annotated genes from the starlet sea anemone, Nematostella vectensis, or to the Uniprot database. Although the R. muelleri genome may be smaller (172 Mb minimum size) than other publicly available coral genomes (256-448 Mb), the R. muelleri genome is similar to other coral genomes in terms of the number of complete metazoan BUSCOs and predicted gene models.The R. muelleri hybrid genome provides a novel resource for researchers to investigate the evolution of genes and gene families within Octocorallia and more widely across Anthozoa. It will be a key resource for future comparative genomics with other corals and for understanding the genomic basis of coral diversity. © The Author(s) 2019. Published by Oxford University Press.
Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data.
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms. © The Author 2017. Published by Oxford University Press.
The commercial release of third-generation sequencing technologies (TGSTs), giving long and ultra-long sequencing reads, has stimulated the development of new tools for assembling highly contiguous genome sequences with unprecedented accuracy across complex repeat regions. We survey here a wide range of emerging sequencing platforms and analytical tools for de novo assembly, provide background information for each of their steps, and discuss the spectrum of available options. Our decision tree recommends workflows for the generation of a high-quality genome assembly when used in combination with the specific needs and resources of a project.Copyright © 2019 Elsevier Ltd. All rights reserved.
Potential of TLR-gene diversity in Czech indigenous cattle for resistance breeding as revealed by hybrid sequencing
A production herd of Czech Simmental cattle (Czech Red Pied, CRP), the conserved subpopulation of this breed, and the ancient local breed Czech Red cattle (CR) were screened for diversity in the antibacterial toll-like receptors (TLRs), which are members of the innate immune system. Polymerase chain reaction (PCR) amplicons of TLR1, TLR2, TLR4, TLR5, and TLR6 from pooled DNA samples were sequenced with PacBio technology, with 3–5×?coverage per gene per animal. To increase the reliability of variant detection, the gDNA pools were sequenced in parallel with the Illumina X-ten platform at low coverage (60× per gene). The diversity in conserved CRP and CR was similar to the diversity in conserved and modern CRP, representing 76.4?% and 70.9?% of its variants, respectively. Sixty-eight (54.4?%) polymorphisms in the five TLR genes were shared by the two breeds, whereas 38 (30.4?%) were specific to the production herd of CRP; 4 (3.2?%) were specific to the broad CRP population; 7 (5.6?%) were present in both conserved populations; 5 (4.0?%) were present solely for the conserved CRP; and 3 (2.4?%) were restricted to CR. Consequently, gene pool erosion related to intensive breeding did not occur in Czech Simmental cattle. Similarly, no considerable consequences were found from known bottlenecks in the history of Czech Red cattle. On the other hand, the distinctness of the conserved populations and their potential for resistance breeding were only moderate. This relationship might be transferable to other non-abundant historical cattle breeds that are conserved as genetic resources. The estimates of polymorphism impact using Variant Effect Predictor and SIFT software tools allowed for the identification of candidate single-nucleotide polymorphisms (SNPs) for association studies related to infection resistance and targeted breeding. Knowledge of TLR-gene diversity present in Czech Simmental populations may aid in the potential transfer of variant characteristics from other breeds.
Long sequencing reads offer unprecedented opportunities in analysis and reconstruction of complex genomic regions. However, the gain in sequence length is often traded for quality. Therefore, recently several approaches have been proposed (e.g. higher sequencing coverage, hybrid assembly or sequence correction) to enhance the quality of long sequencing reads. A simple and cost-effective approach includes use of the high quality 2nd generation sequencing data to improve the quality of long reads. We designed a dedicated testing procedure and selected universal programs for long read correction, which provide as the output sequences that can be used in further genomic and transcriptomic studies. Our results show that HALC is the best choice for correction of long PacBio reads, when both, read size and quality, are the main focus of the analysis. However, the tested tools show some unexpected behaviors, including read trimming and fragmentation.Copyright © 2017 Elsevier Inc. All rights reserved.
Genome Sequence of Jaltomata Addresses Rapid Reproductive Trait Evolution and Enhances Comparative Genomics in the Hyper-Diverse Solanaceae.
Within the economically important plant family Solanaceae, Jaltomata is a rapidly evolving genus that has extensive diversity in flower size and shape, as well as fruit and nectar color, among its ~80 species. Here, we report the whole-genome sequencing, assembly, and annotation, of one representative species (Jaltomata sinuosa) from this genus. Combining PacBio long reads (25×) and Illumina short reads (148×) achieved an assembly of ~1.45?Gb, spanning ~96% of the estimated genome. Ninety-six percent of curated single-copy orthologs in plants were detected in the assembly, supporting a high level of completeness of the genome. Similar to other Solanaceous species, repetitive elements made up a large fraction (~80%) of the genome, with the most recently active element, Gypsy, expanding across the genome in the last 1-2 Myr. Computational gene prediction, in conjunction with a merged transcriptome data set from 11 tissues, identified 34,725 protein-coding genes. Comparative phylogenetic analyses with six other sequenced Solanaceae species determined that Jaltomata is most likely sister to Solanum, although a large fraction of gene trees supported a conflicting bipartition consistent with substantial introgression between Jaltomata and Capsicum after these species split. We also identified gene family dynamics specific to Jaltomata, including expansion of gene families potentially involved in novel reproductive trait development, and loss of gene families that accompanied the loss of self-incompatibility. This high-quality genome will facilitate studies of phenotypic diversification in this rapidly radiating group and provide a new point of comparison for broader analyses of genomic evolution across the Solanaceae.
Hybrid sequencing-based personal full-length transcriptomic analysis implicates proteostatic stress in metastatic ovarian cancer.
Comprehensive molecular characterization of myriad somatic alterations and aberrant gene expressions at personal level is key to precision cancer therapy, yet limited by current short-read sequencing technology, individualized catalog of complete genomic and transcriptomic features is thus far elusive. Here, we integrated second- and third-generation sequencing platforms to generate a multidimensional dataset on a patient affected by metastatic epithelial ovarian cancer. Whole-genome and hybrid transcriptome dissection captured global genetic and transcriptional variants at previously unparalleled resolution. Particularly, single-molecule mRNA sequencing identified a vast array of unannotated transcripts, novel long noncoding RNAs and gene chimeras, permitting accurate determination of transcription start, splice, polyadenylation and fusion sites. Phylogenetic and enrichment inference of isoform-level measurements implicated early functional divergence and cytosolic proteostatic stress in shaping ovarian tumorigenesis. A complementary imaging-based high-throughput drug screen was performed and subsequently validated, which consistently pinpointed proteasome inhibitors as an effective therapeutic regime by inducing protein aggregates in ovarian cancer cells. Therefore, our study suggests that clinical application of the emerging long-read full-length analysis for improving molecular diagnostics is feasible and informative. An in-depth understanding of the tumor transcriptome complexity allowed by leveraging the hybrid sequencing approach lays the basis to reveal novel and valid therapeutic vulnerabilities in advanced ovarian malignancies.
PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice.
In eukaryotes, alternative splicing (AS) greatly expands the diversity of transcripts. However, it is challenging to accurately determine full-length splicing isoforms. Recently, more studies have taken advantage of Pacific Bioscience (PacBio) long-read sequencing to identify full-length transcripts. Nevertheless, the high error rate of PacBio reads seriously offsets the advantages of long reads, especially for accurately identifying splicing junctions. To best capitalize on the features of long reads, we used Illumina RNA-seq reads to improve PacBio circular consensus sequence (CCS) quality and to validate splicing patterns in the rice transcriptome. We evaluated the impact of CCS accuracy on the number and the validation rate of splicing isoforms, and integrated a comprehensive pipeline of splicing transcripts analysis by Iso-Seq and RNA-seq (STAIR) to identify the full-length multi-exon isoforms in rice seedling transcriptome (Oryza sativa L. ssp. japonica). STAIR discovered 11 733 full-length multi-exon isoforms, 6599 more than the SMRT Portal RS_IsoSeq pipeline did. Of these splicing isoforms identified, 4453 (37.9%) were missed in assembled transcripts from RNA-seq reads, and 5204 (44.4%), including 268 multi-exon long non-coding RNAs (lncRNAs), were not reported in the MSU_osa1r7 annotation. Some randomly selected unreported splicing junctions were verified by polymerase chain reaction (PCR) amplification. In addition, we investigated alternative polyadenylation (APA) events in transcripts and identified 829 major polyadenylation [poly(A)] site clusters (PACs). The analysis of splicing isoforms and APA events will facilitate the annotation of the rice genome and studies on the expression and polyadenylation of AS genes in different developmental stages or growth conditions of rice. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.
In this genome report, we describe the sequencing and annotation of the genome of the wine grape Carménère (clone 02, VCR-702). Long considered extinct, this old French wine grape variety is now cultivated mostly in Chile where it was imported in the 1850s just before the European phylloxera epidemic. Genomic DNA was sequenced using Single Molecule Real Time technology and assembled with FALCON-Unzip, a diploid-aware assembly pipeline. To optimize the contiguity and completeness of the assembly, we tested about a thousand combinations of assembly parameters, sequencing coverage, error correction and repeat masking methods. The final scaffolds provide a complete and phased representation of the diploid genome of this wine grape. Comparison of the two haplotypes revealed numerous heterozygous variants, including loss-of-function ones, some of which in genes associated with polyphenol biosynthesis. Comparisons with other publicly available grape genomes and transcriptomes showed the impact of structural variation on gene content differences between Carménère and other wine grape cultivars. Among the putative cultivar-specific genes, we identified genes potentially involved in aroma production and stress responses. The genome assembly of Carménère expands the representation of the genomic variability in grapes and will enable studies that aim to understand its distinctive organoleptic and agronomical features and assess its still elusive extant genetic variability. A genome browser for Carménère, its annotation, and an associated blast tool are available at http://cantulab.github.io/data.Copyright © 2019 Minio et al.
PacBio sequencing reveals bacterial community diversity in cheeses collected from different regions.
Cheese is a fermented dairy product that is popular for its unique flavor and nutritional value. Recent studies have shown that microorganisms in cheese play an important role in the fermentation process and determine the quality of the cheese. We collected 12 cheese samples from different regions and studied the composition of their bacterial communities using PacBio small-molecule real-time sequencing (Pacific Biosciences, Menlo Park, CA). Our data revealed 144 bacterial genera (including Lactobacillus, Streptococcus, Lactococcus, and Staphylococcus) and 217 bacterial species (including Lactococcus lactis, Streptococcus thermophilus, Staphylococcus equorum, and Streptococcus uberis). We investigated the flavor quality of the cheese samples using an electronic nose system and we found differences in flavor-quality indices among samples from different regions. We found a clustering tendency based on flavor quality using principal component analysis. We found correlations between lactic acid bacteria and the flavor quality of the cheese samples. Biodegradation and metabolism of xenobiotics, and lipid-metabolism-related pathways, were predicted to contribute to differences in cheese flavor using Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt). This preliminary study explored the bacterial communities in cheeses collected from different regions and their potential genome functions from the perspective of flavor quality.Copyright © 2020 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes.
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation. © The Author 2017. Published by Oxford University Press.