Interested in learning about pangenomes? Explore this guide to learn how they provide a more complete picture of the core genes of a given species and how that can provide better biological understanding.
Highly accurate long reads – HiFi reads – with single-molecule resolution make Single Molecule, Real-Time (SMRT) Sequencing ideal for full-length 16S rRNA sequencing, shotgun metagenomic profiling, and metagenome assembly.
Comparative genomics of Shiga toxin-producing Escherichia coli O145:H28 strains associated with the 2007 Belgium and 2010 US outbreaks.
Shiga toxin-producing Escherichia coli (STEC) is an emerging pathogen. Recently there has been a global increase in the number of outbreaks caused by non-O157 STECs, typically involving six serogroups: O26, O45, O103, O111, O121, and O145. STEC O145:H28 has been associated with severe human disease, including hemolytic-uremic syndrome (HUS), as demonstrated by the 2007 Belgian ice-cream-associated outbreak and the 2010 US lettuce-associated outbreak, with over 10% of patients developing HUS in each. The goal of this work was to perform comparative genomics of clinical and environmental strains to investigate the genome diversity and virulence evolution of this important foodborne pathogen.
Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods that provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA sequencing with short, high-accuracy reads (SMRT circular consensus sequencing (CCS) or second-generation reads) to generate long, highly accurate reads that are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which long SMRT sequencing reads (average read lengths >5,000 bases) are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. The process has been automated and requires less than 1 day from an unknown DNA sample to its complete de novo genome and epigenome.
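The QV50 figure quoted above can be related to percent accuracy with the standard Phred scale, where QV = -10 * log10(error probability). The following is a minimal sketch of that conversion only; the function names and the 5 Mb genome size in the usage lines are illustrative assumptions, not values or tools from the abstract.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy (0..1)."""
    error_probability = 10 ** (-qv / 10)
    return 1.0 - error_probability


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0..1, strictly below 1) to a Phred-scaled QV."""
    return -10 * math.log10(1.0 - accuracy)


def expected_errors(genome_size_bp: int, qv: float) -> float:
    """Expected number of erroneous bases in an assembly of the given size."""
    return genome_size_bp * 10 ** (-qv / 10)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    # Hypothetical 5 Mb bacterial genome, chosen only for illustration.
    print(f"Expected errors in 5 Mb at QV50: {expected_errors(5_000_000, 50):.0f}")
```

Under this definition, an assembly at exactly QV50 would be expected to carry about one error per 100 kb, which is why accuracies "exceeding QV50" are described as greater than 99.999%.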
As the costs for genome sequencing have decreased, the number of “genome” sequences has increased at a rapid pace. Unfortunately, the quality and completeness of these so-called “genome” sequences have suffered enormously. We prefer to call such genome assemblies “gene assembly space” (GAS). We believe it is important to distinguish GAS assemblies from reference genome assemblies (RGAs), as all subsequent research that depends on accurate genome assemblies can be highly compromised if the only assembly available is a GAS assembly.
Lameness is a significant problem resulting in millions of dollars in lost revenue annually. In commercial broilers, the most common cause of lameness is bacterial chondronecrosis with osteomyelitis (BCO). We are using a wire flooring model to induce lameness attributable to BCO. We used 16S ribosomal DNA sequencing to determine that Staphylococcus spp. were the main species associated with BCO. Staphylococcus agnetis, which previously had not been isolated from poultry, was the principal species isolated from the majority of the bone lesion samples. Administering S. agnetis in the drinking water to broilers reared on wire flooring increased the incidence of BCO three-fold when compared with broilers drinking tap water (P = 0.001). We found that the minimum effective dose of S. agnetis to induce BCO in broilers grown on wire flooring is 10^5 cfu/ml. We used PacBio and Illumina sequencing to assemble a 2.4 Mbp contig representing the genome and a 34 kbp contig for the largest plasmid of S. agnetis. Annotation of this genome is underway through comparative genomics with other Staphylococcus genomes and identification of virulence factors. Our goal is to elucidate genetic diversity, toxins, and pathogenicity determinants for this poorly characterized species. Isolating pathogenic bacterial species, defining their likely route of transmission to broilers, and performing genomic analyses will contribute substantially to the development of measures for mitigating BCO losses in poultry.
Whole genome sequencing can provide comprehensive information important for determining the biochemical and genetic nature of all elements inside a genome. The high-quality genome references produced from past genome projects and advances in short-read sequencing technologies have enabled quick and cheap analysis for simple variants. However, even with the focus on genome-wide resequencing for SNPs, the heritability of more than 50% of human diseases remains elusive. For non-human organisms, high-contiguity references are deficient, limiting the analysis of genomic features. The long and unbiased reads from Single Molecule, Real-Time (SMRT) Sequencing and new de novo assembly approaches have demonstrated the ability to detect more complicated variants and chromosome-level phasing. Moreover, with recent advances in bioinformatics algorithms and tools, the computational tasks for completing high-quality de novo assembly of large genomes have become feasible with commodity hardware. Ongoing development in sequencing technologies and bioinformatics will likely lead to routine generation of high-quality reference assemblies in the future. We discuss the current state of the art and the challenges in bioinformatics toward such a goal. More specifically, explicit examples of pragmatic computational requirements for assembling mammalian-size genomes and algorithms suitable for processing diploid genomes are discussed.
Numerous whole genome sequencing projects, completed or ongoing, have highlighted the fact that obtaining a high-quality genome sequence is necessary to address comparative genomics questions such as structural variations among genotypes and gain or loss of specific function. Despite the spectacular progress that has been made in sequencing technologies, obtaining accurate and reliable data is still challenging, both at the whole genome scale and when targeting specific genomic regions. These issues are even more noticeable for complex plant genomes. Most plant genomes are known to be particularly challenging due to their size, high density of repetitive elements, and various levels of ploidy. To overcome these issues, we have developed a strategy to reduce genome complexity by using large-insert BAC libraries combined with next-generation sequencing technologies. We have compared two different technologies (Roche-454 and Pacific Biosciences PacBio RS II) to sequence pools of BAC clones in order to obtain the best quality sequence. We targeted nine BAC clones from different species (maize, wheat, strawberry, barley, sugarcane and sunflower) known to be complex in terms of sequence assembly. We sequenced the pools of the nine BAC clones with both technologies. We have compared the assembly results and highlighted differences due to the sequencing technologies used. We demonstrated that the long reads obtained with the PacBio RS II technology enable a better and more reliable assembly, notably by preventing errors due to duplicated or repetitive sequences in the same region.
From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis and genome annotation
A single gene may encode a surprising number of proteins, each with a distinct biological function. This is especially true in complex eukaryotes. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. The PacBio Isoform Sequencing (Iso-Seq™) method employs long reads to sequence transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for transcript reconstruction and inference. These long reads result in complete, unambiguous information about alternatively spliced exons, transcriptional start sites, and polyadenylation sites. This allows for the characterization of the full complement of isoforms within targeted genes, or across an entire transcriptome. Here we present improved genome annotations for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), using the Iso-Seq method. We present graphical user interface and command line analysis workflows for the data sets. From brain total RNA, we characterize more than 15,000 isoforms in each species, 9% and 5% of which were previously unannotated in hummingbird and zebra finch, respectively. We highlight one example where capturing full-length transcripts identifies additional exons and UTRs.
Incomplete annotation of genomes represents a major impediment to understanding biological processes, functional differences between species, and evolutionary mechanisms. Often, genes that are large, embedded within duplicated genomic regions, or associated with repeats are difficult to study by short-read expression profiling and assembly. In addition, most genes in eukaryotic organisms produce alternatively spliced isoforms, broadening the diversity of proteins encoded by the genome, which are difficult to resolve with short-read methods. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. In contrast, Single Molecule, Real-Time (SMRT) Sequencing directly sequences full-length transcripts without the need for assembly and imputation. Here we apply the Iso-Seq method (long-read RNA sequencing) to detect full-length isoforms and the new IsoPhase algorithm to retrieve allele-specific isoform information for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata).
Introduction: Long-read sequencing has revealed more than 20,000 structural variants spanning over 12 Mb in a healthy human genome. Short-read sequencing fails to detect most structural variants but has remained the more effective approach for small variants, due to 10-15% error rates in long reads, and copy-number variants (CNVs), due to lack of effective long-read variant callers. The development of PacBio highly accurate long reads (HiFi reads) with read lengths of 10-25 kb and quality >99% presents the opportunity to capture all classes of variation with one approach. Methods: We sequence the Genome in a Bottle benchmark sample HG002 and an individual with a presumed Mendelian disease with HiFi reads. We call SNVs and indels with DeepVariant and extend the structural variant caller pbsv to call CNVs using read depth and clipping signatures. Results: For 18-fold coverage with 13 kb HiFi reads, variant calling in HG002 achieves an F1 score of 99.7% for SNVs, 96.6% for indels, and 96.4% for structural variants. Additionally, we detect more than 300 CNVs spanning around 10 Mb. For the Mendelian disease case, HiFi reads reveal thousands of variants that were overlooked by short-read sequencing, including a candidate causative structural variant. Conclusions: These results illustrate the ability of HiFi reads to comprehensively detect variants, including those associated with human disease.
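The F1 scores reported above summarize variant-calling performance against a benchmark as the harmonic mean of precision and recall. The sketch below shows only that arithmetic; the function name and the true-positive, false-positive, and false-negative counts are hypothetical and are not taken from the HG002 evaluation.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from benchmark comparison counts.

    true_pos:  calls that match a benchmark variant
    false_pos: calls with no matching benchmark variant
    false_neg: benchmark variants that were not called
    """
    called = true_pos + false_pos
    truth = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    p, r, f1 = precision_recall_f1(true_pos=9_650, false_pos=300, false_neg=400)
    print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a high F1 indicates that a caller is both finding most benchmark variants and making few spurious calls.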
Fritz Sedlazeck, a postdoc at Johns Hopkins University, describes his structural variant detection tool Sniffles in this poster from AGBT 2016. Included: examples of structural variants that could not be…
To make improvements to crops like corn, soybeans, and canola, scientists at Corteva are building a compendium of crop genomics resources to provide actionable sequence info for genetic discovery, gene-editing,…
Understanding interactions among plants and the complex communities of organisms living on, in and around them requires more than one experimental approach. A new method for de novo metagenome assembly,…
In this LabRoots webinar, Jonas Korlach, the CSO of PacBio, provides an introduction to PacBio HiFi sequence reads, which are both long (up to 25 kb currently) and accurate (>99%)…