The newer hierarchical genome assembly process (HGAP) performs de novo assembly using data from a single PacBio long insert library. To assess the benefits of this method, DNA from several Salmonella enterica serovars was isolated from a pure culture. Genome sequencing was performed using Pacific Biosciences RS sequencing technology. The HGAP process enabled us to close sixteen Salmonella subsp. enterica genomes and their associated mobile elements: The ten serotypes include: Salmonella enterica subsp. enterica serovar Enteritidis (S. Enteritidis) S. Bareilly, S. Heidelberg, S. Cubana, S. Javiana and S. Typhimurium, S. Newport, S. Montevideo, S. Agona, and S. Tennessee. In addition, we were able to detect novel methyltransferases (MTases) by using the Pacific Biosciences kinetic score distributions showing that each serovar appears to have a novel methylation pattern. For example while all Salmonella serovars examined so far have methylase specific activity for 5’-GATC-3’/3’-CTAG-5’ and 5’-CAGAG-3’/3’-GTCTC-5’ (underlined base indicates a modification), S. Heidelberg is uniquely specific for 5’-ACCANCC-3’/3’-TGGTNGG-5’, while S. Typhimurium has uniquely methylase specific for 5′-GATCAG-3’/3′- CTAGTC-5′ sites, for the samples examined so far. We believe that this may be due to the unique environments and phages that these serotypes have been exposed to. Furthermore, our analysis identified and closed a variety of plasmids such as mobilization plasmids, antimicrobial resistance plasmids and IncX plasmids carrying a Type IV secretion system (T4SS). The VirB/D4 T4SS apparatus is important in that it assists with rapid dissemination of antibiotic resistance and virulence determinants. Presently, only limited information exists regarding the genotypic characterization of drug resistance in S. Heidelberg isolates derived from various host species. Here, we characterize two S. Heidelberg outbreak isolates from two different outbreaks. Both isolates contain the IncX plasmid of approximately 35 kb, and carried the genes virB1, virB2, virB3/4, virB5, virB6, virB7, virB8, virB9, virB10, virB11, virD2, and virD4, that are associated with the T4SS. In addition, the outbreak isolate associated with ground turkey carries a 4,473 bp mobilization plasmid and an incompatibility group (Inc) I1 antimicrobial resistance plasmid encoding resistance to gentamicin (aacC2), beta-lactam (bl2b_tem), streptomycin (aadAI) and tetracycline (tetA, tetR) while the outbreak isolate associated with chicken breast carries the IncI1 plasmid encoding resistance to gentamicin (aacC2), streptomycin (aadAI) and sulfisoxazole (sul1). Using this new technology we explored the genetic elements present in resistant pathogens which will achieve a better understanding of the evolution of Salmonella.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers in large genome complexities, such as long, highly repetitive, low-complexity regions and duplication events, and differentiating between transcript isoforms that are difficult to resolve with short-read technologies. We present solutions available for both reference genome improvement (>100 MB) and transcriptome research to best leverage long reads that have exceeded 20 Kb in length. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. Highlights from our genome assembly projects using the latest P5-C3 chemistry on model organisms will be shared. Assembly contig N50 have exceeded 6 Mb and we observed longest contig exceeding 12.5 Mb with an average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq Application will be presented.
Complex alternative splicing patterns in hematopoietic cell subpopulations revealed by third-generation long reads.
Background: Alternative splicing expands the repertoire of gene functions and is a signature for different cell populations. Here we characterize the transcriptome of human bone marrow subpopulations including progenitor cells to understand their contribution to homeostasis and pathological conditions such as atherosclerosis and tumor metastasis. To obtain full-length transcript structures, we utilized long reads in addition to RNA-seq for estimating isoform diversity and abundance. Method: Freshly harvested, viable human bone marrow tissues were extracted from discarded harvesting equipment and separated into total bone marrow (total), lineage-negative (lin-) progenitor cells and differentiated cells (lin+) by magnetic bead sorting with antibodies to surface markers of hematopoietic cell lineages. Sequencing was done with SOLiD, Illumina HiSeq (100bp paired-end reads), and PacBio RS II (full-length cDNA library protocol for 1 – 6 kb libraries). Short reads were assembled using both Trinity for de novo assembly and Cufflinks for genome-guided assembly. Full-length transcript consensus sequences were obtained for the PacBio data using the RS_IsoSeq protocol from PacBios SMRTAnalysis software. Quantitation for each sample was done independently for each sequencing platform using Sailfish to obtain the TPM (transcripts per million) using k-mer matching. Results: PacBios long read sequencing technology is capable of sequencing full-length transcripts up to 10 kb and reveals heretofore-unseen isoform diversity and complexity within the hematopoietic cell populations. A comparison of sequencing depth and de novo transcript assembly with short read, second-generation sequencing reveals that, while short reads provide precision in determining portions of isoform structure and supporting larger 5 and 3 UTR regions, it fails in providing a complete structure especially when multiple isoforms are present at the same locus. Increased breadth of isoform complexity is revealed by long reads that permits further elaboration of full isoform diversity and specific isoform abundance within each separate cell population. Sorting the distribution of major and minor isoforms reveals a cell population-specific balance focused on distinct genome loci and shows how tissue specificity and diversity are modulated by alternative splicing.
Second-generation sequencing has brought about tremendous insights into the genetic underpinnings of biology. However, there are many functionally important and medically relevant regions of genomes that are currently difficult or impossible to sequence, resulting in incomplete and fragmented views of genomes. Two main causes are (i) limitations to read DNA of extreme sequence content (GC-rich or AT-rich regions, low complexity sequence contexts) and (ii) insufficient read lengths which leave various forms of structural variation unresolved and result in mapping ambiguities.
Significant advances in bioinformatics tool development have been made to more efficiently leverage and deliver high-quality genome assemblies with PacBio long-read data. Current data throughput of SMRT Sequencing delivers average read lengths ranging from 10-15 kb with the longest reads exceeding 40 kb. This has resulted in consistent demonstration of a minimum 10-fold improvement in genome assemblies with contig N50 in the megabase range compared to assemblies generated using only short- read technologies. This poster highlights recent advances and resources available for advanced bioinformaticians and developers interested in the current state-of-the-art large genome solutions available as open-source code from PacBio and third-party solutions, including HGAP, MHAP, and ECTools. Resources and tools available on GitHub are reviewed, as well as datasets representing major model research organisms made publically available for community evaluation or interested developers.
Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly have dropped precipitately, which has spurred interest in genome sequencing overall. Unfortunately the contiguity of the NGS assembled sequences, as well as the accuracy of these assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove to be difficult to properly annotate. In some cases the contiguous portions are smaller than the average gene size making the sequence not nearly as useful for biologists as the earlier reference quality genomes including of Human, Mouse, C. elegans, or Drosophila. Recently, new 3rd generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.
Complete resequencing of extended genomic regions using fosmid target capture and single molecule real-time (SMRT) long read sequencing technology.
A longstanding goal of genomic analysis is the identification of causal genetic factors contributing to disease. While the common disease/common variant hypothesis has been tested in many genome-wide association studies, few advancements in identifying causal variation have been realized, and instead recent findings point away from common variants towards aggregate rare variants as causal. A challenge is obtaining complete phased genomic sequences over extended genomic regions from sufficient numbers of cases and controls to identify all potential variation causal of a disease. To address this, we modified methods for targeted DNA isolation using fosmid technology and single-molecule, long-sequence-read generaton that combine for complete, haplotype-resolved resequencing across extended genomic subregions. As proof of principal, we validated the approach by resequencing four 800 kbp segments that span a major histocompatibility complex (MHC) common extended haplotype (CEH) associated with disease. The data revealed the extent of conservation exposing a near identity among four DR4 CEHs over conserved regions, detailing rare variation and measuring sequence accuracy. In a second test, we sequenced the complete KIR haplotypes from 8 individuals within a specific timeframe and cost. Single molecule long-read sequencing technology generated contiguous full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with very little data processing. All of the sequences produced from these projects were contiguous, phased, with accuracy above 99.99%. The results demonstrated that cost-effective scale-up is possible to generate scores to hundreds of phased chromosomal sequences of extended lengths that can encompass genomic regions associated with disease.
We have developed barcoding reagents and workflows for multiplexing amplicons or fragmented native genomic (DNA) prior to Single Molecule, Real-Time (SMRT) Sequencing. The long reads of PacBio’s SMRT Sequencing enable detection of linked mutations across multiple kilobases (kb) of sequence. This feature is particularly useful in the context of mutational analysis or SNP confirmation, where a large number of samples are generated routinely. To validate this workflow, a set of 384 1.7-kb amplicons, each derived from variants of the Phi29 DNA polymerase gene, were barcoded during amplification, pooled, and sequenced on a single SMRT Cell. To demonstrate the applicability of the method to longer inserts, a library of 96 5-kb clones derived from the E. coli genome was sequenced.
For highly complex and large genomes, a well-annotated genome may be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually rely on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding computational challenges that short read transcript assembly brings. The use of single molecule, real-time sequencing from Pacific Biosciences to sequence transcriptomes (the Iso-SeqTM method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to a Cuttlefish Iso-Seq dataset, for which there is a highly fragmented, Illumina-based draft genome assembly and little annotation. We show that Cogent successfully discovers gene families and can reconstruct the coding region of gene loci. The reconstructed contigs can then be used to visualize alternative splicing events, identify minor variants, and even be used to improve genome assemblies.
Numerous whole genome sequencing projects already achieved or ongoing have highlighted the fact that obtaining a high quality genome sequence is necessary to address comparative genomics questions such as structural variations among genotypes and gain or loss of specific function. Despite the spectacular progress that has been done regarding sequencing technologies, accurate and reliable data are still challenging, at the whole genome scale but also when targeting specific genomic regions. These issues are even more noticeable for complex plant genomes. Most plant genomes are known to be particularly challenging due to their size, high density of repetitive elements and various levels of ploidy. To overcome these issues, we have developed a strategy in order to reduce the genome complexity by using the large insert BAC libraries combined with next generation sequencing technologies. We have compared two different technologies (Roche-454 and Pacific Biosciences PacBio RS II) to sequence pools of BAC clones in order to obtain the best quality sequence. We targeted nine BAC clones from different species (maize, wheat, strawberry, barley, sugarcane and sunflower) known to be complex in terms of sequence assembly. We sequenced the pools of the nine BAC clones with both technologies. We have compared results of assembly and highlighted differences due to the sequencing technologies used. We demonstrated that the long reads obtained with the PacBio RS II technology enables to obtain a better and more reliable assembly notably by preventing errors due to duplicated or repetitive sequences in the same region.
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LDs). This level of complexity and inherent biases of short-read sequencing make it challenging for extracting immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS based genome and exome sequencing and SNP arrays have become a routine for population studies, numerous efforts are being made for developing software to extract and or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases, has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of genes comprised within them. All the genotype information is detected at allele- level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. De novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
Reconstruction of the spinach coding genome using full-length transcriptome without a reference genome
For highly complex and large genomes, a well-annotated genome may be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually rely on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding computational challenges that short read transcript assembly brings. The use of single molecule, real-time sequencing from PacBio to sequence transcriptomes (the Iso-Seq method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a high-quality reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to the Iso-Seq data for spinach, Spinacia oleracea, for which there is also a PacBio-based draft genome to validate the reconstruction. The Iso-Seq dataset consists of 68,263 fulllength, Quiver-polished transcript sequences ranging from 528 bp to 6 kbp long (mean: 2.1 kbp). Using the genome mapping as ground truth, we found that 95% (8045/8446) of the Cogent gene families found corresponded to a single genomic loci. For families that contained multiple loci, they were often homologous genes that would be categorized as belonging to the same gene family. Coding genome reconstruction was then performed individually for each gene family. A total of 86% (7283/8446) of the gene families were resolved to a single contig by Cogent, and was validated to be also a single contig in the genome. In 59 cases, Cogent reconstructed a single contig, however the contig corresponded to 2 or more loci in the genome, suggesting possible scaffolding opportunities. In 24 cases, the transcripts had no hits to the genome, though Pfam and BLAST searches of the transcripts show that they were indeed coding, suggesting that the genome is missing certain coding portions. Given the high quality of the spinach genome, we were not surprised to find that Cogent only minorly improved the genome space. However the ability of Cogent to accurately identify gene families and reconstruct the coding genome in a de novo fashion shows that it will be extremely powerful when applied to datasets for which there is no or low-quality reference genome.
Targeted sequencing employing PCR amplification is a fundamental approach to studying human genetic disease. PacBio’s Sequel System and supporting products provide an end-to-end solution for amplicon sequencing, offering better performance to Sanger technology in accuracy, read length, throughput, and breadth of informative data. Sample multiplexing is supported with three barcoding options providing the flexibility to incorporate unique sample identifiers during target amplification or library preparation. Multiplexing is key to realizing the full capacity of the 1 million individual reactions per Sequel SMRT Cell. Two analysis workflows that can generate high-accuracy results support a wide range of amplicon sizes in two ranges from 250 bp to 3 kb and from 3 kb to >10 kb. The Circular Consensus Sequencing workflow results in high accuracy through intra-molecular consensus generation, while high accuracy for the Long Amplicon Analysis workflow is achieved by clustering of individual long reads from multiple reactions. Here we present workflows and results for single- molecule sequencing of amplicons for human genetic analysis.
SMRT-Cappable-seq combines the isolation of full-length prokaryotic primary transcripts with long read sequencing technology. It is the first experimental methodology to sequence entire prokaryotic transcripts. It identifies the transcription start site and termination site, thereby directly defines the operon structures genome-wide in prokaryotes. Applied to E.coli, SMRT-Cappable-seq identifies a total of ~2300 operons, among which ~900 are novel. Importantly, our result reveals a pervasive read-through of previous experimentally validated transcription termination sites. Termination read-through represents a powerful strategy to control gene expression. Taken together this data provides a first glance at the complexity of the ‘operome’ in bacteria and presents an invaluable resource for understanding gene regulation and function in bacteria.
Listen to Mendelspod’s podcast with PacBio CEO, Mike Hunkapiller, as he recounts his memories from the first human draft genome project 15 years ago.