PacBio RS II sequencing chemistries provide read lengths beyond 20 kb with high consensus accuracy. The long read lengths of P4-C2 chemistry and demonstrated consensus accuracy of 99.999% are ideal for applications such as de novo assembly, targeted sequencing and isoform sequencing. The recently launched P5-C3 chemistry generates even longer reads with N50 often >10,000 bp, making it the best choice for scaffolding and spanning structural rearrangements. With these chemistry advances, PacBio’s read length performance is now primarily determined by the SMRTbell library itself. Size selection of a high-quality, sheared 20 kb library using the BluePippin™ System has been demonstrated to increase the N50 read length by as much as 5 kb with C3 chemistry. BluePippin size selection or a more stringent AMPure® PB selection cutoff can be used to recover long fragments from degraded genomic material. The selection of chemistries, P4-C2 versus P5-C3, is highly dependent on the final size distribution of the SMRTbell library and experimental goals. PacBio’s long read lengths also allow for the sequencing of full-length cDNA libraries at single-molecule resolution. However, longer transcripts are difficult to detect due to lower abundance, amplification bias, and preferential loading of smaller SMRTbell constructs. Without size selection, most sequenced transcripts are 1-1.5 kb. Size selection dramatically increases the number of transcripts >1.5 kb, and is essential for >3 kb transcripts.
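The N50 read length quoted above can be computed directly from a list of read lengths. The following is a minimal sketch in Python, assuming the common definition of N50 (the length L such that reads of length L or longer contain at least half of all sequenced bases); the function name and the example read lengths are hypothetical and are not taken from any PacBio dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is taken here as the length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths in bp, for illustration only.
read_lengths = [2_000, 4_000, 6_000, 8_000, 10_000, 12_000]
print(n50(read_lengths))
```

Because N50 is weighted by bases rather than by read count, removing short fragments through size selection shifts it upward even when the longest reads are unchanged.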
Integrative biology of a fungus: Using PacBio SMRT Sequencing to interrogate the genome, epigenome, and transcriptome of Neurospora crassa.
PacBio SMRT Sequencing has the unique ability to directly detect base modifications in addition to the nucleotide sequence of DNA. Because eukaryotes use base modifications to regulate gene expression, the absence or presence of epigenetic events relative to the location of genes is critical to elucidate the function of the modification. Therefore, an integrated approach that combines multiple omic-scale assays is necessary to study complex organisms. Here, we present an integrated analysis of three sequencing experiments: 1) DNA sequencing, 2) base-modification detection, and 3) Iso-Seq analysis, in Neurospora crassa, a filamentous fungus that has been used to make many landmark discoveries in biochemistry and genetics. We show that de novo assembly of a new strain yields complete assemblies of entire chromosomes, and additionally contains entire centromeric sequences. Base-modification analyses reveal candidate sites of increased interpulse duration (IPD) ratio that may signify regions of 5mC, 5hmC, or 6mA base modifications. The Iso-Seq method provides full-length transcript evidence for comprehensive gene annotation, as well as context for the base modifications in the newly assembled genome. Projects that integrate multiple genome-wide assays could become common practice for identifying genomic elements and understanding their function in new strains and organisms.
Long Amplicon Analysis: Highly accurate, full-length, phased, allele-resolved gene sequences from multiplexed SMRT Sequencing data.
The correct phasing of genetic variations is a key challenge for many applications of DNA sequencing. Allele-level resolution is strongly preferred for histocompatibility sequencing where recombined genes can exhibit different compatibilities than their parents. In other contexts, gene complementation can provide protection if deleterious mutations are found on only one allele of a gene. These problems are especially pronounced in immunological domains given the high levels of genetic diversity and recombination seen in regions like the Major Histocompatibility Complex. A new tool for analyzing Single Molecule, Real-Time (SMRT) Sequencing data – Long Amplicon Analysis (LAA) – can generate highly accurate, phased and full-length consensus sequences for multiple genes in a single sequencing run.
Significant advances in bioinformatics tool development have been made to more efficiently leverage and deliver high-quality genome assemblies with PacBio long-read data. Current data throughput of SMRT Sequencing delivers average read lengths ranging from 10-15 kb with the longest reads exceeding 40 kb. This has resulted in consistent demonstration of a minimum 10-fold improvement in genome assemblies with contig N50 in the megabase range compared to assemblies generated using only short-read technologies. This poster highlights recent advances and resources available for advanced bioinformaticians and developers interested in the current state-of-the-art large genome solutions available as open-source code from PacBio and third-party solutions, including HGAP, MHAP, and ECTools. Resources and tools available on GitHub are reviewed, as well as datasets representing major model research organisms made publicly available for community evaluation or interested developers.
For comprehensive metabolic reconstructions and a resulting understanding of the pathways leading to natural products, it is desirable to obtain complete information about the genetic blueprint of the organisms used. Traditional Sanger and next-generation, short-read sequencing technologies have shortcomings with respect to read lengths and DNA-sequence context bias, leading to fragmented and incomplete genome information. The development of long-read, single molecule, real-time (SMRT) DNA sequencing from Pacific Biosciences, with >10,000 bp average read lengths and a lack of sequence context bias, now allows for the generation of complete genomes in a fully automated workflow. In addition to the genome sequence, DNA methylation is characterized in the process of sequencing. PacBio® sequencing has also been applied to microbial transcriptomes. Long reads enable sequencing of full-length cDNAs allowing for identification of complete gene and operon sequences without the need for transcript assembly. We will highlight several examples where these capabilities have been leveraged in the areas of industrial microbiology, including biocommodities, biofuels, bioremediation, new bacteria with potential commercial applications, antibiotic discovery, and livestock/plant microbiome interactions.
Multiplexing human HLA class I & II genotyping with DNA barcode adapters for high throughput research.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DP and -DQ, play a critical role in the immune system as major factors responsible for organ transplant rejection. They have a direct or linkage-based association with several diseases, including cancer and autoimmune diseases, and are important targets for clinical and drug sensitivity research. HLA genes are also highly polymorphic and their diversity originates from exonic combinations as well as recombination events. A large number of new alleles are expected to be encountered if these genes are sequenced through the UTRs. Thus allele-level resolution is strongly preferred when sequencing HLA genes. Pacific Biosciences has developed a method to sequence the HLA genes in their entirety within the span of a single read taking advantage of long read lengths (average >10 kb) facilitated by SMRT technology. A highly accurate consensus sequence (99.999%, or QV50, demonstrated) is generated for each allele in a de novo fashion by our SMRT Analysis software. In the present work, we have combined this imputation-free, fully phased, allele-specific consensus sequence generation workflow and a newly developed DNA-barcode-tagged SMRTbell sample preparation approach to multiplex 96 individual samples for sequencing all of the HLA class I and II genes. Commercially available NGS-go reagents were used to amplify full-length HLA class I genes and relevant exons of class II genes for high-resolution HLA sequencing. The 96 samples included 72 that are part of the UCLA reference panel and had pre-typing information available for 2 fields, based on gold standard SBT methods. SMRTbell adapters with 16 bp barcode tags were ligated to long amplicons in symmetric pairing. PacBio sequencing was highly effective in generating accurate, phased sequences of full-length alleles of HLA genes.
In this work we demonstrate scalability of HLA sequencing using off the shelf assays for research applications to find biological significance in full-length sequencing.
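The consensus accuracy figure quoted alongside QV50 in the abstract above follows from the standard Phred scale, where QV = -10 log10(error probability). A minimal sketch of the conversion, in Python, is given below; the function names are hypothetical and this is not part of SMRT Analysis.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to accuracy (fraction correct)."""
    return 1.0 - 10 ** (-qv / 10)


def accuracy_to_qv(accuracy):
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1.0 - accuracy)


# QV50 corresponds to one expected error per 100,000 bases.
print(f"{qv_to_accuracy(50):.5%}")
print(f"{accuracy_to_qv(0.99999):.1f}")
```

On this scale each additional 10 QV units corresponds to a tenfold reduction in the expected error rate.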
Assembly of complete KIR haplotypes from a diploid individual by the direct sequencing of full-length fosmids.
We show that linearizing and directly sequencing full-length fosmids simplifies the assembly problem such that it is possible to unambiguously assemble individual haplotypes for the highly repetitive 100-200 kb killer Ig-like receptor (KIR) gene loci of chromosome 19. A tiling of targeted fosmids can be used to clone extended lengths of genomic DNA, 100s of kb in length, but repeat complexity in regions of particular interest, such as the KIR locus, means that sequence assembly of pooled samples into complete haplotypes is difficult and in many cases impossible. The current maximum read length generated by SMRT Sequencing exceeds the length of a 40 kb fosmid; it is therefore possible to span an entire fosmid in one sequencing read. Shearing, sequencing and assembling fosmids in a shotgun approach is prone to errors when the underlying sequence is highly repetitive. We show that it is possible to directly sequence linearized fosmids and generate a high-quality consensus by simple alignment, removing the need for an error-prone assembly step. The high-quality sequence of complete fosmids can then be tiled into full haplotypes. We demonstrate the method on DNA samples from a number of individuals and fully recover the sequence of both haplotypes from a pool of KIR fosmids. The ability to haplotype and sequence complex immunogenetic regions will bring exciting opportunities to explore the evolution of disease associations of the immune sub-genome. This simple and robust approach can be scaled-up allowing a complex genomic region to be sequenced at a population level. We expect such sequencing to be valuable in disease association research.
Plant and animal whole genome sequencing has proven to be challenging, particularly due to genome size, high density of repetitive elements and heterozygosity. The Sequel System delivers long reads, high consensus accuracy and uniform coverage, enabling more complete, accurate, and contiguous assemblies of these large complex genomes. The latest Sequel chemistry increases yield up to 8 Gb per SMRT Cell for long insert libraries >20 kb and up to 10 Gb per SMRT Cell for libraries >40 kb. In addition, the recently released SMRTbell Express Template Prep Kit reduces the time (~3 hours) and DNA input (~3 µg), making the workflow easy to use for multi-SMRT Cell projects. Here, we recommend the best practices for whole genome sequencing and de novo assembly of complex plant and animal genomes. Guidelines for constructing large-insert SMRTbell libraries (>30 kb) to generate optimal read lengths and yields using the latest Sequel chemistry are presented. We also describe ways to maximize library yield per preparation from as little as 3 µg of sheared genomic DNA. The combination of these advances makes plant and animal whole genome sequencing a practical application of the Sequel System.
This animation depicts a process by which single molecule SMRTbells are loaded in the Zero Mode Waveguides (ZMWs) of the PacBio RS II sequencing system using the automated MagBead Station.
An animated overview of a PacBio RS instrument run, including loading samples and setting up a sequencing run.
Video: Using the Integrative Genomics Viewer (IGV) to visualize PacBio long-read SMRT Sequencing data
In this video, PacBio scientists present ongoing improvements to the Integrative Genomics Viewer (IGV) and demonstrate how multiple new features improve visualization support for PacBio long-read sequencing data. The video…
Webinar: Bioinformatics lunch & learn – Better assemblies of bacterial genomes and plasmids with the new microbial assembly pipeline in SMRT Link v8.0
Microbial Assembly is our latest pipeline, specifically designed to assemble bacterial genomes (between 2 and 10 Mb) and plasmids. This pipeline includes the implementation of a new, circular-aware read alignment…
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.
The recent advent of long-read sequencing technologies is expected to provide reasonable answers to genetic challenges unresolvable by short-read sequencing, primarily the inability to accurately study structural variations, copy number variations, and homologous repeats in complex parts of the genome. However, long-read sequencing comes with higher rates of random short deletions and insertions, and single nucleotide errors. The relatively higher sequencing accuracy of short-read sequencing has kept it as the first choice for screening single nucleotide variants and short deletions and insertions. Nevertheless, short-read sequencing still suffers from systematic errors that tend to occur at specific positions, where even a high depth of reads cannot always correct them. In this study, we compared the genotyping of mitochondrial DNA variants in three samples using PacBio’s Sequel (Pacific Biosciences Inc., Menlo Park, CA, USA) long-read sequencing and Illumina’s HiSeqX10 (Illumina Inc., San Diego, CA, USA) short-read sequencing data. We concluded that, despite the differences in the type and frequency of errors in long-read sequencing, its accuracy is still comparable to that of short reads for genotyping short nuclear variants; due to the randomness of errors in long reads, a lower coverage, around 37 reads, can be sufficient to correct for these random errors.
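The argument that random errors can be corrected by moderate coverage can be illustrated with a simple binomial model. The sketch below, in Python, is not the method used in the study: it assumes errors are independent between reads, uses a hypothetical per-read error rate of 10%, and pessimistically treats every erroneous read at a position as supporting the same wrong base. The function name is also hypothetical.

```python
from math import comb


def majority_error_probability(coverage, per_read_error):
    """Probability that more than half of the reads are wrong at one position.

    Assumes independent errors between reads and, pessimistically, that all
    erroneous reads agree on the same wrong base (a simplifying assumption).
    """
    threshold = coverage // 2 + 1  # strict majority of reads in error
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


# Hypothetical per-read error rate, for illustration only.
error_rate = 0.10
for depth in (5, 15, 37):
    p_wrong = majority_error_probability(depth, error_rate)
    print(f"coverage {depth:>2}: P(wrong majority) = {p_wrong:.3e}")
```

Under these assumptions the chance of a wrong majority call falls steeply as coverage grows. Systematic errors do not behave this way, because they recur at the same position in many reads and so are not averaged out by depth.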
Genomic analysis of Marinobacter sp. NP-4 and NP-6 isolated from the deep-sea oceanic crust on the western flank of the Mid-Atlantic Ridge
Two Marinobacter strains, NP-4 and NP-6, were isolated from deep oceanic basaltic crust at North Pond, located on the western flank of the Mid-Atlantic Ridge. These two strains are capable of using multiple carbon sources such as acetate, succinate, glucose and sucrose while using oxygen as the primary electron acceptor. Strain NP-4 is also able to grow anaerobically under 20 MPa, with nitrate as the electron acceptor, and thus represents a piezotolerant strain. To explore the metabolic potential of Marinobacter sp. NP-4 and NP-6, the complete genome of NP-4 and the close-to-complete genome of NP-6 were sequenced. The genome of NP-4 contains one chromosome and two plasmids with a total size of 4.6 Mb and an average GC content of 57.0%. The genome of NP-6 is 4.5 Mb and consists of 6 scaffolds, with an average GC content of 57.1%. Complete glycolysis, citrate cycle and aromatic compound degradation pathways are identified in the genomes of these two strains, suggesting that they possess a heterotrophic lifestyle. Additionally, one plasmid of NP-4 contains genes for alkane degradation, a phosphonate ABC transporter and a cation efflux system, giving NP-4 additional survival capabilities. In total, the genomic information of these two strains provides insights into the physiological features and adaptation strategies of Marinobacter spp. in the deep oceanic crust biosphere.