The assembly of metagenomes is dramatically improved by the long read lengths of SMRT Sequencing. This is demonstrated in an experimental design to sequence a mock community from the Human Microbiome Project, and assemble the data using the hierarchical genome assembly process (HGAP) at Pacific Biosciences. Results of this analysis are promising, and display much improved contiguity in the assembly of the mock community as compared to publicly available short-read data sets and assemblies. Additionally, the use of base modification information to make further associations between contigs provides additional data to improve assemblies, and to distinguish between members within a microbial community. The epigenetic approach is a novel validation method unique to SMRT Sequencing. In addition to whole-genome shotgun sequencing, SMRT Sequencing also offers improved classification resolution and reliability of metagenomic and microbiome samples by the full-length sequencing of 16S rRNA (~1500 bases long). Microbial communities can be detected at the species level in some cases, rather than being limited to the genus taxonomic classification as constrained by short-read technologies. The performance of SMRT Sequencing for these metagenomic samples achieved >99% predicted concordance to reference sequences in cecum, soil, water, and mock control investigations for bacterial 16S. Community samples are estimated to contain from 2.3 and up to 15 times as many species with abundance levels as low as 0.05% compared to the identification of phyla groups.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed by the coverage biases associated with short- read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obsolete the need for assembly. With direct sequencing of DNA in real-time, this has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
Complex alternative splicing patterns in hematopoietic cell subpopulations revealed by third-generation long reads.
Background: Alternative splicing expands the repertoire of gene functions and is a signature for different cell populations. Here we characterize the transcriptome of human bone marrow subpopulations including progenitor cells to understand their contribution to homeostasis and pathological conditions such as atherosclerosis and tumor metastasis. To obtain full-length transcript structures, we utilized long reads in addition to RNA-seq for estimating isoform diversity and abundance. Method: Freshly harvested, viable human bone marrow tissues were extracted from discarded harvesting equipment and separated into total bone marrow (total), lineage-negative (lin-) progenitor cells and differentiated cells (lin+) by magnetic bead sorting with antibodies to surface markers of hematopoietic cell lineages. Sequencing was done with SOLiD, Illumina HiSeq (100bp paired-end reads), and PacBio RS II (full-length cDNA library protocol for 1 – 6 kb libraries). Short reads were assembled using both Trinity for de novo assembly and Cufflinks for genome-guided assembly. Full-length transcript consensus sequences were obtained for the PacBio data using the RS_IsoSeq protocol from PacBios SMRTAnalysis software. Quantitation for each sample was done independently for each sequencing platform using Sailfish to obtain the TPM (transcripts per million) using k-mer matching. Results: PacBios long read sequencing technology is capable of sequencing full-length transcripts up to 10 kb and reveals heretofore-unseen isoform diversity and complexity within the hematopoietic cell populations. A comparison of sequencing depth and de novo transcript assembly with short read, second-generation sequencing reveals that, while short reads provide precision in determining portions of isoform structure and supporting larger 5 and 3 UTR regions, it fails in providing a complete structure especially when multiple isoforms are present at the same locus. Increased breadth of isoform complexity is revealed by long reads that permits further elaboration of full isoform diversity and specific isoform abundance within each separate cell population. Sorting the distribution of major and minor isoforms reveals a cell population-specific balance focused on distinct genome loci and shows how tissue specificity and diversity are modulated by alternative splicing.
Lameness is a significant problem resulting in millions of dollars in lost revenue annually. In commercial broilers, the most common cause of lameness is bacterial chondronecrosis with osteomyelitis (BCO). We are using a wire flooring model to induce lameness attributable to BCO. We used 16S ribosomal DNA sequencing to determine that Staphylococcus spp. were the main species associated with BCO. Staphylococcus agnetis, which previously had not been isolated from poultry, was the principal species isolated from the majority of the bone lesion samples. Administering S. agnetis in the drinking water to broilers reared on wire flooring increased the incidence of BCO three-fold when compared with broilers drinking tap water (P = 0.001). We found that the minimum effective dose of Staphylococcus agnetis to induce BCO in broilers grown on wire flooring experiment is 105 cfu/ml. We used PacBio and Illumina sequencing to assemble a 2.4 Mbp contig representing the genome and a 34 kbp contig for the largest plasmid of S. agnetis. Annotation of this genome is underway through comparative genomics with other Staphylococcus genomes, and identification of virulence factors. Our goal is to elucidate genetic diversity, toxins, and pathogenicity determinants, for this poorly characterized species. Isolating pathogenic bacterial species, defining their likely route of transmission to broilers, and genomic analyses will contribute substantially to the development of measures for mitigating BCO losses in poultry.
For comprehensive metabolic reconstructions and a resulting understanding of the pathways leading to natural products, it is desirable to obtain complete information about the genetic blueprint of the organisms used. Traditional Sanger and next-generation, short-read sequencing technologies have shortcomings with respect to read lengths and DNA-sequence context bias, leading to fragmented and incomplete genome information. The development of long-read, single molecule, real-time (SMRT) DNA sequencing from Pacific Biosciences, with >10,000 bp average read lengths and a lack of sequence context bias, now allows for the generation of complete genomes in a fully automated workflow. In addition to the genome sequence, DNA methylation is characterized in the process of sequencing. PacBio® sequencing has also been applied to microbial transcriptomes. Long reads enable sequencing of full-length cDNAs allowing for identification of complete gene and operon sequences without the need for transcript assembly. We will highlight several examples where these capabilities have been leveraged in the areas of industrial microbiology, including biocommodities, biofuels, bioremediation, new bacteria with potential commercial applications, antibiotic discovery, and livestock/plant microbiome interactions.
Background: Understanding the co-evolution of HIV populations and broadly neutralizing antibodies (bNAbs) may inform vaccine design. Novel long-read, next-generation sequencing methods allow, for the first time, full-length deep sequencing of HIV env populations. Methods: We longitudinally examined HIV-1 env populations (12 time points) in a subtype A infected individual from the IAVI primary infection cohort (Protocol C) who developed bNAbs (62% ID50>50 on a diverse panel of 105 viruses) targeting the V1/V2 loop region. We developed a PacBio single molecule, real-time sequencing protocol to deeply sequence full-length env from HIV RNA. Bioinformatics tools were developed to align env sequences, infer phylogenies, and interrogate escape dynamics of key residues and glycosylation sites. PacBio env sequences were compared to env sequences generated through amplification and cloning. Env dynamics and viral escape motif evolution were interpreted in the context of the development V1/V2-targeting broadly neutralizing antibodies. Results: We collected a median of 6799 (range: 1770-14727) high quality full-length HIV env circular consensus sequences (CCS) per SMRT Cell, per time point. Using only CCS reads comprised of 6 or more passes over the HIV env insert (= 16 kb read length) ensured that our median per-base accuracy was 99.7%. A phylogeny inferred with PacBio and 100 cloned env sequences (10 time points) found the cloned sequences evenly distributed among PacBio sequences. Viral escape from the V1/V2 targeted bNAbs was evident at V2 positions 160, 166, 167, 169 and 181 (HxB2 numbering), exhibiting several distinct escape pathways by 40 months post-infection. Conclusions: Our PacBio full-length env sequencing method allowed unprecedented view and ability to characterize HIV-1 env dynamics throughout the first four years of infection. Longitudinal full-length env deep sequencing allows accurate phylogenetic inference, provides a detailed picture of escape dynamics in epitope regions, and can identify minority variants, all of which will prove critical for increasing our understanding of how env evolution drives the development of antibody breadth.
A large number of distinct HIV-1 genomes can be present in a single clinical sample from a patient chronically infected with HIV-1. We examined samples containing complex mixtures of near-full-length HIV-1 genomes. Single molecules were sequenced as near-full-length (9.6 kb) amplicons directly from PCR products without shearing. Mathematical analysis techniques deconvolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. We correctly estimated the originating genomes to single-base resolution along with their relative abundances for mixtures where the truth was known exactly by independent sequencing methods. Correct estimates were made even when genomes diverged by a single base. Minor abundances of 5% were reliably detected. SMRT Sequencing data contained near-full-length continuous reads for each sample including some runs with greater than 10,000 near-full-length-genome reads in a three-hour collection time. SMRT Sequencing yields long- read sequencing results from individual DNA molecules with a rapid time-to-result. The single-molecule, full-length nature of the sequencing method allows us to estimate variant subspecies and relative abundances even from samples containing complex mixtures of genomes that differ by single bases. These results open the possibility of cost-effective full-genome sequencing of HIV-1 in mixed populations for applications such as incorporated-HIV-1 screening. In screening, genomes can differ by one to many thousands of bases and the ability to measure them can help scientifically inform treatment strategies.
In 2012, NIST convened the Genome in a Bottle Consortium to develop the metrology infrastructure needed to enable confidence in human whole genome variant calls.
Background: Understanding the co-evolution of HIV populations and broadly neutralizing antibody (bNAb) lineages may inform vaccine design. Novel long-read, next-generation sequencing methods allow, for the first time, full-length deep sequencing of HIV env populations. Methods: We longitudinally examined env populations (12 time points) in a subtype A infected individual from the IAVI primary infection cohort (Protocol C) who developed bNAbs (62% ID50>50 on a diverse panel of 105 viruses) targeting the V1/V2 region. We developed a Pacific Biosciences single molecule, real-time sequencing protocol to deeply sequence full-length env from HIV RNA. Bioinformatics tools were developed to align env sequences, infer phylogenies, and interrogate escape dynamics of key residues and glycosylation sites. PacBio env sequences were compared to env sequences generated through amplification and cloning. Env dynamics were interpreted in the context of the development of a V1/V2-targeting bNAb lineage isolated from the donor. Results: We collected a median of 6799 high quality full-length env sequences per timepoint (median per-base accuracy of 99.7%). A phylogeny inferred with PacBio and 100 cloned env sequences (10 time points) found cloned env sequences evenly distributed among PacBio sequences. Phylogenetic analyses also revealed a potential transient intra-clade superinfection visible as a minority variant (~5%) at 9 months post-infection (MPI), and peaking in prevalence at 12MPI (~64%), just preceding the development of heterologous neutralization. Viral escape from the bNAb lineage was evident at V2 positions 160, 166, 167, 169 and 181 (HxB2 numbering), exhibiting several distinct escape pathways by 40MPI. Conclusions: Our PacBio full-length env sequencing method allowed unprecedented characterization of env dynamics and revealed an intra-clade superinfection that was not detected through conventional methods. The importance of superinfection in the development of this donor’s V1/V2-directed bNAb lineage is under investigation. Longitudinal full-length env deep sequencing allows accurate phylogenetic inference, provides a detailed picture of escape dynamics in epitope regions, and can identify minority variants, all of which may prove useful for understanding how env evolution can drive the development of antibody breadth.
Background: The HIV-1 proviral reservoir is incredibly stable, even while undergoing antiretroviral therapy, and is seen as the major barrier to HIV-1 eradication. Identifying and comprehensively characterizing this reservoir will be critical to achieving an HIV cure. Historically, this has been a tedious and labor intensive process, requiring high-replicate single-genome amplification reactions, or overlapping amplicons that are then reconstructed into full-length genomes by algorithmic imputation. Here, we present a deep sequencing and analysis method able to determine the exact identity and relative abundances of near-full-length HIV genomes from samples containing mixtures of genomes without shearing or complex bioinformatic reconstruction. Methods: We generated clonal near-full-length (~9 kb) amplicons derived from single genome amplification (SGA) of primary proviral isolates or PCR of well-documented control strains. These clonal products were mixed at various abundances and sequenced as near-full-length (~9 kb) amplicons without shearing. Each mixture yielded many near-full-length HIV-1 reads. Mathematical analysis techniques resolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. Results: Single Molecule, Real-Time (SMRT) Sequencing data contained near-full-length (~9 kb) continuous reads for each sample including some runs with greater than 10,000 near-full-length-genome reads in a three-hour sequencing run. Our methods correctly recapitulated exactly the originating genomes at a single-base resolution and their relative abundances in both mixtures of clonal controls and SGAs, and these results were validated using independent sequencing methods. Correct resolution was achieved even when genomes differed only by a single base. Minor abundances of 5% were reliably detected. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules, a rapid time-to-result. The single-molecule, full-length nature of this sequencing method allows us to estimate variant subspecies and relative abundances with single-nucleotide resolution. This method allows for reference-agnostic and cost-effective full-genome sequencing of HIV-1, which could both further our understanding of latent infection and develop novel and improved tools for quantifying HIV provirus, which will be critical to cure HIV.
Microbial genome sequencing can be done quickly, easily, and efficiently with the PacBio sequencing instruments, resulting in complete de novo assemblies. Alternative protocols have been developed to reduce the amount of purified DNA required for SMRT Sequencing, to broaden applicability to lower-abundance samples. If 50-100 ng of microbial DNA is available, a 10-20 kb SMRTbell library can be made. A 2 kb SMRTbell library only requires a few ng of gDNA when carrier DNA is added to the library. The resulting libraries can be loaded onto multiple SMRT Cells, yielding more than enough data for complete assembly of microbial genomes using the SMRT Portal assembly program HGAP, plus base-modification analysis. The entire process can be done in less than 3 days by standard laboratory personnel. This approach is particularly important for the analysis of metagenomic communities, in which genomic DNA is often limited. From these samples, full-length 16S amplicons can be generated, prepped with the standard SMRTbell library prep protocol, and sequenced. Alternatively, a 2 kb sheared library, made from a few ng of input DNA, can also be used to elucidate the microbial composition of a community, and may provide information about biochemical pathways present in the sample. In both these cases, 1-2 kb reads with >99% accuracy can be obtained from Circular Consensus Sequencing.
Profiling metagenomic communities using circular consensus and Single Molecule, Real-Time Sequencing.
There are many sequencing-based approaches to understanding complex metagenomic communities spanning targeted amplification to whole-sample shotgun sequencing. While targeted approaches provide valuable data at low sequencing depth, they are limited by primer design and PCR amplification. Whole-sample shotgun experiments generally use short-read, second-generation sequencing, which results in data processing difficulties. For example, reads less than 1 kb in length will likely not cover a complete gene or region of interest, and will require assembly. This not only introduces the possibility of incorrectly combining sequence from different community members, it requires a high depth of coverage. As such, rare community members may not be represented in the resulting assembly. Circular-consensus, single molecule, real-time (SMRT) Sequencing reads in the 1-2 kb range, with >99% accuracy can be efficiently generated for low amounts of input DNA. 10 ng of input DNA sequenced in 4 SMRT Cells would generate >100,000 such reads. While throughput is low compared to second-generation sequencing, the reads are a true random sampling of the underlying community, since SMRT Sequencing has been shown to have no sequence-context bias. Long read lengths mean that that it would be reasonable to expect a high number of the reads to include gene fragments useful for analysis.
We have developed barcoding reagents and workflows for multiplexing amplicons or fragmented native genomic (DNA) prior to Single Molecule, Real-Time (SMRT) Sequencing. The long reads of PacBio’s SMRT Sequencing enable detection of linked mutations across multiple kilobases (kb) of sequence. This feature is particularly useful in the context of mutational analysis or SNP confirmation, where a large number of samples are generated routinely. To validate this workflow, a set of 384 1.7-kb amplicons, each derived from variants of the Phi29 DNA polymerase gene, were barcoded during amplification, pooled, and sequenced on a single SMRT Cell. To demonstrate the applicability of the method to longer inserts, a library of 96 5-kb clones derived from the E. coli genome was sequenced.
High-throughput sequencing of the complete 16S rRNA gene has become a valuable tool for characterizing microbial communities. However, the short reads produced by second-generation sequencing cannot provide taxonomic classification below the genus level. In this study, we demonstrate the capability of PacBio’s Single Molecule, Real-Time (SMRT) Sequencing to generate community profiles using mock microbial community samples from BEI Resources. We also evaluate multiplexing capabilities using PacBio barcodes on pooled samples comprising heterogeneous 16S amplicon populations representing soil, fecal, and mock communities.
The Genome in a Bottle Consortium is developing the reference materials, reference methods , and reference data n