Learn why it is critically important to understand accuracy in DNA sequencing to distinguish important biological information from sequencing errors.
Discover how HiFi reads enable every aspect of viral research, from understanding viral genomes to the host immune response.
The bacteria living on and within us can impact health, disease, and even our behavior, but there is still much to learn about the breadth of their effects. The torrent of new discoveries unleashed by high-throughput sequencing has captured the imagination of scientists and the public alike. Scientists at Second Genome are hoping to apply these insights to improve human health, leveraging their bioinformatics expertise to mine bacterial communities for potential therapeutics. Recently they teamed up with scientists at PacBio to explore how long-read sequencing might supplement their short-read-based pipeline for gene discovery, using an environmental sample as a test case. They were especially interested in identifying unique, complete, and error-free gene clusters in metagenomic assemblies.
To bring personalized medicine to all patients, cancer researchers need more reliable and comprehensive views of somatic variants of all sizes that drive cancer biology.
Discover the benefits of HiFi reads and learn how highly accurate long-read sequencing provides a single technology solution across a range of applications.
Highly accurate long reads – HiFi reads – with single-molecule resolution make Single Molecule, Real-Time (SMRT) Sequencing ideal for full-length 16S rRNA sequencing, shotgun metagenomic profiling, and metagenome assembly.
Single Molecule Real Time (SMRT) sequencing sensitively detects polyclonal and compound BCR-ABL in patients who relapse on kinase inhibitor therapy.
Secondary kinase domain (KD) mutations are the best-recognized mechanism of resistance to tyrosine kinase inhibitors (TKIs) in chronic myeloid leukemia (CML) and other cancers. In some cases, multiple drug-resistant KD mutations can coexist in an individual patient (“polyclonality”). Alternatively, more than one mutation can occur in tandem on a single allele (“compound mutations”) following response and relapse to sequentially administered TKI therapy. Distinguishing between these two scenarios can inform the clinical choice of subsequent TKI treatment. There is currently no clinically adaptable methodology that offers the ability to distinguish polyclonal from compound mutations. Due to the size of the BCR-ABL KD where TKI-resistant mutations are detected, next-generation platforms are unable to generate reads of sufficient length to determine if two mutations separated by 500 nucleotides reside on the same allele. Pacific Biosciences RS Single Molecule Real-Time (SMRT) circular consensus sequencing technology is a novel third-generation deep sequencing technology capable of rapidly and reliably achieving average read lengths of ~1000 bp and frequently beyond 3000 bp, allowing sequencing of the entire ABL KD on a single strand of DNA. We sought to address the ability of SMRT sequencing technology to distinguish polyclonal from compound mutations using clinical samples obtained from patients who have relapsed on BCR-ABL TKI treatment.
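The distinction between polyclonal and compound mutations comes down to whether two mutations are seen on the same read. A minimal Python sketch of that per-read tally follows. The reads, positions, and alleles are invented toy values, and the function name `classify_reads` is hypothetical; this illustrates the idea and is not the analysis used in the study.

```python
from collections import Counter

def classify_reads(reads, pos_a, alt_a, pos_b, alt_b):
    """Tally reads by which of two mutations they carry.

    Assumes every read spans both positions and shares one coordinate
    system (for example, reads already aligned to a reference).
    """
    counts = Counter()
    for read in reads:
        has_a = read[pos_a] == alt_a
        has_b = read[pos_b] == alt_b
        if has_a and has_b:
            counts["both"] += 1      # both mutations on one molecule: compound
        elif has_a:
            counts["a_only"] += 1    # single-mutant molecule
        elif has_b:
            counts["b_only"] += 1    # single-mutant molecule
        else:
            counts["neither"] += 1   # wild type at both sites
    return counts

# Toy aligned reads (hypothetical); mutation sites are positions 1 and 4.
reads = ["ATGCA", "ATGCT", "ACGCA", "ACGCT", "ACGCT"]
counts = classify_reads(reads, 1, "C", 4, "T")
print(dict(counts))
```

Reads in the `both` bin are evidence for a compound mutation, while a sample containing only `a_only` and `b_only` reads is consistent with polyclonality. Real data would also need a threshold that accounts for sequencing and PCR error.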
SMRT Sequencing of whole mitochondrial genomes and its utility in association studies of metabolic disease.
In this study we demonstrate the utility of Single Molecule, Real-Time (SMRT) Sequencing to detect variants and to recapitulate whole mitochondrial genomes in an association study of metabolic syndrome (MetS) using samples from a well-studied cohort from Micronesia. The Micronesian island of Kosrae is a rare genetic isolate that offers significant advantages for genetic studies of human disease. Kosrae has some of the highest rates of MetS (41%), obesity (52%), and diabetes (17%) globally, and its homogeneous environment makes it an excellent population in which to study these significant health problems. We are conducting family-based association analyses aimed at identifying specific mitochondrial variants that contribute to obesity and other co-morbid conditions. We sequenced whole mitochondrial genomes from 10 Kosraen individuals who represent greater than 25% of the mitochondrial genetic diversity for the entire Kosraen population. Using Pacific Biosciences C2 chemistry, SMRTbell libraries were constructed from pooled, full-length, unsheared 5 kb PCR amplicons, tiling the entire 16.6 kb mtDNA genome. Average read lengths for each sample were 2,500-3,000 bp, with 5% of reads between 6,000 and 8,000 bases, depending on movie lengths. The data generated in this study serve as proof of principle that SMRT Sequencing data can be utilized for identification of high-quality variants and complete mitochondrial genome sequences. These data will be leveraged to identify causative variants for metabolic syndrome and associated disorders.
Complete HIV-1 genomes from single molecules: Diversity estimates in two linked transmission pairs using clustering and mutual information.
We sequenced complete HIV-1 genomes from single molecules using Single Molecule, Real-Time (SMRT) Sequencing and derived de novo full-length genome sequences. SMRT sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. We detail mathematical techniques used in viral variant subspecies identification including clustering distance metrics and mutual information. Sequencing was performed in order to better understand the relationships between the specific sequences of transmitted viruses in linked transmission pairs. Samples representing HIV transmission pairs were selected from the Zambia Emory HIV Research Project (Lusaka, Zambia) and sequenced. We examine Single Genome Amplification (SGA) prepped samples and samples containing complex mixtures of genomes. Whole genome consensus estimates for each of the samples were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. SMRT Sequencing data contained multiple full-length (greater than 9 kb) continuous reads for each sample. Simple whole genome consensus estimates easily identified transmission pairs. The clustering of the genome reads showed diversity differences between the samples, allowing us to characterize the diversity of the individual quasi-species comprising the patient viral populations across the full genome.
Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. The results open up the potential for reference-agnostic and cost effective full genome sequencing of HIV-1.
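To make the mutual information step concrete, the following Python sketch computes mutual information (in bits) between two columns of a read alignment. The toy alignment and the helper names `column` and `mutual_information` are assumptions for illustration; the abstract does not describe the implementation the authors used.

```python
import math
from collections import Counter

def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [row[i] for row in alignment]

def mutual_information(col_x, col_y):
    """Mutual information in bits between two alignment columns.

    I(X;Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))),
    with probabilities estimated from observed counts.
    """
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Toy alignment (hypothetical): columns 0 and 1 vary together, column 2 is constant.
alignment = ["AAG", "AAG", "TTG", "TTG"]
mi_linked = mutual_information(column(alignment, 0), column(alignment, 1))
mi_constant = mutual_information(column(alignment, 0), column(alignment, 2))
print(mi_linked, mi_constant)
```

Columns that always change together score high, while a column that never varies carries no information about any other column. With small read counts, raw estimates like this are biased upward, so a real analysis would typically compare against a shuffled-column baseline.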
Background: To better understand the relationships among HIV-1 viruses in linked transmission pairs, we sequenced several samples representing HIV transmission pairs from the Zambia Emory HIV Research Project (Lusaka, Zambia) using Single Molecule, Real-Time (SMRT) Sequencing. Methods: Single molecules were sequenced as full-length (9.6 kb) amplicons directly from PCR products without shearing. This resulted in multiple, fully-phased, complete HIV-1 genomes for each patient. We examined Single Genome Amplification (SGA) prepped samples, as well as samples containing complex mixtures of genomes. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information, which were used to derive multiple de novo full-length genome sequences for each patient. Whole genome consensus estimates for each sample were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV-1 genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. Results: SMRT Sequencing data contained multiple full-length (>9 kb) continuous reads for each sample. Simple whole-genome consensus estimates easily identified transmission pairs. Clustering of genome reads showed diversity differences between samples, allowing characterization of the quasi-species diversity comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. 
Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. The results open up the potential for reference-agnostic and cost effective full genome sequencing of HIV-1.
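The clustering and counting steps can be sketched as follows. This is a minimal Python illustration under stated assumptions: Hamming distance stands in for the unspecified "simple distance metric", single-linkage grouping stands in for the unspecified clustering procedure, and the reads and threshold are invented toy values.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length aligned reads."""
    return sum(1 for x, y in zip(a, b) if x != y)

def cluster_reads(reads, threshold):
    """Single-linkage clustering: reads within `threshold` mismatches are merged.

    Uses a small union-find structure; returns clusters largest first.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if hamming(reads[i], reads[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, read in enumerate(reads):
        clusters.setdefault(find(i), []).append(read)
    return sorted(clusters.values(), key=len, reverse=True)

# Toy aligned reads (hypothetical): two underlying genomes plus one error each.
reads = ["AAAAAAAA", "AAAAAAAT", "AAAAAAAA", "GGGGCCCC", "GGGGCCCT"]
clusters = cluster_reads(reads, threshold=1)

# Relative abundance by counting reads per cluster.
abundances = [len(c) / len(reads) for c in clusters]
print(len(clusters), abundances)
```

Because each read comes from a single molecule, cluster sizes translate directly into relative abundance estimates. The threshold trades off merging distinct variants against splitting one variant over sequencing errors, which is why the abstract describes choosing it per sample.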
A comparison of 454 GS FLX Ti and PacBio RS in the context of characterizing HIV-1 intra-host diversity.
PacBio 2013 User Group Meeting Presentation Slides: Lance Hepler from UC San Diego’s Center for AIDS Research used the PacBio RS to study intra-host diversity in HIV-1. He compared PacBio’s performance to that of the 454® sequencer, the platform he and his team previously used. Hepler noted that in general, there was strong agreement between the platforms; where results differed, he said that PacBio data had significantly better reproducibility and accuracy. “PacBio does not suffer from local coverage loss post-processing, whereas 454 has homopolymer problems,” he noted. Hepler said they are moving away from using 454 in favor of the PacBio system.
Background: Genotypic testing of chronic viral infections is an important part of patient therapy and requires assays capable of detecting the entire spectrum of viral mutations. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over other sequencing technologies, including superior resolution of mixed populations and long read lengths capable of spanning entire viral protein coding regions. We examined detection sensitivity of SMRT sequencing using a mixture of HIV-1 RT gene coding regions containing single NNRTI mutations. Methodology: SMRTbell templates were prepared from PCR products generated from a prospective reference material being developed by the BC Center of Excellence for HIV/AIDS, and contained a mixture of fifteen infectious viruses containing single NNRTI resistance mutations (viz V90I, K101E, K103N, V108I, E138A/G/K/Q, V179D, Y181C, Y188C, G190A/S, M230L and P236L) built upon the HIV-1LAI molecular clone. Templates were sequenced on the PacBio RS II to obtain single molecule long reads using P4/C2 chemistry, using 180-minute movie collection without stage start. The relative abundances of the mutant viruses were then estimated using codon-aware analysis methods. Results: Sequencing of these templates produced average read lengths of 5.0 kb, comprising 40,000-fold coverage across the entire amplicon per SMRT Cell. All the expected mutations in the mixture of mutant viruses were accurately identified. Frequencies of NNRTI variants estimated ranged from 0.5% to 12.5%. Conclusions: Codon analysis revealed a number of variants across the amplicon with highly consistent results across SMRT Cells. From a single SMRT Cell, variants were accurately and reliably detected down to 0.5% with simple analyses. Long polymerase reads and high accuracy reads make it possible to call variants from just a few molecules.
SMRT Sequencing can identify species comprising a mixed viral population, with granularity and low cost of consumables allowing for smaller multiplexing of samples and first-in-first-out processing.
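As an illustration of what a codon-aware frequency estimate might look like, the Python sketch below translates one codon position across in-frame reads and reports amino acid frequencies. The reads, the codon index, and the function name `amino_acid_frequencies` are hypothetical; only the standard genetic code is taken as given, and the actual analysis methods are not described in the abstract.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO[i] for i, codon in enumerate(product(BASES, repeat=3))
}

def amino_acid_frequencies(reads, codon_index):
    """Frequency of each amino acid at one codon position (0-based).

    Assumes reads are aligned and in frame; codons containing gaps or
    ambiguous bases are skipped rather than guessed.
    """
    counts = Counter()
    start = 3 * codon_index
    for read in reads:
        codon = read[start:start + 3].upper()
        if codon in CODON_TABLE:
            counts[CODON_TABLE[codon]] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()} if total else {}

# Toy reads (hypothetical): codon 1 is AAA (K) in most reads, AAC or AAT (N) in a few.
reads = ["ATGAAAGGG"] * 197 + ["ATGAACGGG"] * 2 + ["ATGAATGGG"]
freqs = amino_acid_frequencies(reads, codon_index=1)
print(freqs)
```

Counting at the codon level pools synonymous codons, so here AAC and AAT both contribute to the asparagine frequency, which a per-base tally would split across two separate variants.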
Background: The use of next-generation sequencing (NGS) to examine circulating HIV env variants has been limited due to env’s length (2.6 kb), extensive indel polymorphism, GC deficiency, and long homopolymeric regions. We developed and standardized protocols for isolation, RT-PCR amplification, single molecule real-time (SMRT) sequencing, and haplotype analysis of circulating HIV-1 env variants to evaluate viral diversity in primary infection. Methodology: HIV RNA was extracted from 7 blood plasma samples (1 mL) collected from 5 subjects (one individual sampled and sequenced at 3 time points) in the San Diego Primary Infection Cohort between 3-33 months from their estimated date of infection (EDI). Median viral load per sample was 50,118 HIV RNA copies/mL (range: 22,387-446,683). Full-length (3.2 kb) env amplicons were constructed into SMRTbell templates without shearing, and sequenced on the PacBio RS II using P4/C2 chemistry and 180-minute movie collection without stage start. To examine viral diversity in each sample, we determined haplotypes by clustering circular consensus sequences (CCS), and reconstructing a cluster consensus sequence using a partial order alignment approach. We measured sample diversity both as the mean pairwise distance among reads, and the fraction of reads containing indel polymorphisms. Results: We collected a median of 8,775 CCS reads per SMRT Cell (range: 4,243-12,234). A median of 7 haplotypes per subject (range: 1-55) were inferred at baseline. For the one subject with longitudinal samples analyzed, we observed an increasing number of distinct haplotypes (8 to 55 haplotypes over the course of 30 months), and an increasing mean pairwise distance among reads (from 0.8% to 1.6%, Tamura-Nei 93). We also observed significant indel polymorphism, with 16% of reads from one sample later in infection (33 months post-EDI) exhibiting deletions of more than 10% of env with respect to the reference strain, HXB2.
Conclusions: This study developed a standardized NGS procedure (PacBio SMRT) to deep sequence full-length HIV RNA env variants from the circulating viral population, achieving good coverage, confirming low env diversity during primary infection that increased over time, and revealing significant indel polymorphism that highlights structural variation as important to env evolution. The long, accurate reads greatly simplified downstream bioinformatics analyses, especially haplotype phasing, increasing our confidence in the results. The sequencing methodology and analysis tools developed here could be successfully applied to any area for which full-length HIV env analysis would be useful.
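The two diversity measures named above can be sketched in Python as follows. Note the substitution: this sketch uses the simple proportion of differing sites (p-distance) rather than the Tamura-Nei 93 model reported in the abstract, and it assumes reads are aligned to the reference with no insertion columns, so that gap characters represent deletions. The reads and function names are invented for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, ignoring columns gapped in either read."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def mean_pairwise_distance(reads):
    """Average p-distance over all unordered pairs of reads."""
    dists = [p_distance(a, b) for a, b in combinations(reads, 2)]
    return sum(dists) / len(dists) if dists else 0.0

def fraction_with_large_deletion(reads, min_fraction=0.10):
    """Fraction of reads whose deleted (gap) sites exceed `min_fraction` of the length."""
    flagged = sum(1 for r in reads if r.count("-") / len(r) > min_fraction)
    return flagged / len(reads)

# Toy aligned reads (hypothetical), each 10 columns long.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGT--GTAC"]
mean_dist = mean_pairwise_distance(reads)
deleted_fraction = fraction_with_large_deletion(reads)
print(round(mean_dist, 4), deleted_fraction)
```

Keeping substitutions and deletions as separate measures mirrors the abstract's approach: gapped columns are excluded from the distance so that large deletions do not inflate it, and are instead summarized by the fraction of affected reads.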
Background: HIV-1 proviruses in peripheral blood mononuclear cells (PBMCs) are thought to be an important reservoir of HIV-1 infection. Given that this pool represents an archival library, it can be used to study virus evolution and CD4+ T cell survival. Accurate study of this pool is burdened by difficulties encountered in sequencing a full-length proviral genome, typically accomplished by assembling overlapping pieces and imputing the full genome. Methodology: Cryopreserved PBMCs collected from a total of 8 HIV+ patients from 1997-2001 were used for genomic DNA extraction. Patients had been receiving cART for 2-8 years at the time samples were obtained. 7 patients had pVL >50 copies/mL (mean: 312,282, range: 18,372-683,400) and 1 had pVL <50. Genomic DNA was subjected to limiting dilution prior to amplification of near-full-length genomes by a newly developed nested PCR. The predicted size of the PCR product was 9.0 kb, spanning from the 5’ LTR through the 3’ LTR. Single molecules were sequenced as near-full-length amplicons directly from PCR products without shearing using commercially available P4-C2 reagents and standard protocols on a PacBio RS II instrument. Quality of the genomes was validated by clonal positive controls and synthetic mixtures. Results: Near-full-length provirus genome sequences were successfully obtained from all 8 patients as continuous long reads from single molecules. PacBio sequencing required approximately 10% of the PCR product needed for Sanger sequencing and generated 325 MB per 3-hour run including 1,800 full-length intact genome reads on average. One patient’s sample was not at a limiting dilution and analysis revealed multiple subspecies. For 8 near-full-length provirus genomes derived from the other 7 patients, large internal deletions were noted in 2 proviruses; APOBEC-mediated hypermutations were seen in 2 proviruses; and 4 proviruses appeared to be intact genomes.
All of the defective proviruses showed a complete absence of resistance mutations in either RT or protease, even after 2-8 years of cART. On the contrary, all of the intact proviruses contained evidence of ART-resistance associated mutations suggesting that they represented relatively recent variants. Conclusions: Combining a novel protocol for full-length limiting dilution amplification of proviruses with PacBio SMRT sequencing allowed for the generation of near-full-length genomes with good quality and an ability to detect minor variants at the 1-10% level. Preliminary data analyses suggest that defective proviruses may represent archival variants that persist long-term in host cells, while intact proviruses within the PBMC pool showing evidence of active virus replication may represent more recent variants.
Background: Microbial ecology is reshaping our understanding of the natural world by revealing the large phylogenetic and functional diversity of microbial life. However, the vast majority of these microorganisms remain poorly understood, as most cultivated representatives belong to just four phylogenetic groups and more than half of all identified phyla remain uncultivated. Characterization of this microbial ‘dark matter’ will thus greatly benefit from new metagenomic methods for in situ analysis. One example is sensitive, high-throughput methods that characterize community composition and structure by sequencing conserved marker genes. Methods: Here we utilize Single Molecule, Real-Time (SMRT) sequencing of full-length 16S rRNA amplicons to phylogenetically profile microbial communities to below the genus level. We test this method on a mock community of known composition, as well as a previously studied microbial community from a lake known to predominantly contain poorly characterized phyla. These results are compared to traditional 16S tag sequencing from short-read technologies and subsets of the full-length data corresponding to the same regions of the 16S gene. Results: We explore the benefits of using full-length amplicons for estimating community structure and diversity. In addition, we investigate the possible effects of context-specific and GC-content biases known to affect short-read sequencing technologies on the predicted community structure. We characterize the potential benefits of profiling metagenomic communities with full-length 16S rRNA genes from SMRT sequencing relative to standard methods.