June 1, 2021

Harnessing kinetic information in Single-Molecule, Real-Time Sequencing.

Single-Molecule, Real-Time (SMRT) DNA sequencing is unique in that nucleotide incorporation events are monitored in real time, yielding a wealth of kinetic information in addition to the primary DNA sequence. The observed dynamics of the DNA polymerase add a further dimension of sequence-dependent information that can be used to learn more about the molecule under study. First, the primary sequence itself can be determined more accurately: the kinetic data can be used to corroborate or overturn consensus calls and even enable calling bases in problematic sequence contexts. Second, using the kinetic information, we can detect and discriminate numerous chemical base modifications as a by-product of ordinary sequencing. Examples of applying these capabilities include (i) the characterization of the epigenome of microorganisms by directly sequencing the three common prokaryotic epigenetic base modifications of 4-methylcytosine, 5-methylcytosine and 6-methyladenine; (ii) the characterization of known and novel methyltransferase activities; (iii) the direct sequencing and differentiation of the four eukaryotic epigenetic forms of cytosine (5-methyl-, 5-hydroxymethyl-, 5-formyl-, and 5-carboxylcytosine) with first applications to map them with single base-pair and DNA strand resolution across mammalian genomes; (iv) the direct sequencing and identification of numerous modified DNA bases arising from DNA damage; and (v) an exploration of the mitochondrial genome for known and novel base modifications. We will show our progress towards a generic, open-source algorithm for exploiting kinetic information for any of these purposes.


June 1, 2021

Direct sequencing and identification of damaged DNA bases.

DNA is under constant stress from both endogenous and exogenous sources. DNA base modifications resulting from various types of DNA damage are widespread and play important roles in affecting physiological states and disease phenotypes. Examples include oxidative damage (8-oxoguanine, 8-oxoadenine; aging, Alzheimer’s, Parkinson’s), alkylation (1-methyladenine, 6-O-methylguanine; cancer), adduct formation (benzo[a]pyrene diol epoxide (BPDE), pyrimidine dimers; smoking, industrial chemical exposure, chemical UV light exposure, cancer), and ionizing radiation damage (5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil; cancer). Currently, these and other products of DNA damage cannot be sequenced with existing sequencing methods. In contrast, Single Molecule, Real-Time (SMRT) DNA sequencing can report on modified DNA bases through an analysis of the DNA polymerase kinetics, which are affected by a modified base in the template. We demonstrate the DNA strand-resolved sequencing of more than eight different DNA damage-associated base modifications, with base-pair resolution and single-DNA-molecule sensitivity. We also report on the application of this sequencing capability to biological samples and the development of a generic, open-source algorithm to analyze kinetic information from SMRT sequencing.


June 1, 2021

Complete HIV-1 genomes from single molecules: Diversity estimates in two linked transmission pairs using clustering and mutual information.

We sequenced complete HIV-1 genomes from single molecules using Single Molecule, Real-Time (SMRT) Sequencing and derived de novo full-length genome sequences. SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information. Sequencing was performed in order to better understand the relationships between the specific sequences of transmitted viruses in linked transmission pairs. Samples representing HIV transmission pairs were selected from the Zambia Emory HIV Research Project (Lusaka, Zambia) and sequenced. We examined Single Genome Amplification (SGA) prepped samples and samples containing complex mixtures of genomes. Whole-genome consensus estimates were made for each of the samples. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes, and our clustering methods were applied to these complex mixtures. SMRT Sequencing data contained multiple full-length (greater than 9 kb) continuous reads for each sample. Simple whole-genome consensus estimates easily identified transmission pairs. The clustering of the genome reads showed diversity differences between the samples, allowing us to characterize the diversity of the individual quasi-species comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods segregated reads to their correct originating genomes for the synthetic SGA mixtures. The results open up the potential for reference-agnostic and cost-effective full-genome sequencing of HIV-1.
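
As an illustration of the mutual-information step, the following is a minimal sketch, not the authors' implementation. It assumes the reads are already aligned to a common coordinate system and are all the same length; the function name column_mutual_information and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(reads, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `reads` is a list of equal-length aligned strings (an assumption of
    this sketch); gap characters are treated like any other symbol.
    """
    n = len(reads)
    count_i = Counter(r[i] for r in reads)
    count_j = Counter(r[j] for r in reads)
    count_ij = Counter((r[i], r[j]) for r in reads)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 3 always change together, column 1 never varies.
aligned_reads = ["ACGT", "ACGT", "TCGA", "TCGA"]
linked = column_mutual_information(aligned_reads, 0, 3)
unlinked = column_mutual_information(aligned_reads, 0, 1)
```

In this toy case the two perfectly linked columns carry one bit of mutual information and the invariant column carries none. On real data, sequencing errors and finite read counts inflate the estimate, so a threshold or permutation test would be needed before calling two positions dependent.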


June 1, 2021

Rapid sequencing of HIV-1 genomes as single molecules from simple and complex samples.

Background: To better understand the relationships among HIV-1 viruses in linked transmission pairs, we sequenced several samples representing HIV transmission pairs from the Zambia Emory HIV Research Project (Lusaka, Zambia) using Single Molecule, Real-Time (SMRT) Sequencing. Methods: Single molecules were sequenced as full-length (9.6 kb) amplicons directly from PCR products without shearing. This resulted in multiple, fully-phased, complete HIV-1 genomes for each patient. We examined Single Genome Amplification (SGA) prepped samples, as well as samples containing complex mixtures of genomes. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information, which were used to derive multiple de novo full-length genome sequences for each patient. Whole-genome consensus estimates were made for each sample. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV-1 genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes, and our clustering methods were applied to these complex mixtures. Results: SMRT Sequencing data contained multiple full-length (>9 kb) continuous reads for each sample. Simple whole-genome consensus estimates easily identified transmission pairs. Clustering of genome reads showed diversity differences between samples, allowing characterization of the quasi-species diversity comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods segregated reads to their correct originating genomes for the synthetic SGA mixtures. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. The results open up the potential for reference-agnostic and cost-effective full-genome sequencing of HIV-1.
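
The clustering and counting steps described above can be sketched as follows. The abstract does not specify the distance metric or the clustering procedure, so this example assumes a Hamming distance on equal-length aligned reads and single-linkage clustering at a fixed mismatch threshold; the names hamming, cluster_reads, and the toy reads are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length aligned reads."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, threshold):
    """Single-linkage clustering: reads within `threshold` mismatches are joined.

    Returns clusters as lists of read indices, largest cluster first.
    """
    parent = list(range(len(reads)))

    def find(k):
        # Union-find root lookup with path compression.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if hamming(reads[i], reads[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy data: two underlying genomes, one read carrying a single error.
reads = ["ACGTACGT", "ACGTACGA", "TTGTACCC", "TTGTACCC", "ACGTACGT"]
clusters = cluster_reads(reads, threshold=1)
# Relative abundance by counting reads per cluster.
abundances = [len(c) / len(reads) for c in clusters]
```

The pairwise comparison is quadratic in the number of reads, which is acceptable for a sketch but would need indexing or subsampling for runs with many thousands of full-length reads. The threshold plays the role of the "appropriate thresholds" mentioned in the abstract and has to be chosen per data set.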


June 1, 2021

High-throughput analysis of full-length proviral HIV-1 genomes from PBMCs.

Background: HIV-1 proviruses in peripheral blood mononuclear cells (PBMCs) are thought to be an important reservoir of HIV-1 infection. Given that this pool represents an archival library, it can be used to study virus evolution and CD4+ T cell survival. Accurate study of this pool is hampered by the difficulty of sequencing a full-length proviral genome, which is typically accomplished by assembling overlapping pieces and imputing the full genome. Methodology: Cryopreserved PBMCs collected from a total of 8 HIV+ patients from 1997-2001 were used for genomic DNA extraction. Patients had been receiving cART for 2-8 years at the time samples were obtained. Seven patients had pVL >50 copies/mL (mean: 312,282, range: 18,372-683,400) and 1 had pVL <50. Genomic DNA was subjected to limiting dilution prior to amplification of near-full-length genomes by a newly developed nested PCR. The predicted size of the PCR product was 9.0 kb, spanning from the 5’ LTR through the 3’ LTR. Single molecules were sequenced as near-full-length amplicons directly from PCR products without shearing using commercially available P4-C2 reagents and standard protocols on a PacBio RS II instrument. Quality of the genomes was validated by clonal positive controls and synthetic mixtures. Results: Near-full-length provirus genome sequences were successfully obtained from all 8 patients as continuous long reads from single molecules. PacBio sequencing required approximately 10% of the PCR product needed for Sanger sequencing and generated 325 MB per 3-hour run, including 1,800 full-length intact genome reads on average. One patient’s sample was not at a limiting dilution, and analysis revealed multiple subspecies. For 8 near-full-length provirus genomes derived from the other 7 patients, large internal deletions were noted in 2 proviruses; APOBEC-mediated hypermutations were seen in 2 proviruses; and 4 proviruses appeared to be intact genomes. All of the defective proviruses showed a complete absence of resistance mutations in either RT or protease, even after 2-8 years of cART. In contrast, all of the intact proviruses contained evidence of ART-resistance associated mutations, suggesting that they represented relatively recent variants. Conclusions: Combining a novel protocol for full-length limiting dilution amplification of proviruses with PacBio SMRT sequencing allowed for the generation of near-full-length genomes with good quality and an ability to detect minor variants at the 1-10% level. Preliminary data analyses suggest that defective proviruses may represent archival variants that persist long-term in host cells, while intact proviruses within the PBMC pool showing evidence of active virus replication may represent more recent variants.


June 1, 2021

Data release for polymorphic genome assembly algorithm development.

Heterozygous and highly polymorphic diploid (2n) and higher-ploidy (>2n) genomes have proven to be very difficult to assemble. One key to the successful assembly and phasing of polymorphic genomes is the very long read length (9-40 kb) provided by the PacBio RS II system. We recently released software and methods that facilitate the assembly and phasing of genomes with ploidy levels equal to or greater than 2n. In an effort to collaborate and spur on algorithm development for assembly and phasing of heterozygous polymorphic genomes, we have recently released sequencing datasets that can be used to test and develop highly polymorphic diploid and polyploid assembly and phasing algorithms. These data sets include multiple species and ecotypes of Arabidopsis that can be combined to create synthetic in silico F1 hybrids with varying levels of heterozygosity. Because the sequence of each individual line was generated independently, the data set provides a ‘ground truth’ answer for the expected results, allowing the evaluation of assembly algorithms. The sequencing data, assemblies of inbred and in silico heterozygous samples (ploidy ≥ 2n), and phasing statistics will be presented. The raw and processed data have been made available to aid other groups in the development of phasing and assembly algorithms.
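
One way to build such a synthetic F1 hybrid from two independently sequenced inbred lines is sketched below. This is an illustrative assumption about the pooling procedure, not the released method: it downsamples the deeper data set so that both parents contribute roughly equal numbers of bases, then pools the reads while keeping a parent label as the ground truth. The function name pool_equal_coverage and the toy reads are hypothetical.

```python
import random


def pool_equal_coverage(reads_a, reads_b, seed=0):
    """Pool reads from two parental lines at roughly equal base coverage.

    Returns a shuffled list of (parent_label, read) tuples; the label is
    the ground truth against which phasing results can be scored.
    """
    rng = random.Random(seed)
    # Target base count: the smaller of the two parental data sets.
    target = min(sum(map(len, reads_a)), sum(map(len, reads_b)))

    def downsample(reads):
        shuffled = reads[:]
        rng.shuffle(shuffled)
        kept, total = [], 0
        for read in shuffled:
            if total >= target:
                break
            kept.append(read)
            total += len(read)
        return kept

    pooled = [("A", r) for r in downsample(reads_a)]
    pooled += [("B", r) for r in downsample(reads_b)]
    rng.shuffle(pooled)
    return pooled


# Toy data: parent A was sequenced twice as deeply as parent B.
parent_a_reads = ["A" * 10] * 6
parent_b_reads = ["C" * 10] * 3
hybrid = pool_equal_coverage(parent_a_reads, parent_b_reads)
n_from_a = sum(1 for label, _ in hybrid if label == "A")
n_from_b = sum(1 for label, _ in hybrid if label == "B")
```

Because the parent label travels with each read, an assembler or phasing tool run on the pooled reads can be evaluated by checking how often reads from the two parents end up on the same haplotype.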


June 1, 2021

Sequencing complex mixtures of HIV-1 genomes with single-base resolution.

A large number of distinct HIV-1 genomes can be present in a single clinical sample from a patient chronically infected with HIV-1. We examined samples containing complex mixtures of near-full-length HIV-1 genomes. Single molecules were sequenced as near-full-length (9.6 kb) amplicons directly from PCR products without shearing. Mathematical analysis techniques deconvolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. We correctly estimated the originating genomes to single-base resolution, along with their relative abundances, for mixtures where the truth was known exactly from independent sequencing methods. Correct estimates were made even when genomes diverged by a single base. Minor abundances of 5% were reliably detected. SMRT Sequencing data contained near-full-length continuous reads for each sample, including some runs with greater than 10,000 near-full-length-genome reads in a three-hour collection time. SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. The single-molecule, full-length nature of the sequencing method allows us to estimate variant subspecies and relative abundances even from samples containing complex mixtures of genomes that differ by single bases. These results open the possibility of cost-effective full-genome sequencing of HIV-1 in mixed populations for applications such as incorporated-HIV-1 screening. In screening, genomes can differ by one to many thousands of bases, and the ability to measure them can help scientifically inform treatment strategies.
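
To give a sense of why a 5% minor variant is comfortably resolvable at these read counts, here is a small worked calculation. It is a sketch that considers binomial sampling noise only and deliberately ignores sequencing error, PCR bias, and read misassignment; the function name counting_estimate is hypothetical.

```python
from math import sqrt


def counting_estimate(n_variant_reads, n_total_reads):
    """Relative abundance by counting, with its binomial standard error."""
    p_hat = n_variant_reads / n_total_reads
    std_err = sqrt(p_hat * (1 - p_hat) / n_total_reads)
    return p_hat, std_err


# A 5% variant among 10,000 full-length reads corresponds to about 500 reads.
p_hat, std_err = counting_estimate(500, 10_000)
```

Under these assumptions the standard error is roughly 0.2 percentage points, so sampling noise alone is small relative to a 5% abundance. In practice the other error sources listed above would dominate and have to be characterized with control mixtures, as the abstract describes.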


June 1, 2021

High-accuracy, single-base resolution of near-full-length HIV genomes.

Background: The HIV-1 proviral reservoir is remarkably stable, even during antiretroviral therapy, and is seen as the major barrier to HIV-1 eradication. Identifying and comprehensively characterizing this reservoir will be critical to achieving an HIV cure. Historically, this has been a tedious and labor-intensive process, requiring high-replicate single-genome amplification reactions, or overlapping amplicons that are then reconstructed into full-length genomes by algorithmic imputation. Here, we present a deep sequencing and analysis method able to determine the exact identity and relative abundances of near-full-length HIV genomes from samples containing mixtures of genomes without shearing or complex bioinformatic reconstruction. Methods: We generated clonal near-full-length (~9 kb) amplicons derived from single genome amplification (SGA) of primary proviral isolates or PCR of well-documented control strains. These clonal products were mixed at various abundances and sequenced as near-full-length (~9 kb) amplicons without shearing. Each mixture yielded many near-full-length HIV-1 reads. Mathematical analysis techniques resolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. Results: Single Molecule, Real-Time (SMRT) Sequencing data contained near-full-length (~9 kb) continuous reads for each sample, including some runs with greater than 10,000 near-full-length-genome reads in a three-hour sequencing run. Our methods exactly recapitulated the originating genomes at single-base resolution, along with their relative abundances, in both mixtures of clonal controls and SGAs, and these results were validated using independent sequencing methods. Correct resolution was achieved even when genomes differed only by a single base. Minor abundances of 5% were reliably detected. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. The single-molecule, full-length nature of this sequencing method allows us to estimate variant subspecies and relative abundances with single-nucleotide resolution. This method allows for reference-agnostic and cost-effective full-genome sequencing of HIV-1, which could both further our understanding of latent infection and support the development of novel and improved tools for quantifying HIV provirus, which will be critical to curing HIV.


June 1, 2021

MaSuRCA Mega-Reads Assembly Technique for haplotype resolved genome assembly of hybrid PacBio and Illumina Data

Developments in DNA sequencing technology over the past several years have enabled a large number of scientists to obtain sequences for the genomes of their interest at a fairly low cost. Illumina sequencing has been the dominant whole-genome sequencing technology over the past few years due to its low cost. However, Illumina reads are short (up to 300 bp), so most draft genomes produced from Illumina data are highly fragmented, which limits their usability in practical scenarios. Longer reads are needed for more contiguous genomes. Recently, PacBio sequencing has made significant advances in cost-effective long-read (>10,000 bp) sequencing technology, and its data, although several times more expensive than Illumina data, can be used to produce high-quality genomes. PacBio data can be used for de novo assembly; however, because of its high error rate, high coverage of the genome is required, which raises the cost barrier. A solution for cost-effective genomes is to combine PacBio and Illumina data, leveraging the low error rate of the short Illumina reads and the length of the PacBio reads. We have developed the MaSuRCA mega-reads assembler for efficient assembly of hybrid data sets, and we demonstrate that it performs well compared to other published hybrid techniques. Another important benefit of long reads is their ability to link haplotype differences. The mega-reads approach corrects each PacBio read independently, and thus haplotype differences are preserved. Thus, leveraging the accuracy of the Illumina data and the length of the PacBio reads, MaSuRCA mega-reads can produce haplotype-resolved genome assemblies, in which each contig contains sequence from a single haplotype. We present preliminary results on haplotype-resolved genome assemblies of faux (proof-of-concept) and real data.


June 1, 2021

An improved circular consensus algorithm with an application to detection of HIV-1 Drug-Resistance Associated Mutations (DRAMs)

Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm that allows the generation of circular consensus sequences (CCS) from individual SMRT Sequencing reads. With this new algorithm, dubbed CCS2, it is possible to reach arbitrarily high quality across longer insert lengths at a lower cost and higher throughput than Sanger sequencing. We apply CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation, which is both clinically important and challenging to characterize because of its regional sequence context. A mutation was introduced into the 3rd position of amino acid position 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1,300 bp were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Sequencing data were analyzed using the new CCS2 algorithm, which uses a fully-generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to arbitrarily high accuracy. This result, previously demonstrated for multi-molecule consensus sequences with the Quiver algorithm, is made possible by incorporating per-Zero Mode Waveguide (ZMW) characteristics, thus accounting for the intrinsic changes in the sequencing process that are unique to each ZMW. With CCS2, we are able to achieve a per-read empirical quality of QV30 with 19X coverage. This yields ~5,000 1.3 kb consensus sequences with a collective empirical quality of ~QV40. Additionally, we demonstrate a 0% miscall rate in both unmixed samples, and estimate a 48:52 frequency for the K103N mutation in the mixed sample, consistent with data produced by orthogonal platforms.


June 1, 2021

An improved circular consensus algorithm with an application to detect HIV-1 Drug Resistance Associated Mutations (DRAMs)

Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm that allows the generation of circular consensus sequences (CCS) from individual SMRT Sequencing reads. With this new algorithm, dubbed CCS2, it is possible to reach high quality across longer insert lengths at a lower cost and higher throughput than Sanger sequencing. We applied CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation in both clonal and patient samples. This particular DRAM has previously proved to be clinically relevant, but challenging to characterize due to regional sequence context. First, a mutation was introduced into the 3rd position of amino acid position 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1.3 kb were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Additionally, the proviral reservoir of a subject with known dates of virologic failure of an Efavirenz-based regimen and with documented emergence of drug-resistant (K103N) viremia was sequenced at several time points as a proof-of-concept study to determine the kinetics of retention and decay of K103N. Sequencing data were analyzed using the new CCS2 algorithm, which uses a fully-generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to high accuracy. With CCS2, we are able to achieve a per-read empirical quality of QV30 (99.9% accuracy) at 19X coverage. A total of ~5,000 1.3 kb consensus sequences with a collective empirical quality of ~QV40 (99.99%) were obtained for each sample. We demonstrate a 0% miscall rate in both unmixed control samples, and estimate a 48:52 frequency for the K103N mutation in the mixed (50:50) plasmid sample, consistent with data produced by orthogonal platforms. Additionally, the K103N escape variant was detected (19%) only in proviral samples from time points subsequent to the emergence of drug-resistant viremia. This tool might be used to monitor the HIV reservoir for stable evolutionary changes throughout infection.
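
The quality values quoted above follow the standard Phred convention, QV = -10 · log10(error probability). The short sketch below shows the conversion in both directions; the function names are hypothetical and the code is independent of CCS2 itself.

```python
from math import log10


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10)


def accuracy_to_qv(accuracy):
    """Convert per-base accuracy (strictly below 1.0) to a Phred-scaled quality value."""
    return -10 * log10(1.0 - accuracy)


acc_qv30 = qv_to_accuracy(30)   # one error per 1,000 bases
acc_qv40 = qv_to_accuracy(40)   # one error per 10,000 bases
qv_back = accuracy_to_qv(0.9999)
```

On this scale every 10-point increase in QV is a tenfold reduction in error rate, which is why QV30 and QV40 correspond to 99.9% and 99.99% accuracy respectively.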


June 1, 2021

T-cell receptor profiling using PacBio sequencing of SMARTer libraries

T-cells play a central part in the immune response in humans and related species. T-cell receptors (TCRs), heterodimers located on the T-cell surface, specifically bind foreign antigens displayed on the MHC complex of antigen-presenting cells. The wide spectrum of potential antigens is addressed by the diversity of TCRs created by V(D)J recombination. Profiling this repertoire of TCRs could be useful for applications including, but not limited to, diagnosis, monitoring response to treatments, and examining T-cell development and diversification.


June 1, 2021

Applying Sequel to Genomic Datasets

De novo assembly is a large part of JGI’s analysis portfolio. Repetitive DNA sequences are abundant in a wide range of the organisms we sequence and pose a significant technical challenge for assembly. We are interested in long-read technologies capable of spanning genomic repeats to produce better assemblies. We currently have three RS II and two Sequel PacBio machines. RS II machines are primarily used for fungal and microbial genome assembly as well as synthetic biology validation. Between microbes and fungi, we produce hundreds of PacBio libraries a year, and for throughput reasons the vast majority of these are >10 kb AMPure libraries. Throughput for RS II is about 1 Gb per SMRT Cell. This is ideal for microbial-sized genomes but can be costly and labor-intensive for larger projects, which require multiple cells. JGI was an early-access site for Sequel and began testing with real samples in January 2016. Since then, we have had the opportunity to sequence microbes, fungi, metagenomes, and plants. Here we present our experience over the last 18 months using the Sequel platform and provide comparisons with RS II results.


June 1, 2021

Scalability and reliability improvements to the Iso-Seq analysis pipeline enable higher throughput sequencing of full-length cancer transcripts

The characterization of gene expression profiles via transcriptome sequencing has proven to be an important tool for characterizing how genomic rearrangements in cancer affect the biological pathways involved in cancer progression and treatment response. More recently, better resolution of transcript isoforms has shown that this additional level of information may be useful in stratifying patients into cancer subtypes with different outcomes and responses to treatment [1]. The Iso-Seq protocol developed at PacBio is uniquely able to deliver full-length, high-quality cDNA sequences, allowing the unambiguous determination of splice variants, identifying potential biomarkers and yielding new insights into gene fusion events. Recent improvements to the Iso-Seq bioinformatics pipeline increase the speed and scalability of data analysis while boosting the reliability of isoform detection and cross-platform usability. Here we report an evaluation of Sequel Iso-Seq runs of human UHRR samples with spiked-in synthetic RNA controls and show that the new pipeline is more CPU efficient and recovers more human and synthetic isoforms while reducing the number of false positives. We also share the results of sequencing the well-characterized HCC-1954 breast cancer and normal breast cell lines, which will be made publicly available. Combined with the recent simplification of the Iso-Seq sample preparation [2], the new analysis pipeline completes a streamlined workflow for revealing the most comprehensive picture of transcriptomes at the throughput needed to characterize cancer samples.

