June 1, 2021  |  

Profiling the microbiome in fecal microbiota transplantation using circular consensus and Single Molecule, Real-Time Sequencing

There are many sequencing-based approaches to understanding complex metagenomic communities, spanning targeted amplification to whole-sample shotgun sequencing. While targeted approaches provide valuable data at low sequencing depth, they are limited by primer design and PCR. Whole-sample shotgun experiments generally use short-read sequencing, which results in data processing difficulties. For example, reads less than 500 bp in length will rarely cover a complete gene or region of interest, and will require assembly. This not only introduces the possibility of incorrectly combining sequence from different community members, it also requires a high depth of coverage. As such, rare community members may not be represented in the resulting assembly. Circular-consensus, single molecule, real-time (SMRT®) Sequencing reads in the 1-3 kb range with >99% accuracy can be efficiently generated from low amounts of input DNA. For example, 10 ng of input DNA sequenced in 4 SMRT Cells on the PacBio RS II would generate >100,000 such reads. While throughput is lower compared to short-read sequencing methods, the reads are a true random sampling of the underlying community since SMRT Sequencing has been shown to have very low sequence-context bias. With reads >1 kb at >99% accuracy, it is reasonable to expect that a high percentage of reads include gene fragments useful for analysis without the need for de novo assembly. Here we present the results of circular consensus sequencing for an individual’s microbiome, before and after undergoing fecal microbiota transplantation (FMT) in order to treat a chronic Clostridium difficile infection. We show that even with relatively low sequencing depth, the long-read, assembly-free, random sampling allows us to profile low abundance community members at the species level. 
We also show that using shotgun sampling with long reads allows a level of functional insight not possible with classic targeted 16S, or short read sequencing, due to entire genes being covered in single reads.


June 1, 2021  |  

Low-input long-read sequencing for complete microbial genomes and metagenomic community analysis

Microbial genome sequencing can be done quickly, easily, and efficiently with the PacBio sequencing instruments, resulting in complete de novo assemblies. Alternative protocols have been developed to reduce the amount of purified DNA required for SMRT Sequencing, to broaden applicability to lower-abundance samples. If 50-100 ng of microbial DNA is available, a 10-20 kb SMRTbell library can be made. The resulting library can be loaded onto multiple SMRT Cells, yielding more than enough data for complete assembly of microbial genomes using the SMRT Portal assembly program HGAP, plus base modification analysis. The entire process can be done in less than 3 days by standard laboratory personnel. This approach is particularly important for analysis of metagenomic communities, in which genomic DNA is often limited. From these samples, full-length 16S amplicons can be generated, prepped with the standard SMRTbell library prep protocol, and sequenced. Alternatively, a 2 kb sheared library, made from a few ng of input DNA, can also be used to elucidate the microbial composition of a community, and may provide information about biochemical pathways present in the sample. In both these cases, 1-2 kb reads with >99.9% accuracy can be obtained from Circular Consensus Sequencing.


June 1, 2021  |  

Minimization of chimera formation and substitution errors in full-length 16S PCR amplification

The constituents and intra-communal interactions of microbial populations have garnered increasing interest in areas such as water remediation, agriculture and human health. One popular, efficient method of profiling communities is to amplify and sequence the evolutionarily conserved 16S rRNA sequence. Currently, most targeted amplification focuses on short, hypervariable regions of the 16S sequence. Distinguishing information not spanned by the targeted region is lost and species-level classification is often not possible. SMRT Sequencing easily spans the entire 1.5 kb 16S gene, and in combination with highly-accurate single-molecule sequences, can improve the identification of individual species in a metapopulation. However, when amplifying a mixture of sequences with close similarities, the products may contain chimeras, or recombinant molecules, at rates as high as 20-30%. These PCR artifacts make it difficult to identify novel species, and reduce the amount of productive sequences. We investigated multiple factors that have been hypothesized to contribute to chimera formation, such as template damage, denaturing time before and during cycling, polymerase extension time, and reaction volume. Of the factors tested, we found two major related contributors to chimera formation: the amount of input template into the PCR reaction and the number of PCR cycles. Sequence errors generated during amplification and sequencing can also confound the analysis of complex populations. Circular Consensus Sequencing (CCS) can generate single-molecule reads with >99% accuracy, and the SMRT Analysis software provides filtering of these reads to >99.99% accuracies. Remaining substitution errors in these highly-filtered reads are likely dominated by mis-incorporations during amplification. Therefore, we compared the impact of several commercially-available high-fidelity PCR kits with full-length 16S amplification. 
We show results of our experiments and describe an optimized protocol for full-length 16S amplification for SMRT Sequencing. These optimizations have broader implications for other applications that use PCR amplification to phase variations across targeted regions and to generate highly accurate reference sequences.


June 1, 2021  |  

SMRT Sequencing for the detection of low-frequency somatic variants

The sensitivity, speed, and reduced cost associated with Next-Generation Sequencing (NGS) technologies have made them indispensable for the molecular profiling of cancer samples. For effective use, it is critical that the NGS methods used are not only robust but can also accurately detect low-frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standard from Horizon Diagnostics, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. The Horizon Diagnostics reference sample contains precise allelic frequencies from 1% to 24.5% for major oncology targets verified using digital PCR. It recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot MASTR Plus assay that targets 252 amplicons (121-254 bp) from 26 relevant cancer genes, which includes all 11 variants in the control sample. The amplicons were sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high-accuracy CCS reads make it possible to accurately detect low-frequency somatic variants.


June 1, 2021  |  

Long-read assembly of the Aedes aegypti Aag2 cell line genome resolves ancient endogenous viral elements

Transmission of arboviruses such as Dengue and Zika viruses by Aedes aegypti causes widespread and debilitating disease across the globe. Disease in humans can include severe acute symptoms such as hemorrhagic fever, organ failure, and encephalitis; and yet, mosquitoes tolerate high titers of virus in a persistent infection. The mechanisms responsible for tolerance to viral infection in mosquitoes are still unclear. Recent publications have highlighted the integration of genetic material from non-retroviral RNA viruses into the genome of the host during infection that relies upon endogenous retro-transcriptase activity from transposons. These endogenous viral elements (EVEs) found in the genome are predicted to be ancient and at least some EVEs are under purifying selection, which suggests that they are beneficial to the host. In order to characterize EVE biogenesis in a tractable system, we sequenced the Ae. aegypti cell line, Aag2, to 58X coverage and here present a de novo assembly of the genome. The assembly consists of 1.7 Gb of genomic and 255 Mb of alternative haplotype-specific sequence, made up of contigs with an N50 of 1.4 Mb; a value that, when compared with other assemblies of the Aedes genus, is 1 to 3 orders of magnitude longer. The Aag2 genome is highly repetitive (70%), most of which is classified as transposable elements (60%). We identify a plethora of EVEs in the genome homologous to a diverse range of extant viruses, many of which cluster in these regions of highly repetitive DNA. The highly contiguous nature of this assembly allows for a more comprehensive identification of the transposable elements and EVEs that are most likely to be lost in assemblies lacking the read length of SMRT Sequencing.
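
For readers unfamiliar with the contig N50 statistic quoted in this abstract, the following is a minimal Python sketch of how it is commonly computed: the contig length at which the cumulative sum of lengths, sorted longest first, reaches half of the total assembly size. The function name and the toy lengths are illustrative assumptions and are not taken from the Aag2 assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not real assembly data).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```

A higher N50 means that more of the assembly sits in long contigs, which is why the statistic is used here to compare contiguity between assemblies.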


June 1, 2021  |  

Application specific barcoding strategies for SMRT Sequencing

Over the last few years, several advances were implemented in the PacBio RS II System to maximize throughput and efficiency while reducing the cost per sample. The number of usable bases per SMRT Cell now exceeds 1 Gb with the latest P6-C4 chemistry and 6-hour movies. For applications such as microbial sequencing, targeted sequencing, Iso-Seq (full-length isoform sequencing) and NimbleGen’s target enrichment method, current SMRT Cell yields may exceed project requirements. In these cases, barcoding is a viable option for multiplexing samples. For microbial sequencing, multiplexing can be accomplished by tagging sheared genomic DNA during library construction with modified SMRTbell adapters. We studied the performance of 2- to 8-plex microbial sequencing. For full-length amplicon sequencing such as HLA typing, amplicons as large as 5 kb may be barcoded during amplification using barcoded locus-specific primers. Alternatively, amplicons may be barcoded during SMRTbell library construction using barcoded SMRTbell adapters. The preferred barcoding strategy depends on the user’s existing workflow and the flexibility to change or update it. Using barcoded adapters, five Class I and II genes (3.3 – 5.8 kb) x 96 patients can be multiplexed and typed. For Iso-Seq full-length cDNA sequencing, barcodes are incorporated during 1st-strand synthesis by tailing the oligo-dT primer with any of the PacBio published 16-bp barcode sequences. RNA samples from 6 maize tissues were multiplexed to generate barcoded cDNA libraries. The NimbleGen SeqCap Target Enrichment method, combined with PacBio’s long-read sequencing, provides a comprehensive view of multi-kilobase contiguous regions, covering both exonic and intronic sequence. To make this cost effective, we recommend barcoding samples for pooling prior to target enrichment and capture. 
Here, we present specific examples of strategies and best practices for multiplexing samples for different applications for SMRT Sequencing. Additionally, we describe recommendations for analyzing barcoded samples.
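
As a rough illustration of what analyzing barcoded samples involves, the sketch below assigns a read to a sample by comparing the first 16 bases of the read with each barcode and accepting the closest match if it is within a mismatch budget. This is a simplified assumption-laden example, not the PacBio analysis workflow: the barcode sequences and names are hypothetical, the mismatch threshold is arbitrary, and real demultiplexing also has to handle insertions and deletions, barcodes at both ends of the insert, and reverse-complemented reads.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=2):
    """Return the name of the best-matching barcode at the read start, or None.

    Reads that are too short, too divergent, or tied between two barcodes
    are left unassigned.
    """
    best_name, best_dist, tie = None, None, False
    for name, seq in barcodes.items():
        prefix = read[:len(seq)]
        if len(prefix) < len(seq):
            continue
        dist = hamming(prefix, seq)
        if best_dist is None or dist < best_dist:
            best_name, best_dist, tie = name, dist, False
        elif dist == best_dist:
            tie = True
    if best_dist is None or best_dist > max_mismatches or tie:
        return None
    return best_name


# Hypothetical 16-bp barcodes (not from the PacBio published set).
barcodes = {
    "sample_1": "ACGTACGTACGTACGT",
    "sample_2": "TTGGCCAATTGGCCAA",
}
read = "ACGTACGTACGTACGA" + "GGATCCGGATCC"  # one mismatch vs. sample_1
print(assign_barcode(read, barcodes))
```

Leaving ambiguous reads unassigned trades a little yield for fewer sample mix-ups, which matters most for applications such as HLA typing where each read should map to one patient.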


June 1, 2021  |  

Immune regions are no longer incomprehensible with SMRT Sequencing

The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LD). This level of complexity and the inherent biases of short-read sequencing make it challenging to extract immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software to extract and/or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of the genes within them. All the genotype information is detected at allele-level resolution with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key requirement for interrogation without prior knowledge of ethnic background. These methods are thus easily adoptable for previously uncharacterized human or non-human species.


June 1, 2021  |  

An improved circular consensus algorithm with an application to detection of HIV-1 Drug-Resistance Associated Mutations (DRAMs)

Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger Sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm that allows the generation of circular consensus sequences (CCS) from individual SMRT Sequencing reads. With this new algorithm, dubbed CCS2, it is possible to reach arbitrarily high quality across longer insert lengths at a lower cost and higher throughput than Sanger Sequencing. We apply CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation, which is both important clinically and challenging due to regional sequence context. A mutation was introduced into the 3rd nucleotide position of codon 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1,300 bp were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Sequencing data were analyzed using the new CCS2 algorithm, which uses a fully-generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to arbitrarily high accuracy. This result, previously demonstrated for multi-molecule consensus sequences with the Quiver algorithm, is made possible by incorporating per-Zero Mode Waveguide (ZMW) characteristics, thus accounting for the intrinsic changes in the sequencing process that are unique to each ZMW. With CCS2, we are able to achieve a per-read empirical quality of QV30 with 19X coverage. This yields ~5000 1.3 kb consensus sequences with a collective empirical quality of ~QV40. Additionally, we demonstrate a 0% miscall rate in both unmixed samples, and estimate a 48:52% frequency for the K103N mutation in the mixed sample, consistent with data produced by orthogonal platforms.
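
To make the quality values in this abstract easier to interpret, here is a short Python sketch that converts between QV and error probability. It assumes the QV figures follow the standard Phred convention, QV = -10 * log10(error probability); the function names are illustrative and are not part of CCS2 or any PacBio software.

```python
import math


def qv_to_error_prob(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_prob_to_qv(p):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p)


# Expected number of errors in a ~1,300 bp consensus at each quality level.
for qv in (30, 40):
    p = qv_to_error_prob(qv)
    print(f"QV{qv}: error probability {p:.4g}, "
          f"expected errors in 1,300 bp = {1300 * p:.2f}")
```

Under this convention, QV30 corresponds to 99.9% accuracy and QV40 to 99.99% accuracy, so each step of 10 in QV is a tenfold reduction in errors.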


June 1, 2021  |  

Reconstruction of the spinach coding genome using full-length transcriptome without a reference genome

For highly complex and large genomes, producing a well-annotated genome may be computationally challenging and costly, yet the study of alternative splicing events and gene annotations usually relies on the existence of a genome. Long-read sequencing technology provides new opportunities to sequence full-length cDNAs, avoiding the computational challenges that short-read transcript assembly brings. The use of single molecule, real-time sequencing from PacBio to sequence transcriptomes (the Iso-Seq method), which produces de novo, high-quality, full-length transcripts, has revealed an astonishing amount of alternative splicing in eukaryotic species. With the Iso-Seq method, it is now possible to reconstruct the transcribed regions of the genome using just the transcripts themselves. We present Cogent, a tool for finding gene families and reconstructing the coding genome in the absence of a high-quality reference genome. Cogent uses k-mer similarities to first partition the transcripts into different gene families. Then, for each gene family, the transcripts are used to build a splice graph. Cogent identifies bubbles resulting from sequencing errors, minor variants, and exon skipping events, and attempts to resolve each splice graph down to the minimal set of reconstructed contigs. We apply Cogent to the Iso-Seq data for spinach, Spinacia oleracea, for which there is also a PacBio-based draft genome to validate the reconstruction. The Iso-Seq dataset consists of 68,263 full-length, Quiver-polished transcript sequences ranging from 528 bp to 6 kbp long (mean: 2.1 kbp). Using the genome mapping as ground truth, we found that 95% (8045/8446) of the Cogent gene families corresponded to a single genomic locus. Families that contained multiple loci were often homologous genes that would be categorized as belonging to the same gene family. Coding genome reconstruction was then performed individually for each gene family. 
A total of 86% (7283/8446) of the gene families were resolved to a single contig by Cogent, and were validated to also be a single contig in the genome. In 59 cases, Cogent reconstructed a single contig that corresponded to 2 or more loci in the genome, suggesting possible scaffolding opportunities. In 24 cases, the transcripts had no hits to the genome, though Pfam and BLAST searches of the transcripts show that they were indeed coding, suggesting that the genome is missing certain coding portions. Given the high quality of the spinach genome, we were not surprised to find that Cogent only slightly improved the genome space. However, the ability of Cogent to accurately identify gene families and reconstruct the coding genome in a de novo fashion shows that it will be extremely powerful when applied to datasets for which there is no reference genome or only a low-quality one.
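
The first step described in this abstract, partitioning transcripts into gene families by k-mer similarity, can be illustrated with a small Python sketch. This is not Cogent's implementation: it assumes Jaccard similarity between k-mer sets as the similarity measure and single-linkage grouping via union-find, and the values of k, the threshold, and the toy sequences are all hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def partition_by_kmers(transcripts, k=8, min_similarity=0.3):
    """Group transcript IDs whose k-mer sets overlap above a threshold."""
    parent = {name: name for name in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kmers = {name: kmer_set(seq, k) for name, seq in transcripts.items()}
    for a, b in combinations(transcripts, 2):
        if jaccard(kmers[a], kmers[b]) >= min_similarity:
            parent[find(a)] = find(b)

    families = {}
    for name in transcripts:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())


# Toy transcripts: t2 is t1 with an extra 3' segment, t3 is unrelated.
toy = {
    "t1": "ACGTTGCAAGGCTTAACCGGTATCG",
    "t2": "ACGTTGCAAGGCTTAACCGGTATCGACACACAC",
    "t3": "TTTTTTTTTTGGGGGGGGGGCCCCC",
}
print(partition_by_kmers(toy))
```

The all-pairs comparison is quadratic in the number of transcripts, so a dataset of tens of thousands of sequences would need an indexed or sketch-based comparison instead; the example is only meant to show the grouping idea.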


June 1, 2021  |  

Targeted sequencing and chromosomal haplotype assembly using TLA and SMRT Sequencing

Despite the increasing availability of whole-genome sequencing, haplotype reconstruction of individual genomes, or haplotype assembly, remains unsolved. Like the de novo genome assembly problem, haplotype assembly is greatly simplified by having more long-range information. The Targeted Locus Amplification (TLA) technology from Cergentis has the unique capability of targeting a specific region of the genome using a single primer pair and yielding ~2 kb DNA circles that are composed of ~500 bp fragments. Fragments from the same circle come from the same haplotype and follow an exponential decay in distance from the target region, with a span that reaches the multi-megabase range. Here, we apply TLA to the BRCA1 gene on NA12878 and then sequence the resulting 2 kb circles on a PacBio RS II. The multiple fragments per circle were iteratively mapped to hg19 and then haplotype assembled using HAPCUT. We show that the 80 kb length of BRCA1 is represented by a single haplotype block, which was validated against GIAB data. We then explored chromosomal-scale haplotype assembly by combining these data with whole genome shotgun PacBio long reads, and demonstrate haplotype blocks approaching the length of chromosome 17, on which BRCA1 lies. Finally, by performing TLA without the amplification step and size selecting for reads >5 kb to maximize the number of fragments per read, we target whole genome haplotype assembly across all chromosomes.


June 1, 2021  |  

Multiplexing strategies for microbial whole genome SMRT Sequencing

The increased throughput of the RS II and Sequel Systems enables multiple microbes to be sequenced on a single SMRT Cell. This multiplexing can be readily achieved by incorporating a unique barcode for each microbe into the SMRTbell adapters after shearing genomic DNA, using a streamlined library construction process. Incorporating a barcode without the requirement for PCR amplification prevents the loss of epigenetic information (e.g., methylation signatures) and the generation of chimeric sequences, while the modified protocol eliminates the need to build several individual SMRTbell libraries. We multiplexed up to 8 unique strains of H. pylori. Each strain was sheared and processed through adapter ligation in a single, addition-only reaction. The barcoded strains were then pooled in equimolar quantities and processed through the remainder of the library preparation and purification steps. We demonstrate successful de novo microbial assembly and epigenetic analysis from all multiplexes (2- through 8-plex) using standard tools within SMRT Link Analysis, with data generated from a single SMRTbell library run on a single SMRT Cell. This process facilitates the sequencing of multiple microbial genomes in a single day, greatly increasing throughput and reducing costs per genome assembly.


June 1, 2021  |  

Enrichment of unamplified DNA and long-read SMRT Sequencing in unlocking the underlying biological disease mechanisms of repeat expansion disorders

For many of the repeat expansion disorders, the disease gene has been discovered; however, the underlying biological mechanisms have not yet been fully understood. This is mainly due to technological limitations that do not allow for the needed base-pair resolution of the long, repetitive genomic regions. We have developed a novel, amplification-free enrichment technique that uses the CRISPR/Cas9 system to target large repeat expansions. This method, in conjunction with PacBio’s long reads and uniform coverage, enables sequencing of these complex genomic regions. By using a PCR-free enrichment method, we are able to access not only the repetitive elements and interruption sequences accurately, but also the epigenetic information.


June 1, 2021  |  

SMRT Sequencing of DNA and RNA samples extracted from formalin-fixed and paraffin-embedded tissues

Recent advances in next-generation sequencing have led to the increased use of formalin-fixed and paraffin-embedded (FFPE) tissues for medical samples in disease and scientific research. Single Molecule, Real-Time (SMRT) Sequencing offers a unique advantage in that it allows direct analysis of FFPE samples without amplification. However, obtaining ample long-read information from FFPE samples has been a challenge due to the quality and quantity of the extracted DNA. DNA samples extracted from FFPE tissues often contain damaged sites, including breaks in the backbone and missing or altered nucleotide bases, which directly impact sequencing and amplification. Additionally, the quality and quantity of the recovered DNA also vary depending on the extraction methods used. We have evaluated the Adaptive Focused Acoustics (AFA™) system by Covaris as a method for obtaining high molecular weight DNA suitable for SMRTbell template preparation and subsequent single molecule sequencing. Using this method, genomic DNA was extracted from normal kidney FFPE scrolls acquired from the Cooperative Human Tissue Network (CHTN), University of Pennsylvania. Damaged sites present in the extracted DNA were repaired using a DNA Damage Repair step, and the treated DNA was constructed into SMRTbell libraries suitable for sequencing on the PacBio RS II System. Using the same repaired DNA, we also tested PCR efficiency of target gene regions of up to 5 kb. The resulting amplicons were constructed into SMRTbell templates for full-length sequencing on the PacBio RS II System. We found the Adaptive Focused Acoustics (AFA) system combined with truXTRAC™ by Covaris to be effective and efficient. This system is simple to use, and the resulting DNA is compatible with SMRTbell library preparation for targeted and whole genome SMRT Sequencing. The data presented here demonstrate single molecule sequencing of DNA samples extracted from FFPE tissues.


June 1, 2021  |  

Highly sensitive and cost-effective detection of somatic cancer variants using single-molecule, real-time sequencing

Next-Generation Sequencing (NGS) technologies allow for molecular profiling of cancer samples with high sensitivity and speed at reduced cost. For efficient profiling of cancer samples, it is important that the NGS methods used are not only robust, but capable of accurately detecting low-frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays for studying cancer variants. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standards from Horizon Discovery, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. First, we sequenced a reference standard containing precise allelic frequencies from 1% to 24.5% for major oncology targets verified using digital PCR. This reference material recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot MASTR Plus assay that targets 252 amplicons (121-254 bp) from 26 relevant cancer genes, which includes all 11 variants in the control sample. Next, we sequenced control samples prepared by SeraCare Life Sciences, which contained a defined mutation at allelic frequencies from 10% down to 0.1%. The wild type and mutant amplicons were serially diluted, sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high-accuracy CCS reads make it possible to accurately detect low-frequency somatic variants.
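
The closing claim, that random errors and high-accuracy reads make low-frequency variants detectable, can be illustrated with a simple binomial calculation. The Python sketch below asks how likely it is that sequencing errors alone would produce at least a given number of variant-supporting reads at one position. It is a simplified model under stated assumptions, not the analysis used in this work: errors are treated as independent between reads, and the error rate, read depth, and read counts are hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X < k).

    Suitable for small k; for very small tail probabilities the
    subtraction loses precision, so the result is clamped at 0.
    """
    below = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return max(0.0, 1.0 - below)


# Hypothetical numbers: 10,000 CCS reads over a site, a per-base miscall
# rate of 1e-4 (QV40 under the Phred convention), and a variant present
# at 0.1%, which would contribute about 10 supporting reads.
depth = 10_000
error_rate = 1e-4
observed_variant_reads = 10

p_by_chance = binomial_tail(depth, observed_variant_reads, error_rate)
print(f"P(>= {observed_variant_reads} error reads by chance) = {p_by_chance:.3g}")
```

With these assumed numbers only about one erroneous read is expected at the site, so ten supporting reads are unlikely to arise from random error alone; a systematic, context-dependent error would break the independence assumption and weaken this argument.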


June 1, 2021  |  

Candidate gene screening using long-read sequencing

We have developed several candidate gene screening applications for both neuromuscular and neurological disorders. The power behind these applications comes from the use of long-read sequencing, which allows us to access previously unresolvable and even unsequenceable genomic regions. SMRT Sequencing offers uniform coverage, a lack of sequence context bias, and very high accuracy. It is also possible to directly detect epigenetic signatures and characterize full-length gene transcripts through assembly-free isoform sequencing. In addition to calling the bases, SMRT Sequencing uses the kinetic information from each nucleotide to distinguish between modified and native bases.

