Alleles of the FMR1 gene with more than 200 CGG repeats generally undergo methylation-coupled gene silencing, resulting in fragile X syndrome, the leading heritable form of cognitive impairment. Smaller expansions (55-200 CGG repeats) result in elevated levels of FMR1 mRNA, which is directly responsible for the late-onset neurodegenerative disorder, fragile X-associated tremor/ataxia syndrome (FXTAS). For mechanistic studies and genetic counseling, it is important to know with precision the number of CGG repeats; however, no existing DNA sequencing method is capable of sequencing through more than ~100 CGG repeats, thus limiting the ability to precisely characterize the disease-causing alleles. The recent development of single molecule, real-time sequencing represents a novel approach to DNA sequencing that couples the intrinsic processivity of DNA polymerase with the ability to read polymerase activity on a single-molecule basis. Further, the accuracy of the method is improved through the use of circular templates, such that each molecule can be read multiple times to produce a circular consensus sequence (CCS). We have succeeded in generating CCS reads representing multiple passes through both strands of repeat tracts exceeding 700 CGGs (>2 kb of 100 percent CG) flanked by native FMR1 sequence, with single-molecule readlengths exceeding 12 kb. This sequencing approach thus enables us to fully characterize the previously intractable CGG-repeat sequence, leading to a better understanding of the distinct associated molecular pathologies. Real-time kinetic data also provides insight into the activity of DNA polymerase inside this unique sequence. The methodology should be widely applicable for studies of the molecular pathogenesis of an increasing number of repeat expansion-associated neurodegenerative and neurodevelopmental disorders, and for the efficient identification of such disorders in the clinical setting.
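To make the repeat-size classes above concrete, the following is a minimal sketch of how the CGG-repeat number might be counted from a single consensus read and mapped to the ranges quoted in this abstract. The function names and the toy read are illustrative assumptions, not the authors' pipeline; the sketch counts only the longest uninterrupted CGG run, so any interruption inside a tract would split the run.

```python
import re

def longest_cgg_tract(seq: str) -> int:
    """Return the repeat count of the longest uninterrupted CGG run in seq."""
    runs = re.findall(r"(?:CGG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)

def allele_class(n_repeats: int) -> str:
    """Classify by the ranges quoted in the abstract (>200; 55-200)."""
    if n_repeats > 200:
        return "full mutation"
    if n_repeats >= 55:
        return "premutation"
    return "below premutation range"

# Hypothetical read: 60 CGG repeats between short made-up flanks.
read = "GCGT" + "CGG" * 60 + "CTGGGC"
n = longest_cgg_tract(read)
print(n, allele_class(n))  # 60 premutation
```

In practice the count would be taken from a CCS read anchored on the native FMR1 flanking sequence rather than from a raw string search.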
In today’s clinical diagnostic laboratories, disease-causing mutations are detected either by genotyping or by Sanger sequencing. Whether done singly or in a multiplex assay, genotyping works only if the exact molecular change is known. Sanger sequencing is the gold-standard method that captures both known and novel molecular changes in the disease gene of interest. Most clinical Sanger sequencing assays involve PCR-amplifying the coding sequences of the target disease gene, followed by bidirectional sequencing of the amplified products. Therefore, for every patient sample, multiple amplicons are generated individually, and each amplicon leads to two separate sequencing reactions. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over Sanger sequencing, including long read lengths, first-in-first-out processing, fast time to result, high levels of multiplexing, and substantially reduced costs. For our first proof-of-concept experiment, we queried 3 known disease-associated mutations in de-identified clinical samples. We started with 3 autosomal recessive diseases found at an increased frequency in the Ashkenazi Jewish population: Tay-Sachs disease, Niemann-Pick disease, and Canavan disease. The mutated gene in Tay-Sachs disease is HEXA, in Niemann-Pick disease SMPD1, and in Canavan disease ASPA. Coding exons were amplified in multiple (6-13) amplicons for each gene from both non-carriers and carriers. Amplicons were purified, normalized by concentration, and combined prior to SMRTbell™ library preparation. A single SMRTbell library was sequenced for each gene from each patient using standard Pacific Biosciences C2 chemistry and protocols. Average read lengths of 4,000 bp across samples allowed for high-quality Circular Consensus Sequences (CCS) across all amplicons (all less than 1 kb). This high-quality CCS data permitted the clean partitioning of reads from a patient in the presence of heterozygous events. 
Using non-carrier sequencing as a control, we were able to correctly identify the known events in carrier genes. This suggests the potential utility of SMRT sequencing in a clinical setting, enabling a cost-effective method of replacing targeted mutation detection with sequencing of the entire gene.
Background: To better understand the relationships among HIV-1 viruses in linked transmission pairs, we sequenced several samples representing HIV transmission pairs from the Zambia Emory HIV Research Project (Lusaka, Zambia) using Single Molecule, Real-Time (SMRT) Sequencing. Methods: Single molecules were sequenced as full-length (9.6 kb) amplicons directly from PCR products without shearing. This resulted in multiple, fully-phased, complete HIV-1 genomes for each patient. We examined Single Genome Amplification (SGA) prepped samples, as well as samples containing complex mixtures of genomes. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information, which were used to derive multiple de novo full-length genome sequences for each patient. Whole genome consensus estimates for each sample were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV-1 genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. Results: SMRT Sequencing data contained multiple full-length (>9 kb) continuous reads for each sample. Simple whole-genome consensus estimates easily identified transmission pairs. Clustering of genome reads showed diversity differences between samples, allowing characterization of the quasi-species diversity comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. 
Our clustering methods segregated reads to their correct originating genomes for the synthetic SGA mixtures. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and their relative abundances by counting methods. The results open up the potential for reference-agnostic and cost-effective full-genome sequencing of HIV-1.
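The two analysis ideas named in this abstract, a simple distance metric on aligned reads and mutual information between alignment columns, can be sketched as follows. This is a hedged illustration only: the Hamming distance, the greedy seed-based clustering, the threshold, and the toy reads are all assumptions, since the abstract does not specify the exact metric or clustering procedure.

```python
from collections import Counter
from math import log2

def hamming(a: str, b: str) -> int:
    """Number of mismatching columns between two aligned reads."""
    return sum(x != y for x, y in zip(a, b))

def cluster(reads, threshold):
    """Greedy clustering: a read joins the first cluster whose seed
    (first member) is within `threshold` mismatches, else starts a new one."""
    clusters = []
    for read in reads:
        for members in clusters:
            if hamming(read, members[0]) <= threshold:
                members.append(read)
                break
        else:
            clusters.append([read])
    return clusters

def mutual_information(reads, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(reads)
    p_i = Counter(r[i] for r in reads)
    p_j = Counter(r[j] for r in reads)
    p_ij = Counter((r[i], r[j]) for r in reads)
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )

# Toy aligned reads from two hypothetical genomes.
reads = ["ACGTA", "ACGTA", "TCGTC", "TCGTC"]
print(len(cluster(reads, threshold=1)))   # 2
print(mutual_information(reads, 0, 4))    # 1.0 (columns fully linked)
print(mutual_information(reads, 0, 1))    # 0.0 (column 1 is invariant)
```

High mutual information between two columns indicates that the bases at those positions co-vary across reads, which is the kind of dependence the abstract describes measuring across the full genome.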
Allele-level sequencing and phasing of full-length HLA class I and II genes using SMRT Sequencing technology
The three classes of genes that comprise the MHC gene family are actively involved in determining donor-recipient compatibility for organ transplant, as well as susceptibility to autoimmune diseases via cross-reacting immunization. Specifically, class I genes HLA-A, -B, and -C, and class II genes HLA-DR, -DQ, and -DP are considered medically important for genetic analysis to determine histocompatibility. They are highly polymorphic and have thousands of alleles implicated in disease resistance and susceptibility. The importance of full-length HLA gene sequencing for genotyping, detection of null alleles, and phasing is now widely acknowledged. While DNA-sequencing-based HLA genotyping has become routine, only 7% of the HLA genes have been characterized by allele-level sequencing, while 93% are still defined by partial sequences. The gold-standard Sanger sequencing technology is quickly being replaced by second-generation, high-throughput sequencing methods because it cannot generate unambiguous phased reads from heterozygous alleles. However, although these short-read, high-throughput, clonal sequencing methods are better at heterozygous allele detection, they are inadequate for generating full-length haploid gene sequences. Thus, full-length gene sequencing from the enhancer-promoter region to the 3’ UTR, including phasing information without the need for imputation, remains a technological challenge. The best way to overcome these challenges is to sequence these genes with a technology that is clonal in nature and has the longest possible read lengths. We have employed Single Molecule, Real-Time (SMRT) sequencing technology from Pacific Biosciences for sequencing full-length HLA class I and II genes.
Background: Genotypic testing of chronic viral infections is an important part of patient therapy and requires assays capable of detecting the entire spectrum of viral mutations. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over other sequencing technologies, including superior resolution of mixed populations and long read lengths capable of spanning entire viral protein coding regions. We examined the detection sensitivity of SMRT sequencing using a mixture of HIV-1 RT gene coding regions containing single NNRTI mutations. Methodology: SMRTbell templates were prepared from PCR products generated from a prospective reference material being developed by BC Center of Excellence for HIV/AIDS. The material contained a mixture of fifteen infectious viruses, each carrying a single NNRTI resistance mutation (viz., V90I, K101E, K103N, V108I, E138A/G/K/Q, V179D, Y181C, Y188C, G190A/S, M230L, and P236L) built upon the HIV-1LAI molecular clone. Templates were sequenced on the PacBio RS II to obtain single-molecule long reads using P4/C2 chemistry and 180-minute movie collection without stage start. The relative abundances of the mutant viruses were then estimated using codon-aware analysis methods. Results: Sequencing of these templates produced average read lengths of 5.0 kb, comprising 40,000-fold coverage across the entire amplicon per SMRT Cell. All the expected mutations in the mixture of mutant viruses were accurately identified. Estimated frequencies of NNRTI variants ranged from 0.5% to 12.5%. Conclusions: Codon analysis revealed a number of variants across the amplicon with highly consistent results across SMRT Cells. From a single SMRT Cell, variants were accurately and reliably detected down to 0.5% with simple analyses. Long polymerase reads and high-accuracy reads make it possible to call variants from just a few molecules. 
SMRT Sequencing can identify species comprising a mixed viral population, with granularity and low cost of consumables allowing for smaller multiplexing of samples and first-in-first-out processing.
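As an illustration of what a codon-aware frequency estimate could look like, the sketch below translates the codon at one position in each in-frame read and reports amino-acid frequencies by simple counting. It is an assumption-laden toy, not the analysis used in the study: reads are assumed to be aligned and in frame, the codons are made up, and `codon_frequencies` is a hypothetical helper. For scale, at the reported 40,000-fold coverage a 0.5% variant corresponds to roughly 200 supporting reads.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}

def codon_frequencies(reads, codon_index):
    """Amino-acid frequencies at a 0-based codon index over in-frame reads.
    Codons containing anything other than A/C/G/T are reported as 'X'."""
    start = 3 * codon_index
    counts = Counter(
        CODON_TABLE.get(read[start:start + 3], "X")
        for read in reads
        if len(read) >= start + 3
    )
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}

# Toy data: 199 reads with a lysine codon and 1 read with an asparagine codon.
reads = ["AAA"] * 199 + ["AAC"]
print(codon_frequencies(reads, 0))  # {'K': 0.995, 'N': 0.005}
```

Counting at the codon level rather than the nucleotide level keeps linked base changes within one codon together, so each read contributes a single amino-acid call per position.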
Background: The use of next-generation sequencing (NGS) to examine circulating HIV env variants has been limited due to env’s length (2.6 kb), extensive indel polymorphism, GC deficiency, and long homopolymeric regions. We developed and standardized protocols for isolation, RT-PCR amplification, Single Molecule, Real-Time (SMRT) sequencing, and haplotype analysis of circulating HIV-1 env variants to evaluate viral diversity in primary infection. Methodology: HIV RNA was extracted from 7 blood plasma samples (1 mL) collected from 5 subjects (one individual sampled and sequenced at 3 time points) in the San Diego Primary Infection Cohort between 3 and 33 months after their estimated date of infection (EDI). Median viral load per sample was 50,118 HIV RNA copies/mL (range: 22,387-446,683). Full-length (3.2 kb) env amplicons were constructed into SMRTbell templates without shearing and sequenced on the PacBio RS II using P4/C2 chemistry and 180-minute movie collection without stage start. To examine viral diversity in each sample, we determined haplotypes by clustering circular consensus sequences (CCS) and reconstructing a cluster consensus sequence using a partial order alignment approach. We measured sample diversity both as the mean pairwise distance among reads and as the fraction of reads containing indel polymorphisms. Results: We collected a median of 8,775 CCS reads per SMRT Cell (range: 4,243-12,234). A median of 7 haplotypes per subject (range: 1-55) was inferred at baseline. For the one subject with longitudinal samples analyzed, we observed an increasing number of distinct haplotypes (8 to 55 haplotypes over the course of 30 months) and an increasing mean pairwise distance among reads (from 0.8% to 1.6%, Tamura-Nei 93). We also observed significant indel polymorphism, with 16% of reads from one sample later in infection (33 months post-EDI) exhibiting deletions of more than 10% of env with respect to the reference strain, HXB2. 
Conclusions: This study developed a standardized NGS procedure (PacBio SMRT) to deep sequence full-length HIV RNA env variants from the circulating viral population, achieving good coverage, confirming low env diversity during primary infection that increased over time, and revealing significant indel polymorphism that highlights structural variation as important to env evolution. The long, accurate reads greatly simplified downstream bioinformatics analyses, especially haplotype phasing, increasing our confidence in the results. The sequencing methodology and analysis tools developed here could be successfully applied to any area for which full-length HIV env analysis would be useful.
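The two diversity measures described in this abstract, mean pairwise distance among reads and the fraction of reads containing indels, can be sketched as below. Note the substitution: the study reports Tamura-Nei 93 distances, whereas this sketch uses the simpler uncorrected p-distance (proportion of differing sites) purely for illustration. Reads are assumed to be aligned to a common coordinate system with `-` marking gaps, and the toy reads are invented.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping columns where either read has a gap."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        return 0.0
    return sum(x != y for x, y in sites) / len(sites)

def mean_pairwise_distance(reads) -> float:
    """Average p-distance over all unordered pairs of reads."""
    pairs = list(combinations(reads, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

def indel_fraction(reads) -> float:
    """Fraction of aligned reads that contain at least one gap character."""
    return sum("-" in r for r in reads) / len(reads)

# Toy alignment: one substitution variant and one read with a 2-base deletion.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "ACG--CGTAC"]
print(round(mean_pairwise_distance(reads), 4))
print(indel_fraction(reads))  # 0.25
```

A model-based distance such as Tamura-Nei 93 additionally corrects for multiple substitutions and unequal base frequencies, so its values would differ somewhat from the raw proportions computed here.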
The long read lengths of PacBio’s SMRT Sequencing enable detection of linked mutations across multiple kilobases of sequence. This feature is particularly useful in the context of protein engineering, where large numbers of similar constructs are generated routinely to explore the effects of mutations on function and stability. We have developed a PCR-based barcoded sequencing method to generate high quality, full-length sequence data for batches of constructs generated in a common backbone. Individual barcodes are coupled to primers targeting a common region of the vector of interest. The amplified products are pooled into a single DNA library, and sequencing data are clustered by barcode to generate multi-molecule consensus sequences for each construct present in the pool. As a proof-of-concept dataset, we have generated a library of 384 randomly mutated variants of the Phi29 DNA polymerase, a 575 amino acid protein encoded by a 1.7 kb gene. These variants were amplified with a set of barcoded primers, and the resulting library was sequenced on a single SMRT Cell. The data produced sequences that were completely concordant with independent Sanger sequencing, for a 100% accurate reconstruction of the set of clones.
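The barcode-then-consensus workflow described above might be sketched as follows. This is a deliberately simplified illustration under stated assumptions: barcodes are matched exactly at the start of each read, inserts are already aligned and of equal length, and the barcode sequences, clone names, and reads are all hypothetical. Real reads would need error-tolerant barcode matching and alignment before consensus calling.

```python
from collections import Counter, defaultdict

def demultiplex(reads, barcodes):
    """Group reads by the barcode found at the read start; strip the barcode."""
    bins = defaultdict(list)
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[name].append(read[len(barcode):])
                break
    return bins

def consensus(seqs):
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

# Hypothetical barcodes and toy reads; one clone_A read carries an error.
barcodes = {"clone_A": "ACAC", "clone_B": "TGTG"}
reads = ["ACACGATTACA", "ACACGATTACA", "ACACGATAACA",
         "TGTGGATCACA", "TGTGGATCACA"]

for name, inserts in demultiplex(reads, barcodes).items():
    print(name, consensus(inserts))
# clone_A GATTACA
# clone_B GATCACA
```

Pooling several molecules per barcode is what lets a single-read error, such as the third clone_A read here, be outvoted in the multi-molecule consensus.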
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers in large genome complexities, such as long, highly repetitive, low-complexity regions and duplication events, and for differentiating between transcript isoforms that are difficult to resolve with short-read technologies. We present solutions available for both reference genome improvement (>100 Mb) and transcriptome research to best leverage long reads that have exceeded 20 kb in length. Benefits for these applications are further realized with consistent use of size selection of the input sample using the BluePippin™ device from Sage Science. Highlights from our genome assembly projects using the latest P5-C3 chemistry on model organisms will be shared. Contig N50 values have exceeded 6 Mb, and the longest contig observed exceeded 12.5 Mb, with an average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq Application will be presented.
A comparison of assemblers and strategies for complex, large-genome sequencing with PacBio long reads.
PacBio sequencing holds promise for addressing large-genome complexities, such as long, highly repetitive, low-complexity regions and duplication events that are difficult to resolve with short-read technologies. Several strategies, with varying outcomes, are available for de novo sequencing and assembling of larger genomes. Using a diploid fungal genome, estimated to be ~80 Mb in size, as the basis dataset for comparison, we highlight assembly options when using only PacBio sequencing or a combined strategy leveraging data sets from multiple sequencing technologies. Data generated from SMRT Sequencing was subjected to assembly using different large-genome assemblers, and comparisons of the results will be shown. These include results generated with HGAP, Celera Assembler, MIRA, PBJelly, and other assembly tools currently in development. Improvements observed include a near 50% reduction in the number of contigs coupled with at least a doubling of contig N50 size in genome assemblies incorporating SMRT Sequencing data. We further show how incorporating long reads also highlights new challenges and missed insights of short-read assemblies arising from heterozygosity inherent in multiploid genomes.
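Because the comparison above is expressed in contig counts and contig N50, a short sketch of the N50 calculation may be useful. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths below are invented toy values, not results from the fungal assemblies.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy example: the same 300 units of sequence in 7 contigs versus 4 contigs.
fragmented = [80, 70, 50, 40, 30, 20, 10]
improved = [150, 90, 40, 20]
print(len(fragmented), n50(fragmented))  # 7 70
print(len(improved), n50(improved))      # 4 150
```

Fewer contigs and a larger N50 for the same total length indicate a more contiguous assembly, which is the direction of the improvement the abstract reports.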
Background: Microbial ecology is reshaping our understanding of the natural world by revealing the large phylogenetic and functional diversity of microbial life. However, the vast majority of these microorganisms remain poorly understood, as most cultivated representatives belong to just four phylogenetic groups and more than half of all identified phyla remain uncultivated. Characterization of this microbial ‘dark matter’ will thus greatly benefit from new metagenomic methods for in situ analysis. One example is sensitive, high-throughput methods that characterize community composition and structure by sequencing conserved marker genes. Methods: Here we utilize Single Molecule, Real-Time (SMRT) sequencing of full-length 16S rRNA amplicons to phylogenetically profile microbial communities to below the genus level. We test this method on a mock community of known composition, as well as on a previously studied microbial community from a lake known to predominantly contain poorly characterized phyla. These results are compared to traditional 16S tag sequencing from short-read technologies and to subsets of the full-length data corresponding to the same regions of the 16S gene. Results: We explore the benefits of using full-length amplicons for estimating community structure and diversity. In addition, we investigate the possible effects of context-specific and GC-content biases known to affect short-read sequencing technologies on the predicted community structure. We characterize the potential benefits of profiling metagenomic communities with full-length 16S rRNA genes from SMRT sequencing relative to standard methods.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed because of the coverage biases associated with short-read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obviate the need for assembly. Direct sequencing of DNA in real time has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). 
This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
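The quality values quoted in this abstract (QV60 and QV50) are on the Phred scale, where QV = -10 log10(error probability). The short sketch below converts between QV and accuracy and reproduces the correspondence between QV60 and 99.9999% stated above; the function names are illustrative.

```python
from math import log10

def qv_to_accuracy(qv: float) -> float:
    """Phred-scaled quality value to per-base accuracy."""
    return 1 - 10 ** (-qv / 10)

def accuracy_to_qv(accuracy: float) -> float:
    """Per-base accuracy to Phred-scaled quality value."""
    return -10 * log10(1 - accuracy)

for qv in (50, 60):
    errors_per_mb = 10 ** (-qv / 10) * 1_000_000
    print(f"QV{qv}: accuracy {qv_to_accuracy(qv):.6%}, "
          f"about {errors_per_mb:g} expected errors per Mb")
```

On this scale, QV50 corresponds to about 10 expected errors per megabase of consensus sequence and QV60 to about 1.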
HLA sequencing using SMRT Technology – High resolution and high throughput HLA genotyping in a clinical setting
Sequence-based typing (SBT) is considered the gold-standard method for HLA typing. Current SBT methods are rather laborious and are prone to phase ambiguity problems and genotyping uncertainties. As a result, the NGS community is rapidly seeking to remedy these challenges in order to produce high-resolution, high-throughput HLA sequencing conducive to a clinical setting. Today, second-generation NGS technologies are limited in their ability to yield the full-length HLA sequences required for adequate phasing and identification of novel alleles. Here we present the use of Single Molecule, Real-Time (SMRT) sequencing as a means of determining full-length/long HLA sequences. Moreover, we demonstrate the scalability of this method through multiplexing approaches and determine HLA genotyping calls through the use of third-party GenDx NGSengine® software.
Single Molecule, Real-Time (SMRT) Sequencing provides efficient, streamlined solutions to address new frontiers in plant genomes and transcriptomes. Inherent challenges presented by highly repetitive, low-complexity regions and duplication events are directly addressed with multi-kilobase read lengths exceeding 8.5 kb on average, with many exceeding 20 kb. Differentiating between transcript isoforms that are difficult to resolve with short-read technologies is also now possible. We present solutions available for both reference genome and transcriptome research that best leverage long reads in several plant projects including algae, Arabidopsis, rice, and spinach using only the PacBio platform. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. We will share highlights from our genome projects using the latest P5-C3 chemistry to generate high-quality reference genomes with the highest contiguity, contig N50 exceeding 1 Mb, and average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq protocol will be presented for full transcriptome characterization and targeted surveys of genes with complex structures. PacBio provides the most comprehensive assembly with annotation when combining offerings for both genome and transcriptome research efforts. For more focused investigation, PacBio also offers researchers opportunities to easily investigate and survey genes with complex structures.
Highly sensitive, non-invasive detection of colorectal cancer mutations using single molecule, third generation sequencing.
Colorectal cancer (CRC) represents one of the most prevalent and lethal malignant neoplasms, and every individual of age 50 and above should undergo regular CRC screening. Currently, the most effective procedure to detect adenomas, the precursors to CRC, is colonoscopy, which reduces CRC incidence by 80%. However, it is an invasive approach that is unpleasant for the patient, expensive, and poses some risk of complications such as colon perforation. A non-invasive screening approach with detection rates comparable to those of colonoscopy has not yet been established. The current study applies Pacific Biosciences third-generation, single-molecule sequencing to the inspection of CRC-driving mutations. Our approach combines the screening power and the extremely high accuracy of third-generation circular consensus sequencing (CCS) with the non-invasiveness of using stool DNA to detect CRC-associated mutations present at extremely low frequencies, and it establishes a foundation for a non-invasive, highly sensitive assay to screen the population for CRC and early-stage adenomas. We performed a series of experiments using a pool of fifteen amplicons covering the genes most frequently mutated in CRC (APC, Beta Catenin, KRAS, BRAF, and TP53), ensuring a theoretical screening coverage of over 97% for both CRC and adenomas. The assay was able to detect mutations in DNA isolated from stool samples from patients diagnosed with CRC at frequencies below 0.5% with no false positives. The mutations were then confirmed by sequencing DNA isolated from the excised tumor samples. Our assay should be sensitive enough to allow the early identification of adenomatous polyps using stool DNA as analyte. In conclusion, we have developed an assay to detect mutations in the genes associated with CRC and adenomas using Pacific Biosciences RS Single Molecule, Real-Time Circular Consensus Sequencing (SMRT-CCS). 
With no systematic bias and a much higher raw base-calling quality (CCS) compared to other sequencing methods, the assay was able to detect mutations in stool DNA at frequencies below 0.5% with no false positives. This level of sensitivity should be sufficient to allow the detection of most adenomatous polyps using stool DNA as analyte, a feature that would make our approach the first non-invasive assay with a sensitivity comparable to that of colonoscopy and a strong candidate for the non-invasive preventive CRC screening of the general population.
Fully phased allele-level sequencing of highly polymorphic HLA genes is greatly facilitated by SMRT Sequencing technology. In the present work, we have evaluated multiple DNA barcoding strategies for multiplexing several loci from multiple individuals, using three different tagging methods. Specifically, MHC class I genes HLA-A, -B, and -C were indexed via DNA barcodes by either tailed primers or barcoded SMRTbell adapters. Eight different 16-bp barcode sequences were used in symmetric and asymmetric pairing. Eight DNA-barcoded adapters in symmetric pairing were independently ligated to a pool of HLA-A, -B, and -C amplicons for eight different individuals, one at a time, and pooled for sequencing on a single SMRT Cell. Amplicons generated from barcoded primers were pooled upfront for library generation. Eight symmetric barcoded primers were generated for HLA class I genes. These primers facilitated multiplexing of 8 samples and also allowed generation of unique asymmetric pairings for simultaneous amplification from 28 reference genomic DNA samples. The data generated from all 3 methods were analyzed using the LAA protocol in SMRT Analysis v2.3. The consensus sequences generated were typed using GenDx NGSengine HLA-typing software.
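The multiplexing capacity of the two pairing schemes can be checked with a short enumeration. With eight barcodes, symmetric pairing (the same barcode on both ends) gives 8 combinations, while asymmetric pairing, taken here to mean unordered pairs of two different barcodes, gives 28. That reading is an assumption, though it is consistent with the 28 reference samples mentioned above. The barcode names are placeholders.

```python
from itertools import combinations

barcodes = [f"bc{i}" for i in range(1, 9)]  # eight placeholder barcode names

# Symmetric pairing: the same barcode on both ends of the amplicon.
symmetric = [(b, b) for b in barcodes]

# Asymmetric pairing: two different barcodes, order ignored.
asymmetric = list(combinations(barcodes, 2))

print(len(symmetric), len(asymmetric))  # 8 28
```

If the orientation of the two barcodes were also treated as distinguishable, the number of asymmetric combinations would double to 56; the unordered count is the conservative choice.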