Alleles of the FMR1 gene with more than 200 CGG repeats generally undergo methylation-coupled gene silencing, resulting in fragile X syndrome, the leading heritable form of cognitive impairment. Smaller expansions (55-200 CGG repeats) result in elevated levels of FMR1 mRNA, which is directly responsible for the late-onset neurodegenerative disorder, fragile X-associated tremor/ataxia syndrome (FXTAS). For mechanistic studies and genetic counseling, it is important to know with precision the number of CGG repeats; however, no existing DNA sequencing method is capable of sequencing through more than ~100 CGG repeats, thus limiting the ability to precisely characterize the disease-causing alleles. The recent development of single molecule, real-time sequencing represents a novel approach to DNA sequencing that couples the intrinsic processivity of DNA polymerase with the ability to read polymerase activity on a single-molecule basis. Further, the accuracy of the method is improved through the use of circular templates, such that each molecule can be read multiple times to produce a circular consensus sequence (CCS). We have succeeded in generating CCS reads representing multiple passes through both strands of repeat tracts exceeding 700 CGGs (>2 kb of 100 percent CG) flanked by native FMR1 sequence, with single-molecule readlengths exceeding 12 kb. This sequencing approach thus enables us to fully characterize the previously intractable CGG-repeat sequence, leading to a better understanding of the distinct associated molecular pathologies. Real-time kinetic data also provides insight into the activity of DNA polymerase inside this unique sequence. The methodology should be widely applicable for studies of the molecular pathogenesis of an increasing number of repeat expansion-associated neurodegenerative and neurodevelopmental disorders, and for the efficient identification of such disorders in the clinical setting.
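As a small illustration of why an exact repeat count matters, the following minimal sketch counts the longest uninterrupted CGG run in a read and bins it using only the ranges stated in the abstract above. The function names and the toy read are hypothetical and are not part of the authors' pipeline; real alleles may contain interruptions (for example AGG units), which this sketch would treat as ending the run.

```python
import re


def longest_cgg_run(sequence: str) -> int:
    """Return the number of CGG units in the longest uninterrupted CGG run."""
    runs = re.findall(r"(?:CGG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_allele(repeat_count: int) -> str:
    """Bin a repeat count using only the ranges given in the abstract."""
    if repeat_count > 200:
        return "full mutation range (>200)"
    if repeat_count >= 55:
        return "premutation range (55-200)"
    return "below premutation range (<55)"


# Toy read: made-up flanks around 120 CGG units (not real FMR1 sequence).
toy_read = "ATGCAT" + "CGG" * 120 + "TTAGCA"
count = longest_cgg_run(toy_read)
print(count, classify_allele(count))
```

A production analysis would work from consensus reads and would need to handle sequencing errors inside the tract; this sketch assumes an error-free sequence.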
Evaluating the potential of new sequencing technologies for genotyping and variation discovery in human data.
A first look at Pacific Biosciences RS data
Pacific Biosciences technology provides a fundamentally new data type that has the potential to overcome these limitations by offering significantly longer reads (now averaging >1 kb), enabling more unique seeds for reference alignment. In addition, the lack of amplification in the library construction step avoids a common source of base composition bias. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical resequencing projects by assessing the quality of the raw sequencing data, as well as its use for SNP discovery and genotyping using the Genome Analysis Toolkit (GATK).
Sequencing and de novo assembly of the 17q21.31 disease associated region using long reads generated by Pacific Biosciences SMRT Sequencing technology.
Assessment of genome-wide variation revealed regions of the genome with complex, structurally diverse haplotypes that are insufficiently represented in the human reference genome. The 17q21.31 region is one of the most dynamic and complex regions of the human genome. Different haplotypes exist, in direct and inverted orientation, showing evidence of positive selection and predisposing to microdeletion associated with mental retardation. Sequencing of different haplotypes is extremely important to characterize the spectrum of structural variation at this locus. However, de novo assembly with second-generation sequencing reads is still problematic. Using PacBio technology, we have sequenced and de novo assembled a tiling path of eight BAC clones (~1.6 Mb region) across this medically relevant region from the library of a hydatidiform mole. Complete hydatidiform moles arise from the fertilization of an enucleated egg by a single sperm and therefore carry a haploid complement of the human genome, eliminating allelic variation that may confound mapping and assembly. The PacBio RS system enables single molecule real time sequencing, featuring long reads and fast turnaround times. With deep sequencing, PacBio reads generated very uniform coverage, covering close to 100% of most of the target intervals. Due to long read lengths, the PacBio RS data could be accurately assembled.
In today’s clinical diagnostic laboratories, disease-causing mutations are detected either by genotyping or by Sanger sequencing. Whether done singly or in a multiplex assay, genotyping works only if the exact molecular change is known. Sanger sequencing is the gold-standard method that captures both known and novel molecular changes in the disease gene of interest. Most clinical Sanger sequencing assays involve PCR-amplifying the coding sequences of the disease target gene followed by bi-directional sequencing of the amplified products. Therefore, for every patient sample, multiple amplicons are generated individually, and each amplicon leads to two separate sequencing reactions. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over Sanger sequencing, including long read lengths, first-in-first-out processing, fast time to result, high levels of multiplexing and substantially reduced costs. For our first proof-of-concept experiment, we queried 3 known disease-associated mutations in de-identified clinical samples. We started with 3 autosomal recessive diseases found at an increased frequency in the Ashkenazi Jewish population: Tay-Sachs disease, Niemann-Pick disease and Canavan disease. The mutated gene in Tay-Sachs is HEXA, in Niemann-Pick SMPD1, and in Canavan ASPA. Coding exons were amplified in multiple (6-13) amplicons for each gene from both non-carriers and carriers. Amplicons were purified, concentrations normalized, and combined prior to SMRTbell™ library prep. A single SMRTbell library was sequenced for each gene from each patient using standard Pacific Biosciences C2 chemistry and protocols. Average read lengths of 4,000 bp across samples allowed for high-quality Circular Consensus Sequences (CCS) across all amplicons (all less than 1 kb). This high-quality CCS data permitted the clean partitioning of reads from a patient in the presence of heterozygous events. 
Using non-carrier sequencing as a control, we were able to correctly identify the known events in carrier genes. This suggests the potential utility of SMRT sequencing in a clinical setting, enabling a cost-effective method of replacing targeted mutation detection with sequencing of the entire gene.
Complete HIV-1 genomes from single molecules: Diversity estimates in two linked transmission pairs using clustering and mutual information.
We sequenced complete HIV-1 genomes from single molecules using Single Molecule, Real-Time (SMRT) Sequencing and derived de novo full-length genome sequences. SMRT sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information. Sequencing was performed in order to better understand the relationships between the specific sequences of transmitted viruses in linked transmission pairs. Samples representing HIV transmission pairs were selected from the Zambia Emory HIV Research Project (Lusaka, Zambia) and sequenced. We examined Single Genome Amplification (SGA) prepped samples and samples containing complex mixtures of genomes. Whole genome consensus estimates were made for each of the samples. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes, and our clustering methods were applied to these complex mixtures. SMRT Sequencing data contained multiple full-length (greater than 9 kb) continuous reads for each sample. Simple whole genome consensus estimates easily identified transmission pairs. The clustering of the genome reads showed diversity differences between the samples, allowing us to characterize the diversity of the individual quasi-species comprising the patient viral populations across the full genome. 
Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. The results open up the potential for reference-agnostic and cost effective full genome sequencing of HIV-1.
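The mutual information measure mentioned above can be sketched as follows. This is a minimal illustration, assuming reads have already been aligned so that each read contributes one character per alignment column; the function name and the toy alignment are hypothetical, and the abstract does not specify the authors' exact estimator or thresholds.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b hold one character per read, in the same read order.
    """
    n = len(col_a)
    if n == 0 or n != len(col_b):
        raise ValueError("columns must be non-empty and of equal length")
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    counts_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), joint in counts_ab.items():
        p_ab = joint / n
        p_a = counts_a[a] / n
        p_b = counts_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment of four reads: columns 0 and 3 vary together, column 1 is constant.
aligned_reads = ["ACGT", "ACGT", "TCGA", "TCGA"]
columns = list(zip(*aligned_reads))
print(column_mutual_information(columns[0], columns[3]))
print(column_mutual_information(columns[0], columns[1]))
```

Columns that always change together score high, while a column with no variation scores zero against any other column, which is why the measure can flag positions that may be linked.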
AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technology.
Advances in sequence consensus and clustering algorithms for effective de novo assembly and haplotyping applications.
One of the major applications of DNA sequencing technology is to bring together information that is distant in sequence space so that understanding genome structure and function becomes easier on a large scale. The Single Molecule Real Time (SMRT) Sequencing platform provides direct sequencing data that can span several thousand bases to tens of thousands of bases in a high-throughput fashion. In contrast to solving genomic puzzles by patching together smaller pieces of information, long sequence reads can decrease potential computational complexity by significantly reducing combinatorial factors. We demonstrate algorithmic approaches to construct accurate consensus when the differences between reads are dominated by insertions and deletions. High-performance implementations of such algorithms allow more efficient de novo assembly with a pre-assembly step that generates highly accurate, consensus-based reads which can be used as input for existing genome assemblers. In contrast to recent hybrid assembly approaches, only a single ~10 kb or longer SMRTbell library is necessary for the hierarchical genome assembly process (HGAP). Meanwhile, by combining a sensitive read-clustering algorithm with the consensus algorithms, one is able to discern haplotypes that differ from each other by less than 1% over a large region. One of the related applications is to generate accurate haplotype sequences for HLA loci. Long sequence reads that can cover the whole 3 kb to 4 kb diploid genomic regions will simplify the haplotyping process. These algorithms can also be applied to resolve individual populations within mixed pools of DNA molecules that are similar to each other, e.g., by sequencing viral quasi-species samples.
Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods which provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA Sequencing with short reads (SMRT CCS (circular consensus) or second-generation reads), wherein the short reads are used to error-correct the long reads which are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which SMRT sequencing reads from a single long insert library are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT Sequencing reads is utilized to determine epigenomes. 
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. With relatively short sequencing run times and automated analysis pipelines, it is possible to go from an unknown DNA sample to its complete de novo genome and epigenome in about a day.
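The pairing of QV50 with >99.999% accuracy above is consistent with the standard Phred convention, in which QV = -10 log10(error rate). The sketch below simply applies that convention; the function names are hypothetical, and the abstract does not describe how the consensus accuracy itself was measured.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to the fraction of correct bases."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a fraction of correct bases to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(genome_length: int, qv: float) -> float:
    """Expected number of erroneous bases in a genome of the given length."""
    return genome_length * 10.0 ** (-qv / 10.0)


print(f"{qv_to_accuracy(50):.5%}")
print(expected_errors(5_000_000, 50))
```

Under this convention, a hypothetical 5 Mb bacterial genome at QV50 would be expected to contain on the order of 50 erroneous bases.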
Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods which provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA sequencing with short, high-accuracy reads (SMRT circular consensus sequencing (CCS) or second-generation reads) to generate long, highly accurate reads that are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which long SMRT sequencing reads (average read lengths >5,000 bases) are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT sequencing reads is utilized to determine epigenomes. 
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. The process has been automated and requires less than 1 day from an unknown DNA sample to its complete de novo genome and epigenome.
Isoform sequencing: Unveiling the complex landscape of the eukaryotic transcriptome on the PacBio RS II.
Alternative splicing of RNA is an important mechanism that increases protein diversity and is pervasive in the most complex biological functions. While advances in RNA sequencing methods have accelerated our understanding of the transcriptome, isoform discovery remains computationally challenging due to short read lengths. Here, we describe the Isoform Sequencing (Iso-Seq) method using long reads generated by the PacBio RS II. We sequenced rat heart and lung RNA using the Clontech® SMARTer® cDNA preparation kit followed by size selection using agarose gel. Additionally, we tested the BluePippin™ device from Sage Science for efficiently extracting longer transcripts (≥ 3 kb). Post-sequencing, we developed a novel isoform-level clustering algorithm to generate high-quality transcript consensus sequences. We show that our method recovered alternative splice forms as well as alternative stop sites, antisense transcription, and retained introns. To conclude, the Iso-Seq method provides a new opportunity for researchers to study the complex eukaryotic transcriptome even in the absence of reference genomes or annotated transcripts.
Allele-level sequencing and phasing of full-length HLA class I and II genes using SMRT Sequencing technology
The three classes of genes that comprise the MHC gene family are actively involved in determining donor-recipient compatibility for organ transplant, as well as susceptibility to autoimmune diseases via cross-reacting immunization. Specifically, class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DQ and -DP are considered medically important for genetic analysis to determine histocompatibility. They are highly polymorphic and have thousands of alleles implicated in disease resistance and susceptibility. The importance of full-length HLA gene sequencing for genotyping, detection of null alleles, and phasing is now widely acknowledged. While DNA-sequencing-based HLA genotyping has become routine, only 7% of the HLA genes have been characterized by allele-level sequencing, while 93% are still defined by partial sequences. The gold-standard Sanger sequencing technology is being quickly replaced by second-generation, high-throughput sequencing methods because Sanger sequencing cannot generate unambiguous phased reads from heterozygous alleles. However, although these short, high-throughput, clonal sequencing methods are better at heterozygous allele detection, they are inadequate at generating full-length haploid gene sequences. Thus, full-length gene sequencing from an enhancer-promoter region to a 3’ UTR that includes phasing information without the need for imputation still remains a technological challenge. The best way to overcome these challenges is to sequence these genes with a technology that is clonal in nature and has the longest possible read lengths. We have employed Single Molecule Real-Time (SMRT) sequencing technology from Pacific Biosciences for sequencing full-length HLA class I and II genes.
PacBio RS II sequencing chemistries provide read lengths beyond 20 kb with high consensus accuracy. The long read lengths of P4-C2 chemistry and demonstrated consensus accuracy of 99.999% are ideal for applications such as de novo assembly, targeted sequencing and isoform sequencing. The recently launched P5-C3 chemistry generates even longer reads with N50 often >10,000 bp, making it the best choice for scaffolding and spanning structural rearrangements. With these chemistry advances, PacBio’s read length performance is now primarily determined by the SMRTbell library itself. Size selection of a high-quality, sheared 20 kb library using the BluePippin™ System has been demonstrated to increase the N50 read length by as much as 5 kb with C3 chemistry. BluePippin size selection or a more stringent AMPure® PB selection cutoff can be used to recover long fragments from degraded genomic material. The selection of chemistries, P4-C2 versus P5-C3, is highly dependent on the final size distribution of the SMRTbell library and experimental goals. PacBio’s long read lengths also allow for the sequencing of full-length cDNA libraries at single-molecule resolution. However, longer transcripts are difficult to detect due to lower abundance, amplification bias, and preferential loading of smaller SMRTbell constructs. Without size selection, most sequenced transcripts are 1-1.5 kb. Size selection dramatically increases the number of transcripts >1.5 kb, and is essential for >3 kb transcripts.
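Because the paragraph above reports read-length performance as N50, a short sketch of the usual N50 definition may help: the length L such that reads of length L or longer contain at least half of all sequenced bases. The function name and the toy read lengths are hypothetical, and the 5,000 bp cutoff below only mimics the idea of size selection; it is not a BluePippin or AMPure setting from the source.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the largest length L such that reads of length >= L together
    contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("at least one read length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy library: many short inserts drag the N50 down.
read_lengths = [1000, 1500, 2000, 8000, 12000]
print(n50(read_lengths))

# Mimic size selection by discarding reads below a hypothetical 5,000 bp cutoff.
size_selected = [length for length in read_lengths if length >= 5000]
print(n50(size_selected))
```

In this toy example, removing the short fragments raises the N50, which is the direction of the effect the paragraph attributes to size selection.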
The newer hierarchical genome assembly process (HGAP) performs de novo assembly using data from a single PacBio long insert library. To assess the benefits of this method, DNA from several Salmonella enterica serovars was isolated from pure cultures. Genome sequencing was performed using Pacific Biosciences RS sequencing technology. The HGAP process enabled us to close sixteen Salmonella enterica subsp. enterica genomes and their associated mobile elements. The ten serotypes include Salmonella enterica subsp. enterica serovar Enteritidis (S. Enteritidis), S. Bareilly, S. Heidelberg, S. Cubana, S. Javiana, S. Typhimurium, S. Newport, S. Montevideo, S. Agona, and S. Tennessee. In addition, we were able to detect novel methyltransferases (MTases) by using the Pacific Biosciences kinetic score distributions, showing that each serovar appears to have a novel methylation pattern. For example, while all Salmonella serovars examined so far have methylase-specific activity for 5’-GATC-3’/3’-CTAG-5’ and 5’-CAGAG-3’/3’-GTCTC-5’ (underlined base indicates a modification), S. Heidelberg is uniquely specific for 5’-ACCANCC-3’/3’-TGGTNGG-5’, while S. Typhimurium is uniquely specific for 5’-GATCAG-3’/3’-CTAGTC-5’ sites, for the samples examined so far. We believe that this may be due to the unique environments and phages that these serotypes have been exposed to. Furthermore, our analysis identified and closed a variety of plasmids such as mobilization plasmids, antimicrobial resistance plasmids and IncX plasmids carrying a Type IV secretion system (T4SS). The VirB/D4 T4SS apparatus is important in that it assists with rapid dissemination of antibiotic resistance and virulence determinants. Presently, only limited information exists regarding the genotypic characterization of drug resistance in S. Heidelberg isolates derived from various host species. Here, we characterize two S. Heidelberg isolates from two different outbreaks. 
Both isolates contain the IncX plasmid of approximately 35 kb, which carries the genes virB1, virB2, virB3/4, virB5, virB6, virB7, virB8, virB9, virB10, virB11, virD2, and virD4 that are associated with the T4SS. In addition, the outbreak isolate associated with ground turkey carries a 4,473 bp mobilization plasmid and an incompatibility group (Inc) I1 antimicrobial resistance plasmid encoding resistance to gentamicin (aacC2), beta-lactam (bl2b_tem), streptomycin (aadAI) and tetracycline (tetA, tetR), while the outbreak isolate associated with chicken breast carries the IncI1 plasmid encoding resistance to gentamicin (aacC2), streptomycin (aadAI) and sulfisoxazole (sul1). Using this new technology, we explored the genetic elements present in resistant pathogens, which will lead to a better understanding of the evolution of Salmonella.
Background: Genotypic testing of chronic viral infections is an important part of patient therapy and requires assays capable of detecting the entire spectrum of viral mutations. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over other sequencing technologies, including superior resolution of mixed populations and long read lengths capable of spanning entire viral protein coding regions. We examined the detection sensitivity of SMRT sequencing using a mixture of HIV-1 RT gene coding regions containing single NNRTI mutations. Methodology: SMRTbell templates were prepared from PCR products generated from a prospective reference material being developed by BC Center of Excellence for HIV/AIDS, which contained a mixture of fifteen infectious viruses containing single NNRTI resistance mutations (viz. V90I, K101E, K103N, V108I, E138A/G/K/Q, V179D, Y181C, Y188C, G190A/S, M230L and P236L) built upon the HIV-1LAI molecular clone. Templates were sequenced on the PacBio RS II to obtain single-molecule long reads using P4/C2 chemistry and 180-minute movie collection without stage start. The relative abundances of the mutant viruses were then estimated using codon-aware analysis methods. Results: Sequencing of these templates produced average read lengths of 5.0 kb, comprising 40,000-fold coverage across the entire amplicon per SMRT Cell. All the expected mutations in the mixture of mutant viruses were accurately identified. Estimated frequencies of NNRTI variants ranged from 0.5% to 12.5%. Conclusions: Codon analysis revealed a number of variants across the amplicon with highly consistent results across SMRT Cells. From a single SMRT Cell, variants were accurately and reliably detected down to 0.5% with simple analyses. Long polymerase reads and high-accuracy reads make it possible to call variants from just a few molecules. 
SMRT Sequencing can identify species comprising a mixed viral population, with granularity and low cost of consumables allowing for smaller multiplexing of samples and first-in-first-out processing.
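The abstract does not describe its codon-aware analysis in detail, so the following is only a minimal sketch of one way such a frequency estimate could be made: translate the codon at a chosen position in each in-frame, aligned read and count the resulting amino acids. The function name, the toy reads, and the choice of codons are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid_frequencies(reads, codon_index):
    """Estimate amino acid frequencies at one codon from in-frame aligned reads.

    Reads whose codon contains a gap, an ambiguous base, or is truncated
    are skipped rather than guessed.
    """
    counts = Counter()
    start = codon_index * 3
    for read in reads:
        codon = read[start:start + 3].upper()
        if codon in CODON_TABLE:
            counts[CODON_TABLE[codon]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: count / total for aa, count in counts.items()}


# Toy data: 199 reads with a lysine codon and 1 read with an asparagine codon
# at codon index 1 (flanking codons are made up).
toy_reads = ["ATGAAAGGG"] * 199 + ["ATGAACGGG"]
print(amino_acid_frequencies(toy_reads, 1))
```

For scale, at the 40,000-fold coverage reported above, a variant present at 0.5% would be expected in roughly 200 reads, since 40,000 × 0.005 = 200.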
A novel analytical pipeline for de novo haplotype phasing and amplicon analysis using SMRT Sequencing technology.
While the identification of individual SNPs has been readily available for some time, the ability to accurately phase SNPs and structural variation across a haplotype has been a challenge. With individual reads of an average length of 9 kb (P5-C3), and individual reads beyond 30 kb in length, SMRT Sequencing technology allows the identification of mutation combinations such as microdeletions, insertions, and substitutions without any predetermined reference sequence. Long-amplicon analysis is a novel protocol that identifies and reports the abundance of differing clusters of sequencing reads within a single library. Graphs generated via hierarchical clustering of individual sequencing reads are used to generate Markov models representing the consensus sequence of individual clusters found to be significantly different. Long-amplicon analysis is capable of differentiating between underlying sequences that are 99.9% similar, which is suitable for haplotyping and differentiating pseudogenes from coding transcripts. This protocol allows for the identification of structural variation in the MUC5AC gene sequence, despite the presence of a gap in the current genome assembly, and can also be used for HLA haplotyping. Clustering can also be applied to identify full-length transcripts for the purpose of estimating consensus sequences and enumerating isoform types. Long-amplicon analysis allows for the elucidation of complex regions otherwise missed by other sequencing technologies, which may contribute to the diagnosis and understanding of otherwise complex diseases.