To bring personalized medicine to all patients, cancer researchers need more reliable and comprehensive views of somatic variants of all sizes that drive cancer biology.
2015 SMRT Informatics Developers Conference Presentation Slides: Shinichi Morishita of the University of Tokyo presented on how his team has been using SMRT Sequencing to better understand methylomes, metagenomes and structural variation of various eukaryotic genomes.
In addition to the genome and transcriptome, epigenetic information is essential to understand biological processes and their regulation, and their misregulation underlying disease. Traditionally, epigenetic DNA modifications are detected using upfront sample preparation steps such as bisulfite conversion, followed by sequencing. Bisulfite sequencing has provided a wealth of knowledge about human epigenetics, however it does not access the entire genome due to limitations in read length and GC- bias of the sequencing technologies used. In contrast, Single Molecule, Real-Time (SMRT) DNA Sequencing is unique in that it can detect DNA base modifications as part of the sequencing process. It can thereby leverage the long read lengths and lack of GC bias for more comprehensive views of the human epigenome. I will highlight several examples of this capability towards the generation of new biological insights, including the resolution of methylation states in repetitive and GC-rich regions of the genome, and large-scale changes in the methylation status across a cancer genome as a function of drug sensitivity.
Purpose: Clinical laboratories, research laboratories and technology developers all need DNA samples with reliably known genotypes in order to help validate and improve their methods. The Genome in a Bottle Consortium (genomeinabottle.org) has been developing Reference Materials with high-accuracy whole genome sequences to support these efforts.Methodology: Our pilot reference material is based on Coriell sample NA12878 and was released in May 2015 as NIST RM 8398 (tinyurl.com/giabpilot). To minimize bias and improve accuracy, 11 whole-genome and 3 exome data sets produced using 5 different technologies were integrated using a systematic arbitration method . The Genome in a Bottle Analysis Group is adapting these methods and developing new methods to characterize 2 families, one Asian and one Ashkenazi Jewish from the Personal Genome Project, which are consented for public release of sequencing and phenotype data. We have generated a larger and even more diverse data set on these samples, including high-depth Illumina paired-end and mate-pair, Complete Genomics, and Ion Torrent short-read data, as well as Moleculo, 10X, Oxford Nanopore, PacBio, and BioNano Genomics long-read data. We are analyzing these data to provide an accurate assessment of not just small variants but also large structural variants (SVs) in both “easy” regions of the genome and in some “hard” repetitive regions. We have also made all of the input data sources publicly available for download, analysis, and publication.Results: Our arbitration method produced a reference data set of 2,787,291 single nucleotide variants (SNVs), 365,135 indels, 2744 SVs, and 2.2 billion homozygous reference calls for our pilot genome. We found that our call set is highly sensitive and specific in comparison to independent reference data sets. We have also generated preliminary assemblies and structural variant calls for the next 2 trios from long read data and are currently integrating and validating these.Discussion: We combined the strengths of each of our input datasets to develop a comprehensive and accurate benchmark call set. In the short time it has been available, over 20 published or submitted papers have used our data. Many challenges exist in comparing to our benchmark calls, and thus we have worked with the Global Alliance for Genomics and Health to develop standardized methods, performance metrics, and software to assist in its use. Zook et al, Nat Biotech. 2014.
Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS).
DNA methylation has essential roles in transcriptional regulation, imprinting, X chromosome inactivation and other cellular processes, and aberrant CpG methylation is directly involved in the pathogenesis of human imprinting disorders and many cancers. To address the need for a quantitative and highly multiplexed bisulfite sequencing method with long read lengths for targeted CpG methylation analysis, we developed single-molecule real-time bisulfite sequencing (SMRT-BS).Optimized bisulfite conversion and PCR conditions enabled the amplification of DNA fragments up to ~1.5 kb, and subjecting overlapping 625-1491 bp amplicons to SMRT-BS indicated high reproducibility across all amplicon lengths (r?=?0.972) and low standard deviations (=0.10) between individual CpG sites sequenced in triplicate. Higher variability in CpG methylation quantitation was correlated with reduced sequencing depth, particularly for intermediately methylated regions. SMRT-BS was validated by orthogonal bisulfite-based microarray (r?=?0.906; 42 CpG sites) and second generation sequencing (r?=?0.933; 174 CpG sites); however, longer SMRT-BS amplicons (>1.0 kb) had reduced, but very acceptable, correlation with both orthogonal methods (r?=?0.836-0.897 and r?=?0.892-0.927, respectively) compared to amplicons less than ~1.0 kb (r?=?0.940-0.951 and r?=?0.948-0.963, respectively). Multiplexing utility was assessed by simultaneously subjecting four distinct CpG island amplicons (702-866 bp; 325 CpGs) and 30 hematological malignancy cell lines to SMRT-BS (average depth of 110X), which identified a spectrum of highly quantitative methylation levels across all interrogated CpG sites and cell lines.SMRT-BS is a novel, accurate and cost-effective targeted CpG methylation method that is amenable to a high degree of multiplexing with minimal clonal PCR artifacts. Increased sequencing depth is necessary when interrogating longer amplicons (>1.0 kb) and the previously reported bisulfite sequencing PCR bias towards unmethylated DNA should be considered when measuring intermediately methylated regions. Coupled with an optimized bisulfite PCR protocol, SMRT-BS is capable of interrogating ~1.5 kb amplicons, which theoretically can cover ~91% of CpG islands in the human genome.
CGGBP1 is a repetitive DNA-binding transcription regulator with target sites at CpG-rich sequences such as CGG repeats and Alu-SINEs and L1-LINEs. The role of CGGBP1 as a possible mediator of CpG methylation however remains unknown. At CpG-rich sequences cytosine methylation is a major mechanism of transcriptional repression. Concordantly, gene-rich regions typically carry lower levels of CpG methylation than the repetitive elements. It is well known that at interspersed repeats Alu-SINEs and L1-LINEs high levels of CpG methylation constitute a transcriptional silencing and retrotransposon inactivating mechanism.Here, we have studied genome-wide CpG methylation with or without CGGBP1-depletion. By high throughput sequencing of bisulfite-treated genomic DNA we have identified CGGBP1 to be a negative regulator of CpG methylation at repetitive DNA sequences. In addition, we have studied CpG methylation alterations on Alu and L1 retrotransposons in CGGBP1-depleted cells using a novel bisulfite-treatment and high throughput sequencing approach.The results clearly show that CGGBP1 is a possible bidirectional regulator of CpG methylation at Alus, and acts as a repressor of methylation at L1 retrotransposons.
Determining the methylation state of regions with high copy numbers is challenging for second-generation sequencing, because the read length is insufficient to map reads uniquely, especially when repetitive regions are long and nearly identical to each other. Single-molecule real-time (SMRT) sequencing is a promising method for observing such regions, because it is not vulnerable to GC bias, it produces long read lengths, and its kinetic information is sensitive to DNA modifications.We propose a novel linear-time algorithm that combines the kinetic information for neighboring CpG sites and increases the confidence in identifying the methylation states of those sites. Using a practical read coverage of ~30-fold from an inbred strain medaka (Oryzias latipes), we observed that both the sensitivity and precision of our method on individual CpG sites were ~93.7%. We also observed a high correlation coefficient (R?=?0.884) between our method and bisulfite sequencing, and for 92.0% of CpG sites, methylation levels ranging over [0, 1] were in concordance within an acceptable difference 0.25. Using this method, we characterized the landscape of the methylation status of repetitive elements, such as LINEs, in the human genome, thereby revealing the strong correlation between CpG density and hypomethylation and detecting hypomethylation hot spots of LTRs and LINEs. We uncovered the methylation states for nearly identical active transposons, two novel LINE insertions of identity ~99% and length 6050 base pairs (bp) in the human genome, and 16 Tol2 elements of identity >99.8% and length 4682?bp in the medaka genome.AgIn (Aggregate on Intervals) is available at: https://github.com/hacone/AgIn CONTACT: email@example.com, firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. © The Author(s) 2016. Published by Oxford University Press.
For the past two decades, bisulfite sequencing has been a widely used method for quantitative CpG methylation detection of genomic DNA. Coupled with PCR amplicon cloning, bisulfite Sanger sequencing allows for allele-specific CpG methylation assessment; however, its time-consuming protocol and inability to multiplex has recently been overcome by next-generation bisulfite sequencing techniques. Although high-throughput sequencing platforms have enabled greater accuracy in CpG methylation quantitation as a result of increased bisulfite sequencing depth, most common sequencing platforms generate reads that are similar in length to the typical bisulfite PCR size range (~300-500 bp). Using the Pacific Biosciences (PacBio) sequencing platform, we developed single molecule real-time bisulfite sequencing (SMRT-BS), which is an accurate targeted CpG methylation analysis method capable of a high degree of multiplexing and long read lengths. SMRT-BS is reproducible and was found to be concordant with other lower throughput quantitative CpG methylation methods. Moreover, the ability to sequence up to ~1.5-2.0 kb amplicons, when coupled with an optimized bisulfite-conversion protocol, allows for more thorough assessment of CpG islands and increases the capacity for studying the relationship between single nucleotide variants and allele-specific CpG methylation.