Alleles of the FMR1 gene with more than 200 CGG repeats generally undergo methylation-coupled gene silencing, resulting in fragile X syndrome, the leading heritable form of cognitive impairment. Smaller expansions (55-200 CGG repeats) result in elevated levels of FMR1 mRNA, which is directly responsible for the late-onset neurodegenerative disorder, fragile X-associated tremor/ataxia syndrome (FXTAS). For mechanistic studies and genetic counseling, it is important to know with precision the number of CGG repeats; however, no existing DNA sequencing method is capable of sequencing through more than ~100 CGG repeats, thus limiting the ability to precisely characterize the disease-causing alleles. The recent development of single molecule, real-time sequencing represents a novel approach to DNA sequencing that couples the intrinsic processivity of DNA polymerase with the ability to read polymerase activity on a single-molecule basis. Further, the accuracy of the method is improved through the use of circular templates, such that each molecule can be read multiple times to produce a circular consensus sequence (CCS). We have succeeded in generating CCS reads representing multiple passes through both strands of repeat tracts exceeding 700 CGGs (>2 kb of 100 percent CG) flanked by native FMR1 sequence, with single-molecule readlengths exceeding 12 kb. This sequencing approach thus enables us to fully characterize the previously intractable CGG-repeat sequence, leading to a better understanding of the distinct associated molecular pathologies. Real-time kinetic data also provides insight into the activity of DNA polymerase inside this unique sequence. The methodology should be widely applicable for studies of the molecular pathogenesis of an increasing number of repeat expansion-associated neurodegenerative and neurodevelopmental disorders, and for the efficient identification of such disorders in the clinical setting.
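As an illustration of the repeat-size ranges described above, the following is a minimal sketch that counts the longest uninterrupted CGG run in a read and assigns it to a size range. The function names are hypothetical, and the approach is an assumption for illustration, not the analysis used in this work. Real alleles may contain interruptions (for example AGG units), which this simple count would split into separate runs.

```python
import re


def longest_cgg_tract(seq: str) -> int:
    """Return the number of CGG units in the longest uninterrupted CGG run."""
    runs = re.findall(r"(?:CGG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_fmr1(repeats: int) -> str:
    """Map a repeat count to the size ranges quoted in the abstract."""
    if repeats > 200:
        return "full mutation"        # >200 CGG repeats
    if repeats >= 55:
        return "premutation"          # 55-200 CGG repeats
    return "below premutation range"


if __name__ == "__main__":
    read = "AAA" + "CGG" * 60 + "TTT"   # synthetic read, not real data
    n = longest_cgg_tract(read)
    print(n, classify_fmr1(n))
```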
Evaluating the potential of new sequencing technologies for genotyping and variation discovery in human data.
A first look at Pacific Biosciences RS data.
Pacific Biosciences technology provides a fundamentally new data type that has the potential to overcome these limitations by providing significantly longer reads (now averaging >1 kb), enabling more unique seeds for reference alignment. In addition, the lack of amplification in the library construction step avoids a common source of base composition bias. With these potential advantages in mind, we evaluate the utility of the Pacific Biosciences RS platform for human medical resequencing projects by assessing the quality of the raw sequencing data, as well as its use for SNP discovery and genotyping with the Genome Analysis Toolkit (GATK).
Single-Molecule Real-Time (SMRT) DNA sequencing is unique in that nucleotide incorporation events are monitored in real time, yielding a wealth of kinetic information in addition to the primary DNA sequence. The observed dynamics of the DNA polymerase add a further dimension of sequence-dependent information and can be used to learn more about the molecule under study. First, the primary sequence itself can be determined more accurately. The kinetic data can be used to corroborate or overturn consensus calls and even enable calling bases in problematic sequence contexts. Second, using the kinetic information, we can detect and discriminate numerous chemical base modifications as a by-product of ordinary sequencing. Examples of applying these capabilities include (i) the characterization of the epigenome of microorganisms by directly sequencing the three common prokaryotic epigenetic base modifications of 4-methylcytosine, 5-methylcytosine and 6-methyladenine; (ii) the characterization of known and novel methyltransferase activities; (iii) the direct sequencing and differentiation of the four eukaryotic epigenetic forms of cytosine (5-methyl, 5-hydroxymethyl, 5-formyl, and 5-carboxylcytosine), with first applications to map them with single base-pair and DNA strand resolution across mammalian genomes; (iv) the direct sequencing and identification of numerous modified DNA bases arising from DNA damage; and (v) an exploration of the mitochondrial genome for known and novel base modifications. We will show our progress towards a generic, open-source algorithm for exploiting kinetic information for any of these purposes.
DNA is under constant stress from both endogenous and exogenous sources. DNA base modifications resulting from various types of DNA damage are widespread and play important roles in affecting physiological states and disease phenotypes. Examples include oxidative damage (8-oxoguanine, 8-oxoadenine; aging, Alzheimer’s, Parkinson’s), alkylation (1-methyladenine, 6-O-methylguanine; cancer), adduct formation (benzo[a]pyrene diol epoxide (BPDE), pyrimidine dimers; smoking, industrial chemical exposure, chemical UV light exposure, cancer), and ionizing radiation damage (5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil; cancer). Currently, these and other products of DNA damage cannot be sequenced with existing sequencing methods. In contrast, single molecule, real-time (SMRT) DNA sequencing can report on modified DNA bases through an analysis of the DNA polymerase kinetics, which are affected by a modified base in the template. We demonstrate the DNA strand-resolved sequencing of more than eight different DNA damage-associated base modifications, with base-pair resolution and single-DNA-molecule sensitivity. We also report on the application of this sequencing capability to biological samples and the development of a generic, open-source algorithm to analyze kinetic information from SMRT sequencing.
Advances in sequence consensus and clustering algorithms for effective de novo assembly and haplotyping applications.
One of the major applications of DNA sequencing technology is to bring together information that is distant in sequence space so that understanding genome structure and function becomes easier on a large scale. The Single Molecule, Real-Time (SMRT) Sequencing platform provides direct sequencing data that can span several thousand to tens of thousands of bases in a high-throughput fashion. In contrast to solving genomic puzzles by patching together smaller pieces of information, long sequence reads can decrease computational complexity by significantly reducing combinatorial factors. We demonstrate algorithmic approaches to construct accurate consensus sequences when the differences between reads are dominated by insertions and deletions. High-performance implementations of such algorithms allow more efficient de novo assembly with a pre-assembly step that generates highly accurate, consensus-based reads, which can be used as input for existing genome assemblers. In contrast to recent hybrid assembly approaches, only a single ~10 kb or longer SMRTbell library is necessary for the hierarchical genome assembly process (HGAP). Meanwhile, by combining a sensitive read-clustering algorithm with the consensus algorithms, one is able to discern haplotypes that differ from each other by less than 1% over a large region. One related application is to generate accurate haplotype sequences for HLA loci. Long sequence reads that can cover the whole 3 kb to 4 kb diploid genomic regions will simplify the haplotyping process. These algorithms can also be applied to resolve individual populations within mixed pools of DNA molecules that are similar to each other, e.g., by sequencing viral quasi-species samples.
Isoform sequencing: Unveiling the complex landscape of the eukaryotic transcriptome on the PacBio RS II.
Alternative splicing of RNA is an important mechanism that increases protein diversity and is pervasive in the most complex biological functions. While advances in RNA sequencing methods have accelerated our understanding of the transcriptome, isoform discovery remains computationally challenging due to short read lengths. Here, we describe the Isoform Sequencing (Iso-Seq) method using long reads generated by the PacBio RS II. We sequenced rat heart and lung RNA using the Clontech® SMARTer® cDNA preparation kit followed by size selection using agarose gel. Additionally, we tested the BluePippin™ device from Sage Science for efficiently extracting longer transcripts ≥ 3 kb. Post-sequencing, we developed a novel isoform-level clustering algorithm to generate high-quality transcript consensus sequences. We show that our method recovered alternative splice forms as well as alternative stop sites, antisense transcription, and retained introns. To conclude, the Iso-Seq method provides a new opportunity for researchers to study the complex eukaryotic transcriptome even in the absence of reference genomes or annotated transcripts.
Integrative biology of a fungus: Using PacBio SMRT Sequencing to interrogate the genome, epigenome, and transcriptome of Neurospora crassa.
PacBio SMRT Sequencing has the unique ability to directly detect base modifications in addition to the nucleotide sequence of DNA. Because eukaryotes use base modifications to regulate gene expression, the absence or presence of epigenetic events relative to the location of genes is critical to elucidate the function of the modification. Therefore, an integrated approach that combines multiple omic-scale assays is necessary to study complex organisms. Here, we present an integrated analysis of three sequencing experiments: 1) DNA sequencing, 2) base-modification detection, and 3) Iso-Seq analysis, in Neurospora crassa, a filamentous fungus that has been used to make many landmark discoveries in biochemistry and genetics. We show that de novo assembly of a new strain yields complete assemblies of entire chromosomes, including entire centromeric sequences. Base-modification analyses reveal candidate sites of increased interpulse duration (IPD) ratio that may signify regions of 5mC, 5hmC, or 6mA base modifications. The Iso-Seq method provides full-length transcript evidence for comprehensive gene annotation, as well as context for the base modifications in the newly assembled genome. Projects that integrate multiple genome-wide assays could become common practice for identifying genomic elements and understanding their function in new strains and organisms.
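The idea of flagging candidate sites by an increased IPD ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the detection algorithm used here: it assumes per-position IPD observations are available for the native sample and for an unmodified control, and the threshold of 2.0 and all function names are hypothetical.

```python
from statistics import mean


def ipd_ratios(native, control):
    """Compute mean(native IPD) / mean(control IPD) for each shared position.

    native, control: dict mapping template position -> list of IPDs (seconds).
    Positions missing from either input, or with a zero control mean, are skipped.
    """
    ratios = {}
    for pos, observed in native.items():
        reference = control.get(pos)
        if not observed or not reference:
            continue
        ref_mean = mean(reference)
        if ref_mean > 0:
            ratios[pos] = mean(observed) / ref_mean
    return ratios


def candidate_sites(ratios, threshold=2.0):
    """Return sorted positions whose IPD ratio meets the (assumed) threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


if __name__ == "__main__":
    # Toy values for illustration only, not measured data.
    native = {1: [0.2, 0.2], 2: [1.0, 1.0]}
    control = {1: [0.2, 0.2], 2: [0.25, 0.25]}
    print(candidate_sites(ipd_ratios(native, control)))
```

A production analysis would typically use a statistical test per position and strand rather than a fixed ratio cut-off.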
A novel analytical pipeline for de novo haplotype phasing and amplicon analysis using SMRT Sequencing technology.
While the identification of individual SNPs has been readily available for some time, the ability to accurately phase SNPs and structural variation across a haplotype has been a challenge. With an average read length of 9 kb (P5-C3), and individual reads beyond 30 kb in length, SMRT Sequencing technology allows the identification of mutation combinations such as microdeletions, insertions, and substitutions without any predetermined reference sequence. Long-amplicon analysis is a novel protocol that identifies and reports the abundance of differing clusters of sequencing reads within a single library. Graphs generated via hierarchical clustering of individual sequencing reads are used to generate Markov models representing the consensus sequence of individual clusters found to be significantly different. Long-amplicon analysis is capable of differentiating between underlying sequences that are 99.9% similar, which is suitable for haplotyping and differentiating pseudogenes from coding transcripts. This protocol allows for the identification of structural variation in the MUC5AC gene sequence, despite the presence of a gap in the current genome assembly, and can also be used for HLA haplotyping. Clustering can also be applied to identify full-length transcripts for the purpose of estimating consensus sequences and enumerating isoform types. Long-amplicon analysis allows for the elucidation of complex regions otherwise missed by other sequencing technologies, which may contribute to the diagnosis and understanding of otherwise complex diseases.
Isoform sequencing: Unveiling the complex landscape in eukaryotic transcriptome on the PacBio RS II.
Advances in RNA sequencing have accelerated our understanding of the transcriptome; however, isoform discovery remains challenging due to short read lengths. The Iso-Seq Application provides a new alternative to sequence full-length cDNA libraries using long reads from the PacBio RS II. Identification of long and often rare isoforms is demonstrated with rat heart and lung RNA prepared using the Clontech® SMARTer® cDNA preparation kit, followed by agarose-gel size selection in fractions of 1-2 kb, 2-3 kb and 3-6 kb. For heart and lung, 1.8 and 1.2 million reads were obtained from 32 and 26 SMRT Cells, respectively. Filtering for reads with both adapters and polyA tail signals yielded >50% putative full-length transcripts. To improve consensus accuracy, we developed an isoform-level clustering algorithm, ICE (Iterative Clustering for Error Correction), and polished full-length consensus sequences from ICE using Quiver. This method generated full-length transcripts up to 4.5 kb with ≥ 99% post-correction accuracy. Compared with known rat genes, the Iso-Seq method not only recovered the majority of currently annotated isoforms, but also several unannotated novel isoforms with identified homologs in the RefSeq database. Additionally, alternative stop sites, extended UTRs, and retained introns were detected.
Third-generation single-molecule sequencing technologies from Pacific Biosciences, Moleculo, Oxford Nanopore, and other companies are revolutionizing genomics by enabling the sequencing of long, individual molecules of DNA and RNA. One major advantage of these technologies over current short-read sequencing is the ability to sequence much longer molecules, thousands or tens of thousands of nucleotides instead of mere hundreds. This capacity gives researchers substantially greater power to probe microbial, plant, and animal genomes, but it remains unknown how best to use these data. To address this, we systematically evaluated the human genome and 25 other important genomes across the tree of life, ranging in size from 1 Mbp to 3 Gbp, to determine how long the reads need to be and how much coverage is necessary to completely assemble their chromosomes with single-molecule sequencing. We also present a novel error correction and assembly algorithm using a combination of PacBio and pre-assembled Illumina sequencing. This new algorithm greatly outperforms other published hybrid algorithms.
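For orientation on the read-length and coverage question, the following minimal sketch computes average coverage and the idealized Lander-Waterman estimate of the fraction of a genome left uncovered. It is a textbook approximation that assumes uniformly random read placement and ignores repeats and sequencing errors; it is not the evaluation method used in this study, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, mean_read_length: float, genome_size: float) -> float:
    """Average depth of coverage: c = N * L / G."""
    return num_reads * mean_read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized Lander-Waterman estimate of the uncovered fraction: exp(-c)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 reads of 10 kb on a 1 Mbp genome.
    c = mean_coverage(1_000, 10_000, 1_000_000)
    print(c, expected_uncovered_fraction(c))
```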
An interactive workflow for the analysis of contigs from the metagenomic shotgun assembly of SMRT Sequencing data.
The data throughput of next-generation sequencing allows whole microbial communities to be analyzed using a shotgun sequencing approach. Because a key task in taking advantage of these data is the ability to cluster reads that belong to the same member of a community, single-molecule long reads of up to 30 kb from SMRT Sequencing provide a unique capability in identifying those relationships and pave the way towards finished assemblies of community members. Long reads become even more valuable as samples get more complex, with lower inter-species variation, a larger number of closely related species, or high intra-species variation. Here we present a collection of tools tailored for PacBio data for the analysis of these fragmented metagenomic assemblies, allowing improvements in the assembly results and greater insight into the communities themselves. Supervised classification is applied to a large set of sequence characteristics, e.g., GC content, raw-read coverage, k-mer frequency, and gene prediction information, allowing the clustering of contigs from single or highly related species. A unique feature of SMRT Sequencing data is the availability of base modification / methylation information, which can be used to further analyze clustered contigs expected to derive from single or very closely related species. Here we show that base modification information can be used to further study variation, based on differences in the methylated DNA motifs involved in the restriction-modification system. Application of these techniques is demonstrated on a monkey intestinal microbiome sample and an in silico mix of real sequencing data from distinct bacterial samples.
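Two of the sequence characteristics named above, GC content and k-mer frequency, can be computed per contig as in the following minimal sketch. The function names and the choice of k = 4 are assumptions for illustration; this is not the workflow's actual feature extraction, and coverage and gene-prediction features are omitted.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_profile(seq: str, k: int = 4) -> list:
    """Normalized k-mer frequencies in a fixed (lexicographic) k-mer order.

    k-mers containing ambiguous bases such as N are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    contig = "ACGTACGTGGCC"  # toy sequence, not real data
    features = [gc_content(contig)] + kmer_profile(contig, k=2)
    print(len(features))
```

Feature vectors of this kind could then be passed to any classifier or clustering method.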
Heterozygous and highly polymorphic diploid (2n) and polyploid (n > 2) genomes have proven to be very difficult to assemble. One key to the successful assembly and phasing of polymorphic genomes is the very long read length (9-40 kb) provided by the PacBio RS II system. We recently released software and methods that facilitate the assembly and phasing of genomes with ploidy levels equal to or greater than 2n. In an effort to collaborate and spur algorithm development for assembly and phasing of heterozygous polymorphic genomes, we have recently released sequencing datasets that can be used to test and develop assembly and phasing algorithms for highly polymorphic diploid and polyploid genomes. These data sets include multiple species and ecotypes of Arabidopsis that can be combined to create synthetic in silico F1 hybrids with varying levels of heterozygosity. Because the sequence of each individual line was generated independently, the data set provides a ‘ground truth’ answer for the expected results, allowing the evaluation of assembly algorithms. The sequencing data, assemblies of inbred and in silico heterozygous samples (n ≥ 2), and phasing statistics will be presented. The raw and processed data have been made available to aid other groups in the development of phasing and assembly algorithms.
Unique haplotype structure determination in human genome using Single Molecule, Real-Time (SMRT) Sequencing of targeted full-length fosmids.
Determination of unique individual haplotypes is an essential first step toward understanding how identical genotypes having different phases lead to different biological interpretations of function, phenotype, and disease. Genome-wide methods for identifying individual genetic variation have been limited in their ability to acquire phased, extended, and complete genomic sequences that are long enough to assemble haplotypes with high confidence. We explore a recombineering approach for isolation and sequencing of a tiling of targeted fosmids to capture regions of interest from the human genome. Each individual fosmid contains a large genomic fragment (~35 kb) that is sequenced with long-read SMRT technology to generate contiguous long reads. These long reads can be easily de novo assembled for targeted haplotype resolution within an individual’s genome. The P5-C3 chemistry for SMRT Sequencing generated contiguous, full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with minimal data processing. The phase preserved in fosmid clones spanned at least two heterozygous variant loci, providing the essential detail of precise haplotype structures. We show complete assembly of haplotypes for various targeted loci, including the complex haplotypes of the KIR locus (~150 to 200 kb) and conserved extended haplotypes (CEHs) of the MHC region. This method is easily applicable to other regions of the human genome, as well as other genomes.
Arabica coffee, revered for its taste and aroma, has a complex genome. It is an allotetraploid (2n=4x=44) with a genome size of approximately 1.3 Gb, derived from the recent (< 0.6 Mya) hybridization of two diploid progenitors (2n=2x=22), C. canephora (710 Mb) and C. eugenioides (670 Mb). The two parental species diverged recently (< 4.2 Mya) and their genomes are highly homologous. To facilitate assembly, a dihaploid plant was chosen for sequencing. Initial genome assembly attempts with short-read data produced an assembly covering 1,031 Mb of the C. arabica genome with a contig L50 of 9 kb. Using long-read PacBio sequencing at greater than 50x coverage and cutting-edge PacBio software, a de novo PacBio-only genome assembly was constructed that covers 1,042 Mb of the genome with an L50 of 267 kb. The two assemblies were assessed and compared to determine gene content, chimeric regions, and the ability to separate the parental genomes. A genetic map that contains 600 SSRs is being used to anchor the contigs and, together with the search for sub-genome-specific SNPs, to improve sub-genome differentiation. PacBio transcriptome sequencing is currently being added to finalize gene annotation of the polished assembly. The finished genome assembly will be used to guide re-sequencing assemblies of the parental genomes (C. canephora and C. eugenioides), as well as serving as a template for GBS analysis and whole-genome re-sequencing of a set of C. arabica accessions representative of the species diversity. The obtained data will provide powerful genomic tools to enable more efficient coffee breeding strategies for this crop, which is highly susceptible to climate change and is the main source of income for millions of small farmers in producing countries.
Significant advances in bioinformatics tool development have been made to more efficiently leverage PacBio long-read data and deliver high-quality genome assemblies. Current SMRT Sequencing delivers average read lengths of 10-15 kb, with the longest reads exceeding 40 kb. This has resulted in consistent demonstration of a minimum 10-fold improvement in genome assemblies, with contig N50 in the megabase range, compared to assemblies generated using only short-read technologies. This poster highlights recent advances and resources available for advanced bioinformaticians and developers interested in the current state-of-the-art large-genome solutions available as open-source code from PacBio and third parties, including HGAP, MHAP, and ECTools. Resources and tools available on GitHub are reviewed, as well as datasets representing major model research organisms made publicly available for community evaluation or interested developers.
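Because contig N50 is the headline metric in these comparisons, a minimal sketch of its usual definition may be helpful: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name is hypothetical and the example lengths are toy values.

```python
def n50(lengths):
    """Return the contig N50 for a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the cumulative sum to half the total.
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))  # toy contig lengths
```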