Alleles of the FMR1 gene with more than 200 CGG repeats generally undergo methylation-coupled gene silencing, resulting in fragile X syndrome, the leading heritable form of cognitive impairment. Smaller expansions (55-200 CGG repeats) result in elevated levels of FMR1 mRNA, which is directly responsible for the late-onset neurodegenerative disorder, fragile X-associated tremor/ataxia syndrome (FXTAS). For mechanistic studies and genetic counseling, it is important to know the number of CGG repeats precisely; however, no existing DNA sequencing method is capable of sequencing through more than ~100 CGG repeats, which limits the ability to precisely characterize the disease-causing alleles. Single molecule, real-time sequencing is a recently developed approach to DNA sequencing that couples the intrinsic processivity of DNA polymerase with the ability to read polymerase activity on a single-molecule basis. Further, the accuracy of the method is improved through the use of circular templates, such that each molecule can be read multiple times to produce a circular consensus sequence (CCS). We have succeeded in generating CCS reads representing multiple passes through both strands of repeat tracts exceeding 700 CGGs (>2 kb of 100 percent CG) flanked by native FMR1 sequence, with single-molecule read lengths exceeding 12 kb. This sequencing approach thus enables us to fully characterize the previously intractable CGG-repeat sequence, leading to a better understanding of the distinct associated molecular pathologies. Real-time kinetic data also provide insight into the activity of DNA polymerase inside this unique sequence. The methodology should be widely applicable for studies of the molecular pathogenesis of an increasing number of repeat expansion-associated neurodegenerative and neurodevelopmental disorders, and for the efficient identification of such disorders in the clinical setting.
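As a rough illustration of why an exact repeat count matters, the sketch below counts the longest uninterrupted CGG run in a read and places it in the size ranges quoted in the abstract (55-200 repeats, and more than 200 repeats). The function names and category labels are hypothetical and are not part of any PacBio software; a real analysis would also have to handle sequencing errors and interruptions within the repeat tract, which would split the run counted here.

```python
import re


def longest_cgg_run(sequence: str) -> int:
    """Return the number of CGG units in the longest uninterrupted CGG run."""
    # Find every maximal stretch of back-to-back CGG triplets.
    runs = re.findall(r"(?:CGG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_allele(repeat_count: int) -> str:
    """Map a repeat count onto the size ranges given in the abstract."""
    if repeat_count > 200:
        return "full mutation (>200 repeats)"
    if repeat_count >= 55:
        return "premutation (55-200 repeats)"
    return "below the premutation range (<55 repeats)"


if __name__ == "__main__":
    # Toy read: made-up flanking sequence around a 60-repeat tract.
    read = "ACGT" + "CGG" * 60 + "TTGA"
    count = longest_cgg_run(read)
    print(count, classify_allele(count))
```

Counting only uninterrupted runs is a deliberate simplification; it keeps the example short but would undercount any tract containing a non-CGG triplet.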
Sequencing and de novo assembly of the 17q21.31 disease associated region using long reads generated by Pacific Biosciences SMRT Sequencing technology.
Assessment of genome-wide variation revealed regions of the genome with complex, structurally diverse haplotypes that are insufficiently represented in the human reference genome. The 17q21.31 region is one of the most dynamic and complex regions of the human genome. Different haplotypes exist, in direct and inverted orientation, showing evidence of positive selection and predisposing to microdeletion associated with mental retardation. Sequencing of different haplotypes is extremely important to characterize the spectrum of structural variation at this locus. However, de novo assembly with second-generation sequencing reads is still problematic. Using PacBio technology, we have sequenced and de novo assembled a tiling path of eight BAC clones (~1.6 Mb region) across this medically relevant region from the library of a hydatidiform mole. Complete hydatidiform moles arise from the fertilization of an enucleated egg by a single sperm and therefore carry a haploid complement of the human genome, eliminating allelic variation that may confound mapping and assembly. The PacBio RS system enables single molecule, real-time sequencing, featuring long reads and fast turnaround times. With deep sequencing, the PacBio reads gave very uniform coverage, spanning close to 100% of most of the target intervals. Because of the long read lengths, the PacBio RS data could be accurately assembled.
In today’s clinical diagnostic laboratories, the detection of disease-causing mutations is done through either genotyping or Sanger sequencing. Whether done singly or in a multiplex assay, genotyping works only if the exact molecular change is known. Sanger sequencing is the gold standard method that captures both known and novel molecular changes in the disease gene of interest. Most clinical Sanger sequencing assays involve PCR-amplifying the coding sequences of the disease target gene followed by bi-directional sequencing of the amplified products. Therefore, for every patient sample, one generates multiple amplicons singly, and each amplicon leads to two separate sequencing reactions. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over Sanger sequencing, including long read lengths, first-in-first-out processing, fast time to result, high levels of multiplexing, and substantially reduced costs. For our first proof-of-concept experiment, we queried 3 known disease-associated mutations in de-identified clinical samples. We started with 3 autosomal recessive diseases found at an increased frequency in the Ashkenazi Jewish population: Tay-Sachs disease, Niemann-Pick disease, and Canavan disease. The mutated gene in Tay-Sachs is HEXA, in Niemann-Pick is SMPD1, and in Canavan is ASPA. Coding exons were amplified in multiple (6-13) amplicons for each gene from both non-carriers and carriers. Amplicons were purified, concentrations normalized, and combined prior to SMRTbell™ library prep. A single SMRTbell library was sequenced for each gene from each patient using standard Pacific Biosciences C2 chemistry and protocols. Average read lengths of 4,000 bp across samples allowed for high-quality Circular Consensus Sequences (CCS) across all amplicons (all less than 1 kb). This high-quality CCS data permitted the clean partitioning of reads from a patient in the presence of heterozygous events.
Using non-carrier sequencing as a control, we were able to correctly identify the known events in carrier genes. This suggests the potential utility of SMRT sequencing in a clinical setting, enabling a cost-effective method of replacing targeted mutation detection with sequencing of the entire gene.
The long read lengths of PacBio’s SMRT Sequencing enable detection of linked mutations across multiple kilobases of sequence. This feature is particularly useful in the context of protein engineering, where large numbers of similar constructs are generated routinely to explore the effects of mutations on function and stability. We have developed a PCR-based barcoded sequencing method to generate high quality, full-length sequence data for batches of constructs generated in a common backbone. Individual barcodes are coupled to primers targeting a common region of the vector of interest. The amplified products are pooled into a single DNA library, and sequencing data are clustered by barcode to generate multi-molecule consensus sequences for each construct present in the pool. As a proof-of-concept dataset, we have generated a library of 384 randomly mutated variants of the Phi29 DNA polymerase, a 575 amino acid protein encoded by a 1.7 kb gene. These variants were amplified with a set of barcoded primers, and the resulting library was sequenced on a single SMRT Cell. The data produced sequences that were completely concordant with independent Sanger sequencing, for a 100% accurate reconstruction of the set of clones.
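The clustering step described above can be sketched as follows: reads are grouped by the barcode at their 5' end, and a per-construct consensus is taken by majority vote at each position. This is a minimal sketch under strong assumptions, not the published workflow: the function names, barcode sequences, and clone names are hypothetical, barcodes are matched exactly, and the reads in each group are assumed to be already aligned and of equal length (a real pipeline would align reads and tolerate barcode errors).

```python
from collections import Counter, defaultdict


def demultiplex(reads, barcodes):
    """Group reads by the barcode found at the start of each read.

    `barcodes` maps a construct name to its barcode sequence. The barcode
    is stripped from each read; reads with no exact match are dropped.
    """
    groups = defaultdict(list)
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                groups[name].append(read[len(barcode):])
                break
    return dict(groups)


def majority_consensus(reads):
    """Column-wise majority vote over pre-aligned, equal-length reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


if __name__ == "__main__":
    # Hypothetical barcodes and toy reads, for illustration only.
    barcodes = {"clone_A": "ACAC", "clone_B": "GTGT"}
    reads = [
        "ACACTTGACC",
        "ACACTTGACC",
        "ACACTTCACC",  # one read carries a sequencing error
        "GTGTTTGAGC",
        "GTGTTTGAGC",
    ]
    for name, group in demultiplex(reads, barcodes).items():
        print(name, majority_consensus(group))
```

Pooling several reads per barcode is what lets a single erroneous read be outvoted, which is the idea behind the multi-molecule consensus mentioned in the abstract.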
De novo assembly of a complex panicoid grass genome using ultra-long PacBio reads with P6C4 chemistry
Drought is responsible for much of the global losses in crop yields, and understanding how plants naturally cope with drought stress is essential for breeding and engineering crops for the changing climate. Resurrection plants desiccate to complete dryness during times of drought, then “come back to life” once water is available, making them an excellent model for studying drought tolerance. Understanding the molecular networks governing how resurrection plants handle desiccation will provide targets for crop engineering. Oropetium thomaeum (Oro) is a resurrection plant that also has the smallest known grass genome at 250 Mb, compared to Brachypodium distachyon (300 Mb) and rice (350 Mb). Plant genomes, especially those of grasses, have complex repeat structures such as telomeres, centromeres, and ribosomal gene cassettes, and high heterozygosity, which make them difficult to assemble using short-read next-generation sequencing technologies. We used the new P6C4 chemistry and the latest 15 kb BluePippin size-selection protocol to generate 20 kb insert libraries, which yielded ultra-long PacBio reads with an average read length of 12 kb, providing ~72X coverage overall and 10X coverage with reads over 20 kb. The HGAP assembly covers 98% of the genome with a contig N50 of 2.4 Mb, which makes it one of the highest quality and most complete plant genomes assembled to date. Oro has a compact genome structure compared to other grasses, with only 16% repeat sequences, but has very good collinearity with other grasses. Understanding the genomic mechanisms of extreme desiccation tolerance in resurrection plants like Oro will provide insights for engineering and intelligent breeding of improved food, fuel, and fiber crops.
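For readers unfamiliar with the contig N50 statistic quoted above, the sketch below shows the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and the toy contig lengths are illustrative only and are not taken from the Oropetium assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy assembly of eight contigs totalling 32 units.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is weighted by length, a few very long contigs dominate it, which is why long reads that span repeats tend to raise it sharply.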
Whole genome sequencing can provide comprehensive information important for determining the biochemical and genetic nature of all elements inside a genome. The high-quality genome references produced from past genome projects and advances in short-read sequencing technologies have enabled quick and cheap analysis for simple variants. However, even with the focus on genome-wide resequencing for SNPs, the heritability of more than 50% of human diseases remains elusive. For non-human organisms, high-contiguity references are lacking, which limits the analysis of genomic features. The long and unbiased reads from Single Molecule, Real-Time (SMRT) Sequencing and new de novo assembly approaches have demonstrated the ability to detect more complicated variants and to achieve chromosome-level phasing. Moreover, with recent advances in bioinformatics algorithms and tools, the computational tasks for completing high-quality de novo assembly of large genomes have become feasible with commodity hardware. Ongoing development in sequencing technologies and bioinformatics will likely lead to routine generation of high-quality reference assemblies in the future. We discuss the current state of the art and the challenges in bioinformatics toward such a goal. More specifically, we discuss explicit examples of pragmatic computational requirements for assembling mammalian-size genomes and algorithms suitable for processing diploid genomes.
We have developed barcoding reagents and workflows for multiplexing amplicons or fragmented native genomic DNA prior to Single Molecule, Real-Time (SMRT) Sequencing. The long reads of PacBio’s SMRT Sequencing enable detection of linked mutations across multiple kilobases (kb) of sequence. This feature is particularly useful in the context of mutational analysis or SNP confirmation, where a large number of samples are generated routinely. To validate this workflow, a set of 384 1.7-kb amplicons, each derived from a variant of the Phi29 DNA polymerase gene, was barcoded during amplification, pooled, and sequenced on a single SMRT Cell. To demonstrate the applicability of the method to longer inserts, a library of 96 5-kb clones derived from the E. coli genome was sequenced.
2015 SMRT Informatics Developers Conference Presentation Slides: Adam English, from the Human Genome Sequencing Center at Baylor College of Medicine presents on the structural variation tools being developed at Baylor.
2015 SMRT Informatics Developers Conference Presentation Slides: Ali Bashir of Mount Sinai School of Medicine discussed methods for characterizing structural variation in human genomes across a variety of coverage levels.
Making the most of long reads: towards efficient assemblers for reference quality, de novo reconstructions
2015 SMRT Informatics Developers Conference Presentation Slides: Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute delivered the keynote presentation. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.
Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high-quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy-aware assembler, FALCON (https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (“FALCON-unzip”) for de novo haplotype reconstructions from SMRT Sequencing data. We generated two datasets for developing the algorithms and the prototype software: (1) whole genome sequencing data from a highly repetitive diploid fungus (Clavicorona pyxidata) and (2) whole genome sequencing data from an F1 hybrid of two inbred Arabidopsis strains, Cvi-0 and Col-0. For the fungal genome, from a 95X dataset with a read length N50 of ~16 kb, we achieved an N50 of 1.53 Mb for the 1n assembly contigs of the ~42 Mb 1n genome and an N50 of 872 kb for the haplotigs (haplotype-specific contigs). We found that ~45% of the genome was highly heterozygous and ~55% of the genome was highly homozygous. We developed methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from the Illumina® platform. For the Arabidopsis F1 hybrid genome, we found that 80% of the genome could be separated into haplotigs. The long-range accuracy of phasing haplotigs was evaluated by comparing them to the assemblies from the two inbred parental lines. We show that a more complete view of all haplotypes could provide useful biological insights through improved annotation, characterization of heterozygous variants of all sizes, and resolution of differential allele expression.
The current FALCON-unzip method will help us understand how to solve more difficult polyploid genome assembly problems and how to improve the computational efficiency of large genome assemblies. Based on this work, we can develop a pipeline that enables routine assembly of diploid or polyploid genomes as haplotigs, representing a comprehensive view of the genomes that can be studied with the information at hand.
A high-quality reference genome is an essential resource for plant and animal breeding and for functional and evolutionary studies. The common hop (Humulus lupulus, Cannabaceae) is an economically important crop plant used to flavor and preserve beer. Its genome is large (flow cytometry-based estimates of diploid length >5.4 Gb¹) and highly repetitive, and individual plants display high levels of heterozygosity, all of which make assembly of an accurate and contiguous reference genome challenging with conventional short-read methods. We present a contig assembly of Cascade Hops using PacBio long reads and the diploid genome assembler FALCON-Unzip². The assembly has dramatically improved contiguity and completeness over earlier short-read assemblies. The genome is primarily assembled as haplotypes due to the outbred nature of the organism. We explore patterns of haplotype divergence across the assembly and present strategies to deduplicate haplotypes prior to scaffolding.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA -DR, -DQ, and -DP play a critical role in the immune system as primary factors responsible for…
PacBio SMRT Sequencing is fast changing the genomics space with its long reads and high consensus sequence accuracy, providing the most comprehensive view of the genome and transcriptome. In this…
This webinar, presented by Nisha Pillai, provides an overview of amplicon sequencing to target specific regions of a genome using PacBio Single Molecule, Real-Time (SMRT) Sequencing. This session provides an…