Transcription activator-like effector nucleases (TALENs) have become a powerful tool for genome editing due to the simple code linking the amino acid sequences of their DNA-binding domains to TALEN nucleotide targets. While the initial TALEN-design guidelines are very useful, user-friendly tools defining optimal TALEN designs for robust genome editing need to be developed. Here we evaluated existing guidelines and developed new design guidelines for TALENs based on 205 TALENs tested, and established the scoring algorithm for predicting TALEN activity (SAPTA) as a new online design tool. For any input gene of interest, SAPTA gives a ranked list of potential TALEN target sites, facilitating the selection of optimal TALEN pairs based on predicted activity. SAPTA-based TALEN designs increased the average intracellular TALEN monomer activity by >3-fold, and resulted in an average endogenous gene-modification frequency of 39% for TALENs containing the repeat variable di-residue NK that favors specificity rather than activity. It is expected that SAPTA will become a useful and flexible tool for designing highly active TALENs for genome-editing applications. SAPTA can be accessed via the website at http://baolab.bme.gatech.edu/Research/BioinformaticTools/TAL_targeter.html.
Tal-effector nucleases (TALENs) are engineered proteins that can stimulate precise genome editing through specific DNA double-strand breaks. Sickle cell disease and ß-thalassemia are common genetic disorders caused by mutations in ß-globin, and we engineered a pair of highly active TALENs that induce modification of 54% of human ß-globin alleles near the site of the sickle mutation. These TALENS stimulate targeted integration of therapeutic, full-length beta-globin cDNA to the endogenous ß-globin locus in 19% of cells prior to selection as quantified by single molecule real-time sequencing. We also developed highly active TALENs to human ?-globin, a pharmacologic target in sickle cell disease therapy. Using the ß-globin and ?-globin TALENs, we generated cell lines that express GFP under the control of the endogenous ß-globin promoter and tdTomato under the control of the endogenous ?-globin promoter. With these fluorescent reporter cell lines, we screened a library of small molecule compounds for their differential effect on the transcriptional activity of the endogenous ß- and ?-globin genes and identified several that preferentially upregulate ?-globin expression.
Although engineered nucleases can efficiently cleave intracellular DNA at desired target sites, major concerns remain on potential ‘off-target’ cleavage that may occur throughout the genome. We developed an online tool: predicted report of genome-wide nuclease off-target sites (PROGNOS) that effectively identifies off-target sites. The initial bioinformatics algorithms in PROGNOS were validated by predicting 44 of 65 previously confirmed off-target sites, and by uncovering a new off-target site for the extensively studied zinc finger nucleases (ZFNs) targeting C-C chemokine receptor type 5. Using PROGNOS, we rapidly interrogated 128 potential off-target sites for newly designed transcription activator-like effector nucleases containing either Asn-Asn (NN) or Asn-Lys (NK) repeat variable di-residues (RVDs) and 3- and 4-finger ZFNs, and validated 13 bona fide off-target sites for these nucleases by DNA sequencing. The PROGNOS algorithms were further refined by incorporating additional features of nuclease-DNA interactions and the newly confirmed off-target sites into the training set, which increased the percentage of bona fide off-target sites found within the top PROGNOS rankings. By identifying potential off-target sites in silico, PROGNOS allows the selection of more specific target sites and aids the identification of bona fide off-target sites, significantly facilitating the design of engineered nucleases for genome editing applications.
Human Embryonic Stem Cells (hESCs) are in vitro derivatives of the inner cell mass of the blastocyst and are characterized by an undifferentiated and pluripotent state that can be perpetuated in time, indefinitely. hESCs provide a unique opportunity to both dissect the molecular mechanisms that are predisposed to the maintenance of pluripotency and model the ability to initiate differentiation and cell commitment within the developing embryo. To fully understand these mechanisms, it is necessary to accurately identify the specific transcriptome of hESCs. Many distinct gene annotation methods, such as cDNA and EST sequencing and RNA-Seq, have been used to identify the transcriptome of hESCs. Lately, we developed a new tool (IDP) to integrate the hybrid sequencing data to characterize a more reliable and comprehensive hESC transcriptome with discoveries of many novel transcripts. Copyright © 2014 The Authors. Published by Elsevier Ltd.. All rights reserved.
Long-read sequencing uncovers transcript features missed by short-read methods.
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without scarifying alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than 3 folds. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
Personal transcriptomes in which all of an individual’s genetic variants (e.g., single nucleotide variants) and transcript isoforms (transcription start sites, splice sites, and polyA sites) are defined and quantified for full-length transcripts are expected to be important for understanding individual biology and disease, but have not been described previously. To obtain such transcriptomes, we sequenced the lymphoblastoid transcriptomes of three family members (GM12878 and the parents GM12891 and GM12892) by using a Pacific Biosciences long-read approach complemented with Illumina 101-bp sequencing and made the following observations. First, we found that reads representing all splice sites of a transcript are evident for most sufficiently expressed genes =3 kb and often for genes longer than that. Second, we added and quantified previously unidentified splicing isoforms to an existing annotation, thus creating the first personalized annotation to our knowledge. Third, we determined SNVs in a de novo manner and connected them to RNA haplotypes, including HLA haplotypes, thereby assigning single full-length RNA molecules to their transcribed allele, and demonstrated Mendelian inheritance of RNA molecules. Fourth, we show how RNA molecules can be linked to personal variants on a one-by-one basis, which allows us to assess differential allelic expression (DAE) and differential allelic isoforms (DAI) from the phased full-length isoform reads. The DAI method is largely independent of the distance between exon and SNV–in contrast to fragmentation-based methods. Overall, in addition to improving eukaryotic transcriptome annotation, these results describe, to our knowledge, the first large-scale and full-length personal transcriptome.
In this study we identified two 3′-coterminal RNA molecules in the pseudorabies virus. The highly abundant short transcript (CTO-S) proved to be encoded between the ul21 and ul22 genes in close vicinity of the replication origin (OriL) of the virus. The less abundant long RNA molecule (CTO-L) is a transcriptional readthrough product of the ul21 gene and overlaps OriL. These polyadenylated RNAs were characterized by ascertaining their nucleotide sequences with the Illumina HiScanSQ and Pacific Biosciences Real-Time (PacBio RSII) sequencing platforms and by analyzing their transcription kinetics through use of multi-time-point Real-Time RT-PCR and the PacBio RSII system. It emerged that transcription of the CTOs is fully dependent on the viral transactivator protein IE180 and CTO-S is not a microRNA precursor. We propose an interaction between the transcription and replication machineries at this genomic location, which might play an important role in the regulation of DNA synthesis.
Circular RNAs (circRNAs) have re-emerged as an interesting RNA species. Using deep RNA profiling in different mouse tissues, we observed that circRNAs were substantially enriched in brain and a disproportionate fraction of them were derived from host genes that encode synaptic proteins. Moreover, on the basis of separate profiling of the RNAs localized in neuronal cell bodies and neuropil, circRNAs were, on average, more enriched in the neuropil than their host gene mRNA isoforms. Using high-resolution in situ hybridization, we visualized circRNA punctae in the dendrites of neurons. Consistent with the idea that circRNAs might regulate synaptic function during development, many circRNAs changed their abundance abruptly at a time corresponding to synaptogenesis. In addition, following a homeostatic downscaling of neuronal activity many circRNAs exhibited substantial up- or downregulation. Together, our data indicate that brain circRNAs are positioned to respond to and regulate synaptic function.
The simultaneous sequencing of a single cell’s genome and transcriptome offers a powerful means to dissect genetic variation and its effect on gene expression. Here we describe G&T-seq, a method for separating and sequencing genomic DNA and full-length mRNA from single cells. By applying G&T-seq to over 220 single cells from mice and humans, we discovered cellular properties that could not be inferred from DNA or RNA sequencing alone.
Single cell genomic study of Dehalococcoidetes species from deep-sea sediments of the Peruvian Margin.
The phylum Chloroflexi is one of the most frequently detected phyla in the subseafloor of the Pacific Ocean margins. Dehalogenating Chloroflexi (Dehalococcoidetes) was originally discovered as the key microorganisms mediating reductive dehalogenation via their key enzymes reductive dehalogenases (Rdh) as sole mode of energy conservation in terrestrial environments. The frequent detection of Dehalococcoidetes-related 16S rRNA and rdh genes in the marine subsurface implies a role for dissimilatory dehalorespiration in this environment; however, the two genes have never been linked to each other. To provide fundamental insights into the metabolism, genomic population structure and evolution of marine subsurface Dehalococcoidetes sp., we analyzed a non-contaminated deep-sea sediment core sample from the Peruvian Margin Ocean Drilling Program (ODP) site 1230, collected 7.3?m below the seafloor by a single cell genomic approach. We present for the first time single cell genomic data on three deep-sea Chloroflexi (Dsc) single cells from a marine subsurface environment. Two of the single cells were considered to be part of a local Dehalococcoidetes population and assembled together into a 1.38-Mb genome, which appears to be at least 85% complete. Despite a high degree of sequence-level similarity between the shared proteins in the Dsc and terrestrial Dehalococcoidetes, no evidence for catabolic reductive dehalogenation was found in Dsc. The genome content is however consistent with a strictly anaerobic organotrophic or lithotrophic lifestyle.
Global RNA studies have become central to understanding biological processes, but methods such as microarrays and short-read sequencing are unable to describe an entire RNA molecule from 5′ to 3′ end. Here we use single-molecule long-read sequencing technology from Pacific Biosciences to sequence the polyadenylated RNA complement of a pooled set of 20 human organs and tissues without the need for fragmentation or amplification. We show that full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5′ ends. For longer RNA molecules more 5′ nucleotides are missing, but complete intron structures are often preserved. In total, we identify ~14,000 spliced GENCODE genes. High-confidence mappings are consistent with GENCODE annotations, but >10% of the alignments represent intron structures that were not previously annotated. As a group, transcripts mapping to unannotated regions have features of long, noncoding RNAs. Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.
Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.
Novel exons and splice variants in the human antibody heavy chain identified by single cell and single molecule sequencing.
Antibody heavy chains contain a variable and a constant region. The constant region of the antibody heavy chain is encoded by multiple groups of exons which define the isotype and therefore many functional characteristics of the antibody. We performed both single B cell RNAseq and long read single molecule sequencing of antibody heavy chain transcripts and were able to identify novel exons for IGHA1 and IGHA2 as well as novel isoforms for IGHM antibody heavy chain.
Neurexins are evolutionarily conserved presynaptic cell-adhesion molecules that are essential for normal synapse formation and synaptic transmission. Indirect evidence has indicated that extensive alternative splicing of neurexin mRNAs may produce hundreds if not thousands of neurexin isoforms, but no direct evidence for such diversity has been available. Here we use unbiased long-read sequencing of full-length neurexin (Nrxn)1a, Nrxn1ß, Nrxn2ß, Nrxn3a, and Nrxn3ß mRNAs to systematically assess how many sites of alternative splicing are used in neurexins with a significant frequency, and whether alternative splicing events at these sites are independent of each other. In sequencing more than 25,000 full-length mRNAs, we identified a novel, abundantly used alternatively spliced exon of Nrxn1a and Nrxn3a (referred to as alternatively spliced sequence 6) that encodes a 9-residue insertion in the flexible hinge region between the fifth LNS (laminin-a, neurexin, sex hormone-binding globulin) domain and the third EGF-like sequence. In addition, we observed several larger-scale events of alternative splicing that deleted multiple domains and were much less frequent than the canonical six sites of alternative splicing in neurexins. All of the six canonical events of alternative splicing appear to be independent of each other, suggesting that neurexins may exhibit an even larger isoform diversity than previously envisioned and comprise thousands of variants. Our data are consistent with the notion that a-neurexins represent extracellular protein-interaction scaffolds in which different LNS and EGF domains mediate distinct interactions that affect diverse functions and are independently regulated by independent events of alternative splicing.