The majority of human genes are alternatively spliced, making it possible for most genes to generate multiple proteins. The process of alternative splicing is highly regulated in a developmental-stage and tissue-specific manner. Perturbations in the regulation of these events can lead to disease in humans. Alternative splicing has been shown to play a role in human cancer, muscular dystrophy, Alzheimer’s, and many other diseases. Understanding these diseases requires knowing the full complement of mRNA isoforms. Microarrays and high-throughput cDNA sequencing have become highly successful tools for studying transcriptomes; however, these technologies provide only small fragments of transcripts, and building complete transcript isoforms has been very challenging. We have developed the Iso-Seq technique, which is capable of sequencing full-length, single-molecule cDNA sequences. The method employs SMRT Sequencing to generate individual molecules with average read lengths of more than 10 kb and some as long as 40 kb. As most transcripts are from 1 to 10 kb, we can sequence through entire RNA molecules, requiring no fragmentation or post-sequencing assembly. Jointly with the sequencing method, we developed a computational pipeline that polishes these full-length transcript sequences into high-quality, non-redundant transcript consensus sequences. Iso-Seq sequencing enables unambiguous identification of alternative splicing events, alternative transcriptional start and poly-A sites, and transcripts from gene fusion events. Knowledge of the complete set of isoforms from a sample of interest is key for accurate quantification of isoform abundance when using any technology for transcriptome studies. Here we characterize the full-length transcriptome of normal human tissues, paired tumor/normal samples from breast cancer, and a brain sample from a patient with Alzheimer’s using deep Iso-Seq sequencing.
We highlight numerous discoveries of novel alternatively spliced isoforms, gene-fusion events, and previously unannotated genes that will improve our understanding of human diseases.
The majority of human genes are alternatively spliced, making it possible for most genes to generate multiple proteins. The process of alternative splicing is highly regulated in a developmental-stage and tissue-specific manner. Perturbations in the regulation of these events can lead to disease in humans (1). Alternative splicing has been shown to play a role in human cancer, muscular dystrophy, Alzheimer’s, and many other diseases. Understanding these diseases requires knowing the full complement of mRNA isoforms. Microarrays and high-throughput cDNA sequencing have become highly successful tools for studying transcriptomes; however, these technologies provide only small fragments of transcripts, and building complete transcript isoforms has been very challenging (2). We have developed a technique, called Iso-Seq sequencing, that is capable of sequencing full-length, single-molecule cDNA sequences. The method employs SMRT Sequencing from PacBio, which can sequence individual molecules with read lengths that average more than 10 kb and can reach as long as 40 kb. As most transcripts are from 1 to 10 kb, we can sequence through entire RNA molecules, requiring no fragmentation or post-sequencing assembly. Jointly with the sequencing method, we developed a computational pipeline that polishes these full-length transcript sequences into high-quality, non-redundant transcript consensus sequences. Iso-Seq sequencing enables unambiguous identification of alternative splicing events, alternative transcriptional start and polyA sites, and transcripts from gene fusion events. Knowledge of the complete set of isoforms from a sample of interest is key for accurate quantification of isoform abundance when using any technology for transcriptome studies (3). Here we characterize the full-length transcriptome of paired tumor/normal samples from breast cancer using deep Iso-Seq sequencing.
We highlight numerous discoveries of novel alternatively spliced isoforms, gene-fusion events, and previously unannotated genes that will improve our understanding of human cancer.
(1) Faustino NA and Cooper TA. Genes and Development. 2003. 17: 419-437.
(2) Steijger T, et al. Nat Methods. 2013 Dec;10(12):1177-84.
(3) Au KF, et al. Proc Natl Acad Sci U S A. 2013 Dec 10;110(50):E4821-30.
During the past decade, the search for pathogenic mutations in rare human genetic diseases has involved huge efforts to sequence coding regions, or the entire genome, using massively parallel short-read sequencers. However, the approximate current diagnostic rate is <50% using these approaches, and there remain many rare genetic diseases with unknown cause. There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome that are difficult to sequence using conventional technologies (e.g., tandem-repeat expansion or complex chromosomal structural aberrations). Despite the drawbacks of high cost and a shortage of standard analytical methods, several studies have analyzed pathogenic changes in the genome using long-read sequencers. The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases. Such approaches may also be applied to molecular diagnosis and therapeutic strategies for patients with genetic diseases in the future.
In the past several years, single-molecule sequencing platforms, such as those by Pacific Biosciences and Oxford Nanopore Technologies, have become available to researchers and are currently being tested for clinical applications. They offer exceptionally long reads that permit direct sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms. This includes disease-causing long repetitive elements, extreme GC content regions, and complex gene loci. Similarly, these platforms enable structural variation characterization at previously unparalleled resolution and direct detection of epigenetic marks in native DNA. Here, we review how these technologies are opening up new clinical avenues that are being applied to pathogenic microorganisms and viruses, constitutional disorders, pharmacogenomics, cancer, and more. Copyright © 2018 Elsevier Ltd. All rights reserved.
The discovery of mutations associated with human genetic disease is an exercise in comparative genomics (see Glossary). Although there are many different strategies and approaches, the central premise is that affected persons harbor a significant excess of pathogenic DNA variants as compared with a group of unaffected persons (controls) that is either clinically defined1 or established by surveying large swaths of the general population.2 The more exclusive the variant is to the disease, the greater its penetrance, the larger its effect size, and the more relevant it becomes to both disease diagnosis and future therapeutic investigation. The most popular approach used by researchers in human genetics is the case–control design, but there are others that can be used to track variants and disease in a family context or that consider the probability of different classes of mutations based on evolutionary patterns of divergence or de novo mutational change.3,4 Although the approaches may be straightforward, the discovery of pathogenic variation and its mechanism of action often is less trivial, and decades of research can be required in order to identify the variants underlying both mendelian and complex genetic traits.
Current programmable nuclease-based methods (for example, CRISPR-Cas9) for the precise correction of a disease-causing genetic mutation harness the homology-directed repair pathway. However, this repair process requires the co-delivery of an exogenous DNA donor to recode the sequence and can be inefficient in many cell types. Here we show that disease-causing frameshift mutations that result from microduplications can be efficiently reverted to the wild-type sequence simply by generating a DNA double-stranded break near the centre of the duplication. We demonstrate this in patient-derived cell lines for two diseases: limb-girdle muscular dystrophy type 2G (LGMD2G)1 and Hermansky-Pudlak syndrome type 1 (HPS1)2. Clonal analysis of inducible pluripotent stem (iPS) cells from the LGMD2G cell line, which contains a mutation in TCAP, treated with the Streptococcus pyogenes Cas9 (SpCas9) nuclease revealed that about 80% contained at least one wild-type TCAP allele; this correction also restored TCAP expression in LGMD2G iPS cell-derived myotubes. SpCas9 also efficiently corrected the genotype of an HPS1 patient-derived B-lymphoblastoid cell line. Inhibition of polyADP-ribose polymerase 1 (PARP-1) suppressed the nuclease-mediated collapse of the microduplication to the wild-type sequence, confirming that precise correction is mediated by the microhomology-mediated end joining (MMEJ) pathway. Analysis of editing by SpCas9 and Lachnospiraceae bacterium ND2006 Cas12a (LbCas12a) at non-pathogenic 4-36-base-pair microduplications within the genome indicates that the correction strategy is broadly applicable to a wide range of microduplication lengths and can be initiated by a variety of nucleases. The simplicity, reliability and efficacy of this MMEJ-based therapeutic strategy should permit the development of nuclease-based gene correction therapies for a variety of diseases that are associated with microduplications.
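The sequence-level outcome described above can be illustrated with a toy example. The sketch below only models the end result (two tandem copies collapsing to one, which restores the reading frame); it does not model nuclease cutting or the repair pathway itself. The sequences, the function name `collapse_microduplication`, and its parameters are hypothetical and are not taken from the study.

```python
def collapse_microduplication(seq: str, dup_start: int, unit_len: int) -> str:
    """Return seq with one copy of an exact tandem duplication removed.

    dup_start is the 0-based index of the first copy and unit_len is the
    length of the duplicated unit. This mimics the outcome of the two
    copies annealing to each other and the sequence between them being lost.
    """
    first = seq[dup_start:dup_start + unit_len]
    second = seq[dup_start + unit_len:dup_start + 2 * unit_len]
    if first != second:
        raise ValueError("no exact tandem duplication at this position")
    # Keep everything up to the end of the first copy, then skip the second.
    return seq[:dup_start + unit_len] + seq[dup_start + 2 * unit_len:]


# Hypothetical 18-nt in-frame wild-type sequence.
wild_type = "ATGGCCTTAGGCAAGTGA"
# Hypothetical mutant allele: the 4-nt unit "TTAG" is duplicated in tandem,
# which shifts the reading frame (22 nt is not a multiple of 3).
mutant = "ATGGCCTTAGTTAGGCAAGTGA"

repaired = collapse_microduplication(mutant, dup_start=6, unit_len=4)
print(repaired == wild_type)
```

Because the duplicated copies are identical, removing either copy gives the same wild-type sequence, which is why no donor template is needed in this toy model.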
The development of clustered regularly interspaced short-palindromic repeat (CRISPR)-Cas systems for genome editing has transformed the way life science research is conducted and holds enormous potential for the treatment of disease as well as for many aspects of biotechnology. Here, I provide a personal perspective on the development of CRISPR-Cas9 for genome editing within the broader context of the field and discuss our work to discover novel Cas effectors and develop them into additional molecular tools. The initial demonstration of Cas9-mediated genome editing launched the development of many other technologies, enabled new lines of biological inquiry, and motivated a deeper examination of natural CRISPR-Cas systems, including the discovery of new types of CRISPR-Cas systems. These new discoveries in turn spurred further technological developments. I review these exciting discoveries and technologies as well as provide an overview of the broad array of applications of these technologies in basic research and in the improvement of human health. It is clear that we are only just beginning to unravel the potential within microbial diversity, and it is quite likely that we will continue to discover other exciting phenomena, some of which it may be possible to repurpose as molecular technologies. The transformation of mysterious natural phenomena to powerful tools, however, takes a collective effort to discover, characterize, and engineer them, and it has been a privilege to join the numerous researchers who have contributed to this transformation of CRISPR-Cas systems.
Long-read sequencing unveils IGH-DUX4 translocation into the silenced IGH allele in B-cell acute lymphoblastic leukemia.
IGH@ proto-oncogene translocation is a common oncogenic event in lymphoid lineage cancers such as B-ALL, lymphoma and multiple myeloma. Here, to investigate the interplay between IGH@ proto-oncogene translocation and IGH allelic exclusion, we perform long-read whole-genome and transcriptome sequencing along with epigenetic and 3D genome profiling of Nalm6, an IGH-DUX4 positive B-ALL cell line. We detect significant allelic imbalance on the wild-type over the IGH-DUX4 haplotype in expression and epigenetic data, showing IGH-DUX4 translocation occurs on the silenced IGH allele. In vitro, this reduces the oncogenic stress of DUX4 high-level expression. Moreover, patient samples of IGH-DUX4 B-ALL have similar expression profile and IGH breakpoints as Nalm6, suggesting a common mechanism to allow optimal dosage of non-toxic DUX4 expression.
Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight.
The human genome contains “dark” gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions. Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer’s Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in disease cases but not in controls. While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer’s disease due to insufficient sample-size, we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
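The two definitions above (dark by depth: few mappable reads; camouflaged: ambiguous alignment) can be sketched as a simple per-position classifier. This is a minimal illustration, not the study's pipeline: the function names, the input representation (per-position read depth plus the fraction of reads with low mapping quality), and both thresholds are assumptions chosen for the example.

```python
from typing import List, Tuple


def classify_positions(
    depth: List[int],
    low_mapq_fraction: List[float],
    max_dark_depth: int = 5,             # assumed threshold, for illustration
    min_low_mapq_fraction: float = 0.9,  # assumed threshold, for illustration
) -> List[str]:
    """Label each position as 'dark_by_depth', 'camouflaged', or 'ok'."""
    labels = []
    for d, f in zip(depth, low_mapq_fraction):
        if d <= max_dark_depth:
            # Too few reads map here at all.
            labels.append("dark_by_depth")
        elif f >= min_low_mapq_fraction:
            # Reads are present, but nearly all align ambiguously.
            labels.append("camouflaged")
        else:
            labels.append("ok")
    return labels


def merge_runs(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Merge consecutive identical non-'ok' labels into half-open intervals."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "ok":
                runs.append((start, i, labels[start]))
            start = i
    return runs


# Hypothetical six-position example.
depth = [30, 2, 0, 28, 25, 31]
low_mapq = [0.0, 0.0, 0.0, 0.95, 1.0, 0.1]
regions = merge_runs(classify_positions(depth, low_mapq))
print(regions)
```

In practice the per-position depth and mapping-quality values would come from an alignment file; they are hard-coded here so the example stays self-contained.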
Tandemly repeated DNA is highly mutable and causes at least 31 diseases, but it is hard to detect pathogenic repeat expansions genome-wide. Here, we report robust detection of human repeat expansions from careful alignments of long but error-prone (PacBio and nanopore) reads to a reference genome. Our method is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we prioritize pathogenic expansions within the top 10 out of 700,000 tandem repeats in whole genome sequencing data. This may help to elucidate the many genetic diseases whose causes remain unknown.
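The idea of prioritizing expansions by comparison with healthy controls can be illustrated with a simple ranking. The sketch below is a stand-in, not the scoring used by the authors: it ranks each locus by how far the patient's repeat copy number exceeds the largest value seen in any control. The function name `prioritize_expansions`, the locus names, and all copy numbers are hypothetical.

```python
from typing import Dict, List


def prioritize_expansions(
    patient: Dict[str, int],
    controls: List[Dict[str, int]],
    top_n: int = 10,
) -> List[str]:
    """Rank loci by patient copy number minus the maximum seen in controls.

    Only loci where the patient exceeds every control are returned.
    A locus missing from a control is treated as copy number 0 there.
    """
    scores = []
    for locus, size in patient.items():
        control_max = max((c.get(locus, 0) for c in controls), default=0)
        scores.append((size - control_max, locus))
    scores.sort(reverse=True)
    return [locus for excess, locus in scores[:top_n] if excess > 0]


# Hypothetical repeat copy numbers per locus.
patient = {"locusA": 5, "locusB": 120, "locusC": 40}
controls = [
    {"locusA": 6, "locusB": 20, "locusC": 38},
    {"locusA": 4, "locusB": 25, "locusC": 41},
]
candidates = prioritize_expansions(patient, controls)
print(candidates)
```

Comparing against the control maximum rather than the mean is a deliberately conservative choice for this example, since naturally polymorphic repeats would otherwise dominate the ranking.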
In recent genome analyses, population-specific reference panels have proven important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bp to ~10,000 bp in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.
TALENs facilitate targeted genome editing in human cells with high specificity and low cytotoxicity.
Designer nucleases have been successfully employed to modify the genomes of various model organisms and human cell types. While the specificity of zinc-finger nucleases (ZFNs) and RNA-guided endonucleases has been assessed to some extent, little data are available for transcription activator-like effector-based nucleases (TALENs). Here, we have engineered TALEN pairs targeting three human loci (CCR5, AAVS1 and IL2RG) and performed a detailed analysis of their activity, toxicity and specificity. The TALENs showed comparable activity to benchmark ZFNs, with allelic gene disruption frequencies of 15-30% in human cells. Notably, TALEN expression was overall marked by a low cytotoxicity and the absence of cell cycle aberrations. Bioinformatics-based analysis of designer nuclease specificity confirmed partly substantial off-target activity of ZFNs targeting CCR5 and AAVS1 at six known and five novel sites, respectively. In contrast, only marginal off-target cleavage activity was detected at four out of 49 predicted off-target sites for CCR5- and AAVS1-specific TALENs. The rational design of a CCR5-specific TALEN pair decreased off-target activity at the closely related CCR2 locus considerably, consistent with fewer genomic rearrangements between the two loci. In conclusion, our results link nuclease-associated toxicity to off-target cleavage activity and corroborate TALENs as a highly specific platform for future clinical translation. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Vector design Tour de Force: integrating combinatorial and rational approaches to derive novel adeno-associated virus variants.
Methodologies to improve existing adeno-associated virus (AAV) vectors for gene therapy include either rational approaches or directed evolution to derive capsid variants characterized by superior transduction efficiencies in targeted tissues. Here, we integrated both approaches in one unified design strategy of “virtual family shuffling” to derive a combinatorial capsid library whereby only variable regions on the surface of the capsid are modified. Individual sublibraries were first assembled in order to preselect compatible amino acid residues within restricted surface-exposed regions to minimize the generation of dead-end variants. Subsequently, the successful families were interbred to derive a combined library of ~8 × 10^5 complexity. Next-generation sequencing of the packaged viral DNA revealed capsid surface areas susceptible to directed evolution, thus providing guidance for future designs. We demonstrated the utility of the library by deriving an AAV2-based vector characterized by a 20-fold higher transduction efficiency in murine liver, now equivalent to that of AAV8.
The simplicity of site-specific genome targeting by type II clustered, regularly interspaced, short palindromic repeat (CRISPR)-Cas9 nucleases, along with their robust activity profile, has changed the landscape of genome editing. These favorable properties have made the CRISPR-Cas9 system the technology of choice for sequence-specific modifications in vertebrate systems. For many applications, whether the focus is on basic science investigations or therapeutic efficacy, activity and precision are important considerations when one is choosing a nuclease platform, target site and delivery method. Here we review recent methods for increasing the activity and accuracy of Cas9 and assessing the extent of off-target cleavage events.
Accurate identification and quantification of DNA species by next-generation sequencing in adeno-associated viral vectors produced in insect cells.
Recombinant adeno-associated viral (rAAV) vectors have proven excellent tools for the treatment of many genetic diseases and other complex diseases. However, the illegitimate encapsidation of DNA contaminants within viral particles constitutes a major safety concern for rAAV-based therapies. Moreover, the development of rAAV vectors for early-phase clinical trials has revealed the limited accuracy of the analytical tools used to characterize these new and complex drugs. Although most published data concerning residual DNA in rAAV preparations have been generated by quantitative PCR, we have developed a novel single-strand virus sequencing (SSV-Seq) method for quantification of DNA contaminants in AAV vectors produced in mammalian cells by next-generation sequencing (NGS). Here, we describe the adaptation of SSV-Seq for the accurate identification and quantification of DNA species in rAAV stocks produced in insect cells. We found that baculoviral DNA was the most abundant contaminant, representing less than 2.1% of NGS reads regardless of serotype (2, 8, or rh10). Sf9 producer cell DNA was detected at low frequency (≤0.03%) in rAAV lots. Advanced computational analyses revealed that (1) baculoviral sequences close to the inverted terminal repeats preferentially underwent illegitimate encapsidation, and (2) single-nucleotide variants were absent from the rAAV genome. The high-throughput sequencing protocol described here enables effective DNA quality control of rAAV vectors produced in insect cells, and is adapted to conform with regulatory agency safety requirements.