Personal transcriptomes in which all of an individual’s genetic variants (e.g., single nucleotide variants) and transcript isoforms (transcription start sites, splice sites, and polyA sites) are defined and quantified for full-length transcripts are expected to be important for understanding individual biology and disease, but have not been described previously. To obtain such transcriptomes, we sequenced the lymphoblastoid transcriptomes of three family members (GM12878 and the parents GM12891 and GM12892) by using a Pacific Biosciences long-read approach complemented with Illumina 101-bp sequencing and made the following observations. First, we found that reads representing all splice sites of a transcript are evident for most sufficiently expressed genes ≤3 kb and often for genes longer than that. Second, we added and quantified previously unidentified splicing isoforms to an existing annotation, thus creating the first personalized annotation to our knowledge. Third, we determined SNVs in a de novo manner and connected them to RNA haplotypes, including HLA haplotypes, thereby assigning single full-length RNA molecules to their transcribed allele, and demonstrated Mendelian inheritance of RNA molecules. Fourth, we show how RNA molecules can be linked to personal variants on a one-by-one basis, which allows us to assess differential allelic expression (DAE) and differential allelic isoforms (DAI) from the phased full-length isoform reads. The DAI method is largely independent of the distance between exon and SNV, in contrast to fragmentation-based methods. Overall, in addition to improving eukaryotic transcriptome annotation, these results describe, to our knowledge, the first large-scale and full-length personal transcriptome.
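The idea of assigning a single full-length read to its transcribed allele can be illustrated with a small sketch. This is not the authors' pipeline: the function name `assign_allele`, the dictionary layout and the toy SNV positions are hypothetical, and a simple majority vote over heterozygous SNVs is assumed purely for illustration.

```python
from collections import Counter


def assign_allele(read_bases, phased_snvs):
    """Assign one read to a parental allele by majority vote (illustrative only).

    read_bases:  dict mapping position -> base observed in the read
    phased_snvs: dict mapping position -> (maternal_base, paternal_base)
    """
    votes = Counter()
    for pos, (maternal, paternal) in phased_snvs.items():
        base = read_bases.get(pos)  # None if the read does not cover this SNV
        if base == maternal:
            votes["maternal"] += 1
        elif base == paternal:
            votes["paternal"] += 1
    if votes["maternal"] > votes["paternal"]:
        return "maternal"
    if votes["paternal"] > votes["maternal"]:
        return "paternal"
    return "unassigned"  # no informative SNV covered, or a tie


# Toy heterozygous SNVs (hypothetical positions and bases).
toy_snvs = {100: ("A", "G"), 250: ("C", "T")}
print(assign_allele({100: "A", 250: "C"}, toy_snvs))
```

Because a long read spans the whole transcript, any heterozygous SNV it covers can inform the assignment regardless of how far that SNV lies from a given exon, which is the property the abstract contrasts with fragmentation-based methods.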
Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations.
We analyzed transcriptomes (n = 211), whole exomes (n = 99) and targeted exomes (n = 103) from 216 malignant pleural mesothelioma (MPM) tumors. Using RNA-seq data, we identified four distinct molecular subtypes: sarcomatoid, epithelioid, biphasic-epithelioid (biphasic-E) and biphasic-sarcomatoid (biphasic-S). Through exome analysis, we found BAP1, NF2, TP53, SETD2, DDX3X, ULK2, RYR2, CFAP45, SETDB1 and DDX51 to be significantly mutated (q-score ≥ 0.8) in MPMs. We identified recurrent mutations in several genes, including SF3B1 (~2%; 4/216) and TRAF7 (~2%; 5/216). SF3B1-mutant samples showed a splicing profile distinct from that of wild-type tumors. TRAF7 alterations occurred primarily in the WD40 domain and were, except in one case, mutually exclusive with NF2 alterations. We found recurrent gene fusions and splice alterations to be frequent mechanisms for inactivation of NF2, BAP1 and SETD2. Through integrated analyses, we identified alterations in Hippo, mTOR, histone methylation, RNA helicase and p53 signaling pathways in MPMs.
Over the past decade, the field of genomics has seen such drastic improvements in sequencing chemistries that high-throughput sequencing, or next-generation sequencing (NGS), is being applied to generate data across many disciplines. NGS instruments are becoming less expensive, faster, and smaller, and therefore are being adopted in an increasing number of laboratories, including clinical laboratories. Thus far, clinical use of NGS has been mostly focused on the human genome, for purposes such as characterizing the molecular basis of cancer or for diagnosing and understanding the basis of rare genetic disorders. There are, however, an increasing number of examples whereby NGS is employed to discover novel pathogens, and these cases provide precedent for the use of NGS in microbial diagnostics. NGS has many advantages over traditional microbial diagnostic methods, such as unbiased rather than pathogen-specific protocols, ability to detect fastidious or non-culturable organisms, and ability to detect co-infections. One of the most impressive advantages of NGS is that it requires little or no prior knowledge of the pathogen, unlike many other diagnostic assays; therefore for pathogen discovery, NGS is very valuable. However, despite these advantages, there are challenges involved in implementing NGS for routine clinical microbiological diagnosis. We discuss these advantages and challenges in the context of recently described research studies.
Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics.
Short read massive parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio’s single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we capsulize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
It is widely acknowledged that transcriptional diversity contributes substantially to biological regulation in eukaryotes. Since the advent of second-generation sequencing technologies, a large number of RNA sequencing studies have considerably improved our understanding of transcriptome complexity. However, obtaining full-length transcripts remains a major challenge because of the difficulties of short read-based assembly. In the present study we employ PacBio single-molecule long-read sequencing technology for whole-transcriptome profiling in rabbit (Oryctolagus cuniculus). In total, we obtain 36,186 high-confidence transcripts from 14,474 genic loci, among which more than 23% of genic loci and 66% of isoforms have not yet been annotated in the current reference genome. Furthermore, about 17% of transcripts are computationally predicted to be non-coding RNAs. Up to 24,797 alternative splicing (AS) and 11,184 alternative polyadenylation (APA) events, respectively, are detected within this de novo constructed transcriptome. The results provide a comprehensive set of reference transcripts and hence contribute to improved annotation of the rabbit genome.
In recent years long-read technologies have moved from being a niche and specialist field to a point of relative maturity likely to feature frequently in the genomic landscape. Analogous to next generation sequencing, the cost of sequencing using long-read technologies has materially dropped whilst the instrument throughput continues to increase. Together these changes present the prospect of sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution. In this article, we will endeavour to present an introduction to long-read technologies showing: what long reads are; how they are distinct from short reads; why long reads are useful and how they are being used. We will highlight the recent developments in this field, and the applications and potential of these technologies in medical research, and clinical diagnostics and therapeutics.
Predicting an HLA-DPB1 expression marker based on standard DPB1 genotyping: Linkage analysis of over 32,000 samples.
The risk of acute graft-versus-host disease (GvHD) after hematopoietic stem cell transplantation is increased with donor-recipient HLA-DPB1 allele mismatching. The single-nucleotide polymorphism (SNP) rs9277534 within the 3′ untranslated region (UTR) correlates with HLA-DPB1 allotype expression and serves as a marker for permissive HLA-DPB1 mismatches. Since rs9277534 is not routinely typed, we analyzed 32,681 samples of mostly European ancestry to investigate if the rs9277534 allele can be reliably imputed from standard DPB1 genotyping. We confirmed the previously-defined linkages between rs9277534 and 18 DPB1 alleles and established additional linkages for 46 DPB1 alleles. Based on these linkages, the rs9277534 allele could be predicted for 99.6% of the samples based on DPB1 genotypes (99.99% concordance). We demonstrate that 100% prediction accuracy could be achieved if the prediction utilized exon 3 sequence information. DPB1 genotyping based on exon 2 data alone allows no unambiguous rs9277534 allele prediction but was estimated to maintain 99% accuracy for samples of European descent. We conclude that DPB1 genotyping is sufficient to infer the DPB1 expression marker rs9277534 with high accuracy. This information could be used to select donors with permissive HLA-DPB1 mismatches without directly screening for rs9277534. Copyright © 2017 The Author(s). Published by Elsevier Inc. All rights reserved.
Novel allele, HLA-B*51:220 generated by a gene conversion event was identified in a Brazilian individual. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Conventional and single-molecule targeted sequencing method for specific variant detection in IKBKG while bypassing the IKBKGP1 pseudogene.
In addition to Sanger sequencing, next-generation sequencing of gene panels and exomes has emerged as a standard diagnostic tool in many laboratories. However, these captures can miss regions, have poor efficiency, or capture pseudogenes, which hamper proper diagnoses. One such example is the primary immunodeficiency-associated gene IKBKG. Its pseudogene IKBKGP1 makes traditional capture methods aspecific. We therefore developed a long-range PCR method to efficiently target IKBKG, as well as two associated genes (IRAK4 and MYD88), while bypassing the IKBKGP1 pseudogene. Sequencing accuracy was evaluated using both conventional short-read technology and a newer long-read, single-molecule sequencer. Different mapping and variant calling options were evaluated in their capability to bypass the pseudogene using both sequencing platforms. Based on these evaluations, we determined a robust diagnostic application for unambiguous sequencing and variant calling in IKBKG, IRAK4, and MYD88. This method allows rapid identification of selected primary immunodeficiency diseases in patients suffering from life-threatening invasive pyogenic bacterial infections. Copyright © 2018 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Characterization of a novel allele, HLA-C*02:135N, by full-length gene sequencing in a bone marrow donor.
A frameshift because of a two-nucleotide deletion results in an HLA-C null allele, HLA-C*02:135N. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
The novel KIR2DL1*037 allele discovered and characterised by single molecule real-time (SMRT) DNA sequencing. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Computational comparison of availability in CTL/gag epitopes among patients with acute and chronic HIV-1 infection.
Recent studies indicate that there is selection bias for transmission of viral polymorphisms associated with higher viral fitness. Furthermore, after transmission and before a specific immune response is mounted in the recipient, the virus undergoes a number of reversions that increase its replicative capacity. These aspects, among others, shape the viral population characteristic of early acute infection. A total of 160 single gag-gene amplifications were obtained by limiting-dilution RT-PCR from plasma samples of 8 ARV-naive patients with early acute infection (<30 days, 22 days average) and 8 ARV-naive patients with approximately a year of infection (10 amplicons per patient). Sanger sequencing and NGS SMRT technology (Pacific Biosciences) were used to sequence the amplicons. Phylogenetic analysis was performed using MEGA 6.06. HLA-I (A and B) typing was performed by the SSOP-PCR method. The chromatograms were analyzed with Sequencher 4.10. Epitope and immunoproteasomal cleavage prediction was performed with the CBS prediction server for the 30 HLA-A and -B alleles most prevalent in our population, with peptide lengths of 8 to 14 mer. Cytotoxic response prediction was performed using the IEDB Analysis Resource. After implementing epitope prediction analysis, we identified a total of 325 possible viral epitopes present in two or more acute or chronic patients. 60.3% (n = 196) of them were present only in acute infection (prevalent acute epitopes) while 39.7% (n = 129) were present only in chronic infection (prevalent chronic epitopes). Within p24, the difference was equally dramatic, with 59.4% (79/133) being acute epitopes (p ≤ 0.05). This is consistent with progressive viral adaptation to the immune response over time and is further supported by the fact that cytotoxic response prediction showed that acute epitopes are more likely to generate an immune response than chronic epitopes. Interestingly, only 27.5% of acute epitopes match the population-level consensus sequence of the virus. Our results indicate that certain non-consensus viral residues might be transmitted more frequently than consensus residues when located in immunologically relevant positions (epitopes). This observation might be relevant to the rationale behind development of an effective vaccine, based on prevalent acute epitopes, to reduce the viral reservoir and induce a functional cure of HIV infection. Copyright © 2018 Elsevier Ltd. All rights reserved.
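The epitope proportions reported in this abstract can be re-derived from the stated counts. The short check below uses only numbers given in the text; the helper name `pct` is hypothetical, and rounding to one decimal place is assumed.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_epitopes = 325            # epitopes present in two or more patients
acute_only = 196                # prevalent acute epitopes
chronic_only = 129              # prevalent chronic epitopes

# The two groups together account for every predicted epitope.
assert acute_only + chronic_only == total_epitopes

print(pct(acute_only, total_epitopes))    # share of acute-only epitopes
print(pct(chronic_only, total_epitopes))  # share of chronic-only epitopes
print(pct(79, 133))                       # acute share within p24
```

The three values agree with the 60.3%, 39.7% and 59.4% quoted in the abstract.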
High-Resolution Full-Length HLA Typing Method Using Third Generation (Pac-Bio SMRT) Sequencing Technology.
The human HLA genes are among the most polymorphic genes in the human genome. Therefore, it is very difficult to find two unrelated individuals with identical HLA molecules. As a result, HLA Class I and Class II genes are routinely sequenced or serotyped for organ transplantation, autoimmune disease-association studies, drug hypersensitivity research, and other applications. However, these methods yield only two- or four-digit data, which is not sufficient to resolve complete HLA gene haplotypes. To overcome these limitations, we describe here an end-to-end workflow for sequencing HLA class I and class II genes using third-generation sequencing, SMRT technology. This method produces fully phased, unambiguous, allele-level information on the PacBio System.
Full-length extension of HLA allele sequences by HLA allele-specific hemizygous Sanger sequencing (SSBT).
The gold standard for typing at the allele level of the highly polymorphic Human Leucocyte Antigen (HLA) gene system is sequence-based typing. Since sequencing strategies have mainly focused on identification of the peptide binding groove, full-length sequence information is lacking for >90% of the HLA alleles. One of the goals of the 17th IHIWS workshop is to establish full-length sequences for as many HLA alleles as possible. In our component “Extension of HLA sequences by full-length HLA allele-specific hemizygous Sanger sequencing” we have used full-length hemizygous Sanger Sequence Based Typing to achieve this goal. We selected samples for which full-length sequences were not available in the IPD-IMGT/HLA database. In total we have generated the full-length sequences of 48 HLA-A, 45 -B and 31 -C alleles. For HLA-A extended alleles, 39/48 showed no intron differences compared to the first allele of the corresponding allele group; for HLA-B this was 26/45 and for HLA-C 20/31. Comparing the intron sequences to other alleles of the same allele group revealed that in 5/48 HLA-A, 16/45 HLA-B and 8/31 HLA-C alleles the intron sequence was identical to another allele of the same allele group. In the remaining 10 cases, the sequence either showed polymorphism at a conserved nucleotide or was the result of a gene conversion event. Elucidation of the full-length sequence gives insight into the polymorphic content of the alleles and facilitates the identification of their evolutionary origin. Copyright © 2018 American Society for Histocompatibility and Immunogenetics. All rights reserved.
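The "remaining 10 cases" can be reconciled with the per-locus counts given in this abstract. The tally below assumes the two intron categories (identical to the first allele of the group, identical to another allele of the group) do not overlap; that reading is an assumption, and the variable names are hypothetical.

```python
# Counts transcribed from the abstract, per locus:
# (extended alleles, introns match first allele of group, introns match another allele of group)
counts = {
    "HLA-A": (48, 39, 5),
    "HLA-B": (45, 26, 16),
    "HLA-C": (31, 20, 8),
}


def remaining_per_locus(table):
    """Alleles whose introns match neither the first nor another allele of their group."""
    return {locus: total - first - other
            for locus, (total, first, other) in table.items()}


remaining = remaining_per_locus(counts)
total_alleles = sum(total for total, _, _ in counts.values())
total_remaining = sum(remaining.values())

print(remaining)
print(total_alleles, total_remaining)
```

Under that assumption the leftover alleles across HLA-A, -B and -C sum to 10, which agrees with the figure stated in the abstract.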
Full gene HLA class I sequences of 79 novel and 519 mostly uncommon alleles from a large United States registry population.
HLA class I assignments were obtained at single genotype, G-level resolution from 98,855 volunteers for an unrelated donor registry in the United States. In spite of the diverse ancestry of the volunteers, over 99% of the assignments at each locus are common. Within this population, 52 novel alleles differing in exons 2 and 3 are identified and characterized. Previously reported alleles with incomplete sequences in the IPD-IMGT/HLA database (n = 519) were selected for full gene sequencing and, from this sampling, another 27 novel alleles are described. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.