Jim Lupski is a professor at Baylor College of Medicine where he’s on the frontline of incorporating genomic research into everyday clinical practice. The story begins with Jim’s own genome,…
The accurate identification of DNA sequence variants is an important, but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5-15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves 99.67, 95.78, 90.53% F1-score on 1KP common variants, and 98.65, 92.57, 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows Clairvoyante is sample agnostic and finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source ( https://github.com/aquaskyline/Clairvoyante ), with modules to train, utilize and visualize the model.
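As a reminder of what the F1-scores quoted above measure, the sketch below computes F1 as the harmonic mean of precision and recall from true-positive, false-positive and false-negative variant counts. The function name and the example counts are illustrative assumptions, not values or code from the Clairvoyante study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score given counts of true positives,
    false positives and false negatives (hypothetical helper)."""
    if tp == 0:
        # No correct calls: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)  # fraction of called variants that are real
    recall = tp / (tp + fn)     # fraction of real variants that were called
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(f"F1 = {f1_score(tp=90, fp=10, fn=10):.4f}")
```

Because F1 penalizes both missed variants and false calls, a single figure per platform summarizes both kinds of error.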
Gut microbiota is a determining factor in human physiological functions and health. It is commonly accepted that diet has a major influence on the gut microbial community; however, the effects of diet are not fully understood. The typical Mongolian diet is characterized by high and frequent consumption of fermented dairy products and red meat, and a low level of carbohydrates. In this study, the gut microbiota profile of 26 Mongolians who consumed wheat, rice and oat as the sole carbohydrate staple food for a week each consecutively was determined. It was observed that changes in staple carbohydrate rapidly (within a week) altered the gut microbial community structure and metabolic pathways of the subjects. Wheat and oat favored bifidobacteria (Bifidobacterium catenulatum, Bifidobacterium bifidum, Bifidobacterium adolescentis), whereas rice suppressed bifidobacteria (Bifidobacterium longum, Bifidobacterium adolescentis) and wheat suppressed Lactobacillus, Ruminococcus and Bacteroides. The study exhibited two gut microbial clustering patterns with the preference of fucosyllactose utilization linking to fucosidase genes (glycoside hydrolase family classifications: GH95 and GH29) encoded by Bifidobacterium, and xylan and arabinoxylan utilization linking to xylanase and arabinoxylanase genes encoded by Bacteroides. There was also a correlation between Lactobacillus ruminis and sialidase, as well as between Butyrivibrio crossotus and xylanase/xylosidase. Meanwhile, a strong concordance was found between the gastrointestinal bacterial microbiome and the intestinal virome. The present research will contribute to understanding the impacts of dietary carbohydrate on the human gut microbiome, which will ultimately help in understanding the relationships between dietary factors, microbial populations, and human health worldwide.
Comprehensive genomic analysis of malignant pleural mesothelioma identifies recurrent mutations, gene fusions and splicing alterations.
We analyzed transcriptomes (n = 211), whole exomes (n = 99) and targeted exomes (n = 103) from 216 malignant pleural mesothelioma (MPM) tumors. Using RNA-seq data, we identified four distinct molecular subtypes: sarcomatoid, epithelioid, biphasic-epithelioid (biphasic-E) and biphasic-sarcomatoid (biphasic-S). Through exome analysis, we found BAP1, NF2, TP53, SETD2, DDX3X, ULK2, RYR2, CFAP45, SETDB1 and DDX51 to be significantly mutated (q-score ≥ 0.8) in MPMs. We identified recurrent mutations in several genes, including SF3B1 (~2%; 4/216) and TRAF7 (~2%; 5/216). SF3B1-mutant samples showed a splicing profile distinct from that of wild-type tumors. TRAF7 alterations occurred primarily in the WD40 domain and were, except in one case, mutually exclusive with NF2 alterations. We found recurrent gene fusions and splice alterations to be frequent mechanisms for inactivation of NF2, BAP1 and SETD2. Through integrated analyses, we identified alterations in Hippo, mTOR, histone methylation, RNA helicase and p53 signaling pathways in MPMs.
Alternative RNA splicing is a known phenomenon, but we still do not have a complete catalog of isoforms that explain variability in the human transcriptome. We have made significant progress in developing methods to study variability of the transcriptome, but we are far from having a complete picture of it. The initial methods to study gene expression were based on cloning of cDNAs and Sanger sequencing. The strategy was labor-intensive and expensive. With the development of microarrays, different methods based on exon arrays and tiling arrays provided valuable information about RNA expression. However, microarrays presented significant limitations. Most of the limitations became apparent by 2005, but it was not until 2008 that an alternative method to study the transcriptome was developed. RNA sequencing using next-generation sequencing (RNA-Seq) quickly became the technology of choice for gene expression profiling. Recently, the precision and sensitivity of RNA-Seq have come into question, especially for transcriptome reconstruction. This chapter will describe a relatively new method, “Isoform Sequencing” (Iso-Seq). Iso-Seq was developed by Pacific Biosciences (PacBio), and it is capable of identifying new isoforms with extraordinary precision due to its long-read technology. The technique to create libraries is straightforward, and the PacBio RS II instrument generates the information in hours. The bioinformatics analysis is performed using the freely available SMRT® Portal software. The SMRT Portal is easy to use and capable of performing all the steps necessary to analyze the raw data and to generate high-quality full-length isoforms. For the universal acceptance of the Iso-Seq method, the capacity of the SMRT Cells needs to improve at least 10- to 100-fold to make the system affordable and attractive to users.
Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line.
The SK-BR-3 cell line is one of the most important models for HER2+ breast cancers, which affect one in five breast cancer patients. SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported. Addressing this, we sequenced SK-BR-3 using long-read single molecule sequencing from Pacific Biosciences and developed one of the most detailed maps of structural variations (SVs) in a cancer genome available, with nearly 20,000 variants present, most of which were missed by short-read sequencing. Surrounding the important ERBB2 oncogene (also known as HER2), we discovered a complex sequence of nested duplications and translocations, suggesting a punctuated progression. Full-length transcriptome sequencing further revealed several novel gene fusions within the nested genomic variants. Combining long-read genome and transcriptome sequencing enables an in-depth analysis of how SVs disrupt the genome and sheds new light on the complex mechanisms involved in cancer genome evolution. © 2018 Nattestad et al.; Published by Cold Spring Harbor Laboratory Press.
Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
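For readers unfamiliar with the N50 statistic reported above, the following sketch shows the conventional definition: the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are assumptions for illustration and are not taken from the HX1 assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths
    (hypothetical helper using the standard definition)."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the cumulative
        # length to at least half of the assembly.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 indicates that more of the assembly sits in long, contiguous pieces, which is why contig and scaffold N50 are reported separately.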
Improved full-length killer cell immunoglobulin-like receptor transcript discovery in Mauritian cynomolgus macaques.
Killer cell immunoglobulin-like receptors (KIRs) modulate disease progression of pathogens including HIV, malaria, and hepatitis C. Cynomolgus and rhesus macaques are widely used as nonhuman primate models to study human pathogens, and so considerable effort has been put into characterizing their KIR genetics. However, previous studies have relied on cDNA cloning and Sanger sequencing, which lack the throughput of current sequencing platforms. In this study, we present a high-throughput, full-length allele discovery method utilizing Pacific Biosciences circular consensus sequencing (CCS). We also describe a new approach to Macaque Exome Sequencing (MES) and the development of the Rhexome1.0, an adapted target capture reagent that includes macaque-specific capture probe sets. By using sequence reads generated by whole genome sequencing (WGS) and MES to inform primer design, we were able to increase the sensitivity of KIR allele discovery. We demonstrate this increased sensitivity by defining nine novel alleles within a cohort of Mauritian cynomolgus macaques (MCM), a geographically isolated population with restricted KIR genetics that was thought to be completely characterized. Finally, we describe an approach to genotyping KIRs directly from sequence reads generated using WGS/MES. The findings presented here expand our understanding of KIR genetics in MCM by associating new genes with all eight KIR haplotypes and demonstrating the existence of at least one KIR3DS gene associated with every haplotype.
In contrast to infections with human immunodeficiency virus (HIV) in humans and simian immunodeficiency virus (SIV) in macaques, SIV infection of a natural host, sooty mangabeys (Cercocebus atys), is non-pathogenic despite high viraemia. Here we sequenced and assembled the genome of a captive sooty mangabey. We conducted genome-wide comparative analyses of transcript assemblies from C. atys and AIDS-susceptible species, such as humans and macaques, to identify candidates for host genetic factors that influence susceptibility. We identified several immune-related genes in the genome of C. atys that show substantial sequence divergence from macaques or humans. One of these sequence divergences, a C-terminal frameshift in the toll-like receptor-4 (TLR4) gene of C. atys, is associated with a blunted in vitro response to TLR-4 ligands. In addition, we found a major structural change in exons 3-4 of the immune-regulatory protein intercellular adhesion molecule 2 (ICAM-2); expression of this variant leads to reduced cell surface expression of ICAM-2. These data provide a resource for comparative genomic studies of HIV and/or SIV pathogenesis and may help to elucidate the mechanisms by which SIV-infected sooty mangabeys avoid AIDS.
Thanks to a recent spate of sequencing projects, the Hemiptera are the first hemimetabolous insect order to achieve a critical mass of species with sequenced genomes, establishing the basis for comparative genomics of the bugs. However, as the most speciose hemimetabolous order, there is still a vast swathe of the hemipteran phylogeny that awaits genomic representation across subterranean, terrestrial, and aquatic habitats, and with lineage-specific and developmentally plastic cases of both wing polyphenisms and flightlessness. In this review, we highlight opportunities for taxonomic sampling beyond obvious pest species candidates, motivated by intriguing biological features of certain groups as well as the rich research tradition of ecological, physiological, developmental, and particularly cytogenetic investigation that spans the diversity of the Hemiptera.
Complete genome sequence of N2-fixing model strain Klebsiella sp. nov. M5al, which produces plant cell wall-degrading enzymes and siderophores.
The bacterial strain M5al is a model strain for studying the molecular genetics of N2-fixation and molecular engineering of microbial production of platform chemicals 1,3-propanediol and 2,3-butanediol. Here, we present the complete genome sequence of the strain M5al, which belongs to a novel species closely related to Klebsiella michiganensis. M5al secretes plant cell wall-degrading enzymes and colonizes rice roots but does not cause soft rot disease. M5al also produces siderophores and contains the gene clusters for synthesis and transport of yersiniabactin which is a critical virulence factor for Klebsiella pathogens in causing human disease. We propose that the model strain M5al can be genetically modified to study bacterial N2-fixation in association with non-legume plants and production of 1,3-propanediol and 2,3-butanediol through degradation of plant cell wall biomass.
Trypanosoma cruzi, a zoonotic kinetoplastid protozoan with a complex genome, is the causative agent of American trypanosomiasis (Chagas disease). The parasite uses a highly diverse repertoire of surface molecules, with roles in cell invasion, immune evasion and pathogenesis. Thus far, the genomic regions containing these genes have been impossible to resolve and it has been impossible to study the structure and function of the several thousand repetitive genes encoding the surface molecules of the parasite. We here present an improved genome assembly of a T. cruzi clade I (TcI) strain using high coverage PacBio single molecule sequencing, together with Illumina sequencing of 34 T. cruzi TcI isolates and clones from different geographic locations, sample sources and clinical outcomes. Resolution of the surface molecule gene structure reveals an unusual duality in the organisation of the parasite genome, a core genomic region syntenous with related protozoa flanked by unique and highly plastic subtelomeric regions encoding surface antigens. The presence of abundant interspersed retrotransposons in the subtelomeres suggests that these elements are involved in a recombination mechanism for the generation of antigenic variation and evasion of the host immune response. The comparative genomic analysis of the cohort of TcI strains revealed multiple cases of such recombination events involving surface molecule genes and has provided new insights into T. cruzi population structure.
The accumulation of sequenced Francisella strains has made it increasingly apparent that the 16S rRNA gene alone is not enough to stratify the Francisella genus into precise and clinically useful classifications. Continued whole-genome sequencing of isolates will provide a larger base of knowledge for targeted approaches with broad applicability. Additionally, examination of genomic information on a case-by-case basis will help resolve outstanding questions regarding strain stratification. We report the complete genome sequence of a clinical isolate, designated here as F. novicida-like strain TCH2015, acquired from the lymph node of a 6-year-old male. Two features were atypical for F. novicida: exhibition of functional oxidase activity and additional gene content, including proposed virulence determinants. These differences, which could potentially impact virulence and clinical diagnosis, emphasize the need for more comprehensive methods to profile Francisella isolates. This study highlights the value of whole-genome sequencing, which will lead to a more robust database of environmental and clinical genomes and inform strategies to improve detection and classification of Francisella strains. Copyright © 2017 Elsevier Inc. All rights reserved.
Bats harbor many viruses asymptomatically, including several notorious for causing extreme virulence in humans. To identify differences between antiviral mechanisms in humans and bats, we sequenced, assembled, and analyzed the genome of Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus. We found an expanded and diversified KLRC/KLRD family of natural killer cell receptors, MHC class I genes, and type I interferons, which dramatically differ from their functional counterparts in other mammals. Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense. An evaluation of the theoretical function of these genes suggests that an inhibitory immune state may exist in bats. Based on our findings, we hypothesize that tolerance of viral infection, rather than enhanced potency of antiviral defenses, may be a key mechanism by which bats asymptomatically host viruses that are pathogenic in humans. Copyright © 2018 Elsevier Inc. All rights reserved.
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly, (publicly available at https://sourceforge.net/projects/pb-jelly/) automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.