Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly has dropped precipitously, which has spurred interest in genome sequencing overall. Unfortunately, the contiguity of NGS-assembled sequences, as well as the accuracy of these assemblies, has suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference-quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove difficult to annotate properly. In some cases the contiguous portions are smaller than the average gene size, making the sequence far less useful to biologists than the earlier reference-quality genomes, including those of human, mouse, C. elegans, and Drosophila. Recently, new third-generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high-quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life, from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be gained from better assemblies, including more complete analysis of genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.
While advances in RNA sequencing methods have accelerated our understanding of the human transcriptome, isoform discovery remains a challenge because short read lengths require complicated assembly algorithms to infer the contiguity of full-length transcripts. With PacBio’s long reads, one can now sequence full-length transcript isoforms up to 10 kb. The PacBio Iso-Seq protocol produces reads that originate from independent observations of single molecules, meaning no assembly is needed. Here, we sequenced the transcriptome of the human MCF-7 breast cancer cell line using the Clontech SMARTer® cDNA preparation kit and the PacBio RS II. Using PacBio Iso-Seq bioinformatics software, we obtained 55,770 unique, full-length, high-quality transcript sequences that were subsequently mapped back to the human genome with ≥99% accuracy. In addition, we identified both known and novel fusion transcripts. To assess our results, we compared the predicted ORFs from the PacBio data against a published mass spectrometry dataset from the same cell line. 84% of the proteins identified with the Uniprot protein database were recovered by the PacBio predictions. Notably, 251 peptides solely matched to the PacBio generated ORFs and were entirely novel, including abundant cases of single amino acid polymorphisms, cassette exon splicing and potential alternative protein coding frames.
The PacBio Iso-Seq method produces high-quality, full-length transcripts of up to 10 kb and longer and has been used to annotate many important plant and animal genomes. We describe here the full Iso-Seq ecosystem that enables researchers to achieve high-quality genome annotations. The Iso-Seq Express workflow is a 1-day protocol that requires only 60-300 ng of total RNA and supports multiplexing of different tissues. Sequencing on a single SMRT Cell 8M on the Sequel II System produces up to 4 million full-length reads, sufficient to exhaustively characterize a whole transcriptome on the order of 15,000-17,000 genes with 100,000 or more transcripts. Most importantly, the method is supported by a maturing suite of official and community-developed tools. The SMRT Link Iso-Seq application outputs high-quality (>99% accurate), full-length transcript sequences that can optionally be mapped to a reference genome for a single SMRT Cell's worth of data in 6-9 hours. For example, the SQANTI2 tool classifies Iso-Seq transcripts against a reference annotation, filters potential library artifacts, and processes information from both long read-only and short read-based quantification. IsoPhase is a tool for identifying allele-specific isoform expression. Cogent has been used to process Iso-Seq transcripts in a genome-independent manner to assess genome assemblies. Finally, IsoAnnot is an up-and-coming tool for identifying differential isoform expression across different samples. We describe how these tools complement each other and provide guidelines for making the best use of Iso-Seq data to understand transcriptomes.
In this presentation, Elizabeth Tseng explains how PacBio’s full-length RNA Sequencing using the Iso-Seq method can characterize full-length transcripts without the need for computational transcript assembly. The Iso-Seq method is…
Haplotype phasing of genetic variants is important for interpretation of the maize genome, population genetic analysis, and functional genomic analysis of allelic activity. Accordingly, accurate methods for phasing full-length isoforms are essential for functional genomics study. In this study, we performed an isoform-level phasing study in maize, using two inbred lines and their reciprocal crosses, based on single-molecule full-length cDNA sequencing. To phase and analyze full-length transcripts between hybrids and parents, we developed a tool called IsoPhase. Using this tool, we validated the majority of SNPs called against matching short read data and identified cases of allele-specific, gene-level, and isoform-level expression. Our results revealed that maize parental and hybrid lines exhibit different splicing activities. After phasing 6,847 genes in two reciprocal hybrids using embryo, endosperm and root tissues, we annotated the SNPs and identified large-effect genes. In addition, based on single-molecule sequencing, we identified parent-of-origin isoforms in maize hybrids, different novel isoforms between maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase power and accuracy in studies of allelic expression.
Detection of transferable oxazolidinone resistance determinants in Enterococcus faecalis and Enterococcus faecium of swine origin in Sichuan Province, China.
The aim of this study was to detect the transferable oxazolidinone resistance determinants (cfr, optrA and poxtA) in E. faecalis and E. faecium of swine origin in Sichuan Province, China. A total of 158 enterococci strains (93 E. faecalis and 65 E. faecium) isolated from 25 large-scale swine farms were screened for the presence of cfr, optrA and poxtA by PCR. The genetic environments of cfr, optrA and poxtA were characterized by whole genome sequencing. Transfer of oxazolidinone resistance determinants was determined by conjugation or electrotransformation experiments. The transferable oxazolidinone resistance determinants, cfr, optrA and poxtA, were detected in zero, six, and one enterococci strains, respectively. The poxtA in one E. faecalis strain was located on a 37,990 bp plasmid, which co-harbored fexB, cat, tet(L) and tet(M), and could be conjugated to E. faecalis JH2-2. One E. faecalis strain harbored two different OptrA variants, including one variant with a single substitution, Q219H, which has not been reported previously. Two optrA-carrying plasmids, pC25-1, with a size of 45,581 bp, and pC54, with a size of 64,500 bp, shared a 40,494 bp identical region that contained genetic context IS1216E-fexA-optrA-erm(A)-IS1216E, which could be electrotransformed into Staphylococcus aureus. Four different chromosomal optrA gene clusters were found in five strains, in which optrA was associated with Tn554 or Tn558 that were inserted into the radC gene. Our study highlights the fact that mobile genetic elements, such as plasmids, IS1216E, Tn554 and Tn558, may facilitate the horizontal transmission of optrA or poxtA. Copyright © 2019. Published by Elsevier Ltd.
Acinetobacter baumannii is an important Gram-negative pathogen in hospital-related infections. However, treatment options for A. baumannii infections have become limited due to multidrug resistance. Bacterial virulence is often associated with capsule genes found in the K locus, many of which are essential for biosynthesis of the bacterial envelope. However, the roles of other genes in the K locus remain largely unknown. From an in vitro evolution experiment, we obtained an isolate of the virulent and multidrug-resistant A. baumannii strain MDR-ZJ06, called MDR-ZJ06M, which has an insertion by the ISAba16 transposon in gnaA (encoding UDP-N-acetylglucosamine C-6 dehydrogenase), a gene found in the K locus. The isolate showed an increased resistance toward tigecycline, whereas the MIC decreased in the case of carbapenems, cephalosporins, colistin, and minocycline. By using knockout and complementation experiments, we demonstrated that gnaA is important for the synthesis of lipooligosaccharide and capsular polysaccharide and that disruption of the gene affects the morphology, drug susceptibility, and virulence of the pathogen. Copyright © 2019 American Society for Microbiology.
White spot syndrome virus (WSSV) is a crustacean-infecting, double-stranded DNA virus and is the most serious viral pathogen in the global shrimp industry. WSSV is the sole recognized member of the family Nimaviridae, and the lack of genomic data on other nimaviruses has obscured the evolutionary history of WSSV. Here, we investigated the evolutionary history of WSSV by characterizing WSSV relatives hidden in host genomic data. We surveyed 14 host crustacean genomes and identified five novel nimaviral genomes. Comparative genomic analysis of Nimaviridae identified 28 “core genes” that are ubiquitously conserved in Nimaviridae; unexpected conservation of 13 uncharacterized proteins highlighted yet-unknown essential functions underlying the nimavirus replication cycle. The ancestral Nimaviridae gene set contained five baculoviral per os infectivity factor homologs and a sulfhydryl oxidase homolog, suggesting a shared phylogenetic origin of Nimaviridae and insect-associated double-stranded DNA viruses. Moreover, we show that novel gene acquisition and subsequent amplification reinforced the unique accessory gene repertoire of WSSV. Expansion of unique envelope protein and nonstructural virulence-associated genes may have been the key genomic event that made WSSV such a deadly pathogen. IMPORTANCE: WSSV is the deadliest viral pathogen threatening global shrimp aquaculture. The evolutionary history of WSSV has remained a mystery, because few WSSV relatives, or nimaviruses, had been reported. Our aim was to trace the history of WSSV using the genomes of novel nimaviruses hidden in host genome data. We demonstrate that WSSV emerged from a diverse family of crustacean-infecting large DNA viruses. By comparing the genomes of WSSV and its relatives, we show that WSSV possesses an expanded set of unique host-virus interaction-related genes. This extensive gene gain may have been the key genomic event that made WSSV such a deadly pathogen.
Moreover, conservation of insect-infecting virus protein homologs suggests a common phylogenetic origin of crustacean-infecting Nimaviridae and other insect-infecting DNA viruses. Our work redefines the previously poorly characterized crustacean virus family and reveals the ancient genomic events that preordained the emergence of a devastating shrimp pathogen. Copyright © 2019 American Society for Microbiology.
Complete Sequence of a Novel Multidrug-Resistant Pseudomonas putida Strain Carrying Two Copies of qnrVC6.
This study aimed to identify and characterize a novel multidrug-resistant Pseudomonas putida strain, Guangzhou-Ppu420, carrying two copies of qnrVC6, isolated from a hospital in Guangzhou, China, in 2012. Antimicrobial susceptibility was tested with the Vitek2™ Automated Susceptibility System and Etest™ strips, and whole-genome sequencing facilitated analysis of its multidrug resistance. The genome has a length of 6,031,212 bp and an average G + C content of 62.01%. A total of 5,421 open reading frames were identified, including eight 5S rRNA, seven 16S rRNA, seven 23S rRNA, and 76 tRNA genes. Importantly, two copies of the qnrVC6 gene with three ISCR1 elements nearby, a blaVIM-2-carrying integron In528, a novel gcu173-carrying integron In1348, and six antibiotic resistance genes were identified. This is the first identification of two copies of the qnrVC6 gene in a single P. putida isolate and of the class 1 integron In1348.
Mitochondrial DNA and their nuclear copies in the parasitic wasp Pteromalus puparum: A comparative analysis in Chalcidoidea.
Chalcidoidea (chalcidoid wasps) are an abundant and megadiverse insect group with both ecological and economic importance. Here we report a complete mitochondrial genome in Chalcidoidea from Pteromalus puparum (Pteromalidae). Eight tandem repeats followed by 6 reversed repeats were detected in its 3308 bp control region. This long and complex control region may explain failures to amplify and sequence complete mitochondrial genomes in some chalcidoids. In addition to 37 typical mitochondrial genes, an extra identical isoleucine tRNA (trnI) was detected at the opposite end of the control region. This recent mitochondrial gene duplication indicates that gene rearrangements in chalcidoids are ongoing. A comparison among available chalcidoid mitochondrial genomes reveals rapid gene order rearrangements overall and high protein substitution rates in most chalcidoid taxa. In addition, we identified 24 nuclear sequences of mitochondrial origin (NUMTs) in P. puparum, summing to 9989 bp, with 3617 bp of these NUMTs originating from mitochondrial coding regions. NUMT abundance in P. puparum is only one-twelfth of that in its relative, Nasonia vitripennis. Based on phylogenetic analysis, we provide evidence that a faster nuclear degradation rate contributes to the reduced NUMT numbers in P. puparum. Overall, our study shows unusually high rates of mitochondrial evolution and considerable variation in NUMT accumulation in Chalcidoidea. Copyright © 2018. Published by Elsevier B.V.
Genomic variation and strain-specific functional adaptation in the human gut microbiome during early life.
The human gut microbiome matures towards the adult composition during the first years of life and is implicated in early immune development. Here, we investigate the effects of microbial genomic diversity on gut microbiome development using integrated early childhood data sets collected in the DIABIMMUNE study in Finland, Estonia and Russian Karelia. We show that gut microbial diversity is associated with household location and linear growth of children. Single nucleotide polymorphism- and metagenomic assembly-based strain tracking revealed large and highly dynamic microbial pangenomes, especially in the genus Bacteroides, in which we identified evidence of variability deriving from Bacteroides-targeting bacteriophages. Our analyses revealed functional consequences of strain diversity; only 10% of Finnish infants harboured Bifidobacterium longum subsp. infantis, a subspecies specialized in human milk metabolism, whereas Russian infants commonly maintained a probiotic Bifidobacterium bifidum strain in infancy. Groups of bacteria contributing to diverse, characterized metabolic pathways converged to highly subject-specific configurations over the first two years of life. This longitudinal study extends the current view of early gut microbial community assembly based on strain-level genomic variation.
Our understanding of sequence variation in the HLA-DPB1 gene is largely restricted to the hypervariable antigen recognition domain (ARD) encoded by exon 2. Here, we employed a redundant sequencing strategy combining long-read and short-read data to accurately phase and characterise in full length the majority of common and well-documented (CWD) DPB1 alleles as well as alleles with an observed frequency of at least 0.0006% in our predominantly European sample set. We generated 664 DPB1 sequences, comprising 279 distinct allelic variants. This allows us to present the most comprehensive analysis to date of the nature and extent of DPB1 sequence variation. The full-length sequence analysis revealed the existence of two highly diverged allele clades. These clades correlate with the rs9277534 A→G variant, a known expression marker located in the 3′-UTR. The two clades are fully differentiated by 174 fixed polymorphisms throughout a 3.6 kb stretch at the 3′-end of DPB1. The region upstream of this differentiation zone is characterised by increasingly shared variation between the clades. The low-expression A clade comprises 59% of the distinct allelic sequences, including the three by far most frequent DPB1 alleles, DPB1*04:01, DPB1*02:01 and DPB1*04:02. Alleles in the A clade show reduced nucleotide diversity with an excess of rare variants when compared to the high-expression G clade. This pattern is consistent with a scenario of recent proliferation of A-clade alleles. The full-length characterisation of all but the rarest DPB1 alleles will benefit the application of NGS for DPB1 genotyping and provides a helpful framework for a deeper understanding of high- and low-expression alleles and their implications in the context of unrelated haematopoietic stem-cell transplantation. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Physiological properties and genetic analysis related to exopolysaccharide (EPS) production in the fresh-water unicellular cyanobacterium Aphanothece sacrum (Suizenji Nori).
Clonal phycoerythrin (PE)-rich and PE-poor strains of the unicellular freshwater cyanobacterium Aphanothece sacrum (Suringar) Okada (Suizenji Nori, in Japanese) were isolated from traditional open-air aquafarms in Japan. A. sacrum appeared to be oligotrophic on the basis of its growth characteristics. The optimum temperature for growth was around 20°C. Maximum growth and biomass increase at 20°C were obtained under light intensities between 40 and 80 µmol m⁻² s⁻¹ (fluorescent lamps, 12 h light/12 h dark cycles) for the PE-rich strain and between 40 and 120 µmol m⁻² s⁻¹ for the PE-poor strain of A. sacrum. Purified exopolysaccharide (EPS) of A. sacrum has a molecular weight of ca. 104 kDa with five major monosaccharides (glucose, xylose, rhamnose, galactose and mannose; ≥85 mol%). We also deciphered the whole genome sequence of the two strains of A. sacrum. The putative genes involved in the polymerization, chain length control, and export of EPS should contribute to understanding the biosynthetic process of their extremely high molecular weight EPS. The putative genes encoding Wzx-Wzy-Wzz- and Wza-Wzb-Wzc were conserved in the A. sacrum strains FPU1 and FPU3. This result suggests that the Wzy-dependent pathway participates in the EPS production of A. sacrum.
Trimethoprim/sulfamethoxazole is a synthetic antibiotic combination recommended for the treatment of complicated non-typhoidal Salmonella infections in humans. Resistance to trimethoprim/sulfamethoxazole is mediated by the acquisition of mobile genes, requiring both a dfr gene (trimethoprim resistance) and a sul gene (sulfamethoxazole resistance) for a clinical resistance phenotype (MIC ≥4/76 mg/L). In 2017, the CDC investigated a multistate outbreak caused by a Salmonella enterica serotype Heidelberg strain with trimethoprim/sulfamethoxazole resistance, in which sul genes but no known dfr genes were detected. To characterize and describe the molecular mechanism of trimethoprim resistance in a Salmonella Heidelberg outbreak isolate. Illumina sequencing data for one outbreak isolate revealed a 588 bp ORF encoding a putative dfr gene. This gene was cloned into Escherichia coli and resistance to trimethoprim was measured by broth dilution and Etest. Phylogenetic analysis of previously reported dfrA genes was performed using MEGA. Long-read sequencing was conducted to determine the context of the novel dfr gene. The novel dfr gene, named dfrA34, conferred trimethoprim resistance (MIC ≥32 mg/L) when cloned into E. coli. Based on predicted amino acid sequences, dfrA34 shares less than 50% identity with other known dfrA genes. The dfrA34 gene is located in a class 1 integron in a multiresistance region of an IncC plasmid, adjacent to a sul gene, thus conferring clinical trimethoprim/sulfamethoxazole resistance. Additionally, dfrA34 is associated with ISCR1, enabling easy transmission between other plasmids and bacterial strains.
Genome assembly and gene expression in the American black bear provides new insights into the renal response to hibernation.
The prevalence of chronic kidney disease (CKD) is rising worldwide and 10-15% of the global population currently suffers from CKD and its complications. Given the increasing prevalence of CKD there is an urgent need to find novel treatment options. The American black bear (Ursus americanus) copes with months of lowered kidney function and metabolism during hibernation without the devastating effects on metabolism and other consequences observed in humans. In a biomimetic approach to better understand kidney adaptations and physiology in hibernating black bears, we established a high-quality genome assembly. Subsequent RNA-Seq analysis of kidneys comparing gene expression profiles in black bears entering (late fall) and emerging (early spring) from hibernation identified 169 protein-coding genes that were differentially expressed. Of these, 101 genes were downregulated and 68 genes were upregulated after hibernation. Fold changes ranged from 1.8-fold downregulation (RTN4RL2) to 2.4-fold upregulation (CISH). Most notable was the upregulation of cytokine suppression genes (SOCS2, CISH, and SERPINC1) and the lack of increased expression of cytokines and genes involved in inflammation. The identification of these differences in gene expression in the black bear kidney may provide new insights in the prevention and treatment of CKD. © The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.