The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly, (publicly available at https://sourceforge.net/projects/pb-jelly/) automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome.
Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from an Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously.This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.
REBASE is a comprehensive and fully curated database of information about the components of restriction-modification (RM) systems. It contains fully referenced information about recognition and cleavage sites for both restriction enzymes and methyltransferases as well as commercial availability, methylation sensitivity, crystal and sequence data. All genomes that are completely sequenced are analyzed for RM system components, and with the advent of PacBio sequencing, the recognition sequences of DNA methyltransferases (MTases) are appearing rapidly. Thus, Type I and Type III systems can now be characterized in terms of recognition specificity merely by DNA sequencing. The contents of REBASE may be browsed from the web http://rebase.neb.com and selected compilations can be downloaded by FTP (ftp.neb.com). Monthly updates are also available via email. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences.
Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant-pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs, and the overall TALE diversity in Xanthomonas spp. is not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes by the use of PacBio sequencing. We present ‘AnnoTALE’, a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities.
The DNA Data Bank of Japan Center (DDBJ Center; http://www.ddbj.nig.ac.jp) maintains and provides public archival, retrieval and analytical services for biological information. Since October 2013, DDBJ Center has operated the Japanese Genotype-phenotype Archive (JGA) in collaboration with our partner institute, the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency. DDBJ Center provides the JGA database system which securely stores genotype and phenotype data collected from individuals whose consent agreements authorize data release only for specific research use. NBDC has established guidelines and policies for sharing human-derived data and reviews data submission and usage requests from researchers. In addition to the JGA project, DDBJ Center develops Semantic Web technologies for data integration and sharing in collaboration with the Database Center for Life Science. This paper describes the overview of the JGA project, updates to the DDBJ databases, and services for data retrieval, analysis and integration. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
GenBank(®) (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for 370 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or the NCBI Submission Portal. GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Nucleotide database, which links to related information such as taxonomy, genomes, protein sequences and structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. Recent updates include changes to policies regarding sequence identifiers, an improved 16S submission wizard, targeted loci studies, the ability to submit methylation and BioNano mapping files, and a database of anti-microbial resistance genes. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
The Vigna Genome Server, ‘VigGS’: A genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the azuki bean, Vigna angularis (Willd.) Ohwi & Ohashi.
The genus Vigna includes legume crops such as cowpea, mungbean and azuki bean, as well as >100 wild species. A number of the wild species are highly tolerant to severe environmental conditions including high-salinity, acid or alkaline soil; drought; flooding; and pests and diseases. These features of the genus Vigna make it a good target for investigation of genetic diversity in adaptation to stressful environments; however, a lack of genomic information has hindered such research in this genus. Here, we present a genome database of the genus Vigna, Vigna Genome Server (‘VigGS’, http://viggs.dna.affrc.go.jp), based on the recently sequenced azuki bean genome, which incorporates annotated exon-intron structures, along with evidence for transcripts and proteins, visualized in GBrowse. VigGS also facilitates user construction of multiple alignments between azuki bean genes and those of six related dicot species. In addition, the database displays sequence polymorphisms between azuki bean and its wild relatives and enables users to design primer sequences targeting any variant site. VigGS offers a simple keyword search in addition to sequence similarity searches using BLAST and BLAT. To incorporate up to date genomic information, VigGS automatically receives newly deposited mRNA sequences of pre-set species from the public database once a week. Users can refer to not only gene structures mapped on the azuki bean genome on GBrowse but also relevant literature of the genes. VigGS will contribute to genomic research into plant biotic and abiotic stresses and to the future development of new stress-tolerant crops.© The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: email@example.com.
MethSMRT: an integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing
DNA methylation is an important type of epigenetic modifications, where 5- methylcytosine (5mC), 6-methyadenine (6mA) and 4-methylcytosine (4mC) are the most common types. Previous efforts have been largely focused on 5mC, providing invaluable insights into epigenetic regulation through DNA methylation. Recently developed single-molecule real-time (SMRT) sequencing technology provides a unique opportunity to detect the less studied DNA 6mA and 4mC modifications at single-nucleotide resolution. With a rapidly increased amount of SMRT sequencing data generated, there is an emerging demand to systematically explore DNA 6mA and 4mC modifications from these data sets. MethSMRT is the first resource hosting DNA 6mA and 4mC methylomes. All the data sets were processed using the same analysis pipeline with the same quality control. The current version of the database provides a platform to store, browse, search and download epigenome-wide methylation profiles of 156 species, including seven eukaryotes such as Arabidopsis, C. elegans, Drosophila, mouse and yeast, as well as 149 prokaryotes. It also offers a genome browser to visualize the methylation sites and related information such as single nucleotide polymorphisms (SNP) and genomic annotation. Furthermore, the database provides a quick summary of statistics of methylome of 6mA and 4mC and predicted methylation motifs for each species. MethSMRT is publicly available at http://sysbio.sysu.edu.cn/methsmrt/ without use restriction.