July 7, 2019

Extensive sequencing of seven human genomes to characterize benchmark reference materials.

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


July 7, 2019

Atypical Salmonella enterica serovars in murine and human infection models: Is it time to reassess our approach to the study of salmonellosis?

Nontyphoidal Salmonella species are globally disseminated pathogens and the predominant cause of gastroenteritis. The pathogenesis of salmonellosis has been extensively studied using in vivo murine models and cell lines typically challenged with Salmonella Typhimurium. Although serovars Enteritidis and Typhimurium are responsible for most human infections reported to the CDC, several other serovars also contribute to clinical cases of salmonellosis. Despite their epidemiological importance, little is known about their infection phenotypes. Here, we report the virulence characteristics and genomes of 10 atypical S. enterica serovars linked to multistate foodborne outbreaks in the United States. We show that the murine RAW 264.7 macrophage model of infection is unsuitable for inferring human-relevant differences in nontyphoidal Salmonella infections, whereas differentiated human THP-1 macrophages allowed these isolates to be further characterised in a more relevant, human context.


July 7, 2019

A roadmap for gene system development in Clostridium.

Clostridium species are both heroes and villains. Some cause serious human and animal diseases, those present in the gut microbiota generally contribute to health and wellbeing, while others represent useful industrial chassis for the production of chemicals and fuels. To understand, counter or exploit, there is a fundamental requirement for effective systems that may be used for directed or random genome modifications. We have formulated a simple roadmap whereby the necessary gene systems may be developed and deployed. At its heart is the use of ‘pseudo-suicide’ vectors and the creation of a pyrE mutant (a uracil auxotroph), initially aided by ClosTron technology, but ultimately made using a special form of allelic exchange termed ACE (Allele-Coupled Exchange). All mutants, regardless of the mutagen employed, are made in this host. This is because through the use of ACE vectors, mutants can be rapidly complemented concomitant with correction of the pyrE allele and restoration of uracil prototrophy. This avoids the phenotypic effects frequently observed with high copy number plasmids and dispenses with the need to add antibiotic to ensure plasmid retention. Once available, the pyrE host may be used to stably insert all manner of application-specific modules. Examples include a sigma factor to allow deployment of a mariner transposon, hydrolases involved in biomass deconstruction and therapeutic genes in cancer delivery vehicles. To date, provided DNA transfer is obtained, we have not encountered any clostridial species where this technology cannot be applied. These include Clostridium difficile, Clostridium acetobutylicum, Clostridium beijerinckii, Clostridium botulinum, Clostridium perfringens, Clostridium sporogenes, Clostridium pasteurianum, Clostridium ljungdahlii, Clostridium autoethanogenum and even Geobacillus thermoglucosidasius. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.


July 7, 2019

Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads.

Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15-40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm Sparc to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path which approximates the most likely genome sequence is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time. Availability. The source code is available for download at https://github.com/yechengxi/Sparc.
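The heaviest-path idea in the Results can be illustrated with a toy sketch. It is not Sparc's implementation: it assumes the reads are already placed on a common coordinate system without indels, uses position-specific k-mers as graph nodes, and stands in for the sparsity-induced reweighting with a flat penalty subtracted from every edge count. The names `toy_consensus`, `k`, and `penalty` are hypothetical.

```python
from collections import defaultdict

def toy_consensus(reads, k=2, penalty=1):
    """Toy heaviest-path consensus over a position-specific k-mer graph.

    reads: list of (offset, sequence) pairs on a shared coordinate system
    (assumption: no insertions or deletions).
    """
    weight = defaultdict(int)   # edge -> number of reads supporting it
    nodes = set()
    for offset, seq in reads:
        for i in range(len(seq) - k):
            u = (offset + i, seq[i:i + k])
            v = (offset + i + 1, seq[i + 1:i + 1 + k])
            weight[(u, v)] += 1
            nodes.update((u, v))

    # Dynamic programming over edges ordered by source position. Every edge
    # goes from position p to p + 1, so this is a valid topological order.
    best = {n: 0 for n in nodes}
    back = {}
    for u, v in sorted(weight, key=lambda e: e[0][0]):
        w = weight[(u, v)] - penalty   # weakly supported edges drop to <= 0
        if best[u] + w > best[v]:
            best[v] = best[u] + w
            back[v] = u

    # Trace back from the best-scoring node and spell out the sequence.
    node = max(best, key=best.get)
    path = [node]
    while path[-1] in back:
        path.append(back[path[-1]])
    path.reverse()
    return path[0][1] + "".join(kmer[-1] for _, kmer in path[1:])

reads = [
    (0, "ACGTACGGTCA"),
    (0, "ACGTACGGTCA"),
    (0, "ACGAACGGTCA"),  # one substitution error
    (2, "GTACGGTCA"),
]
print(toy_consensus(reads))
```

Subtracting the penalty means an edge seen in only one read contributes nothing, so the path through the erroneous k-mers cannot outscore the well-supported one.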


July 7, 2019

Genomics-informed isolation and characterization of a symbiotic Nanoarchaeota system from a terrestrial geothermal environment.

Biological features can be inferred, based on genomic data, for many microbial lineages that remain uncultured. However, cultivation is important for characterizing an organism’s physiology and testing its genome-encoded potential. Here we use single-cell genomics to infer cultivation conditions for the isolation of an ectosymbiotic Nanoarchaeota (‘Nanopusillus acidilobi’) and its host (Acidilobus, a crenarchaeote) from a terrestrial geothermal environment. The cells of ‘Nanopusillus’ are among the smallest known cellular organisms (100-300 nm). They appear to have a complete genetic information processing machinery, but lack almost all primary biosynthetic functions as well as respiration and ATP synthesis. Genomic and proteomic comparison with its distant relative, the marine Nanoarchaeum equitans, illustrates an ancient, common evolutionary history of adaptation of the Nanoarchaeota to ectosymbiosis, so far unique among the Archaea.


July 7, 2019

Complete genome of Nitrosospira briensis C-128, an ammonia-oxidizing bacterium from agricultural soil.

Nitrosospira briensis C-128 is an ammonia-oxidizing bacterium isolated from an acid agricultural soil. N. briensis C-128 was sequenced with PacBio RS technologies at the DOE-Joint Genome Institute through their Community Science Program (2010). The high-quality finished genome contains one chromosome of 3.21 Mb and no plasmids. We identified 3073 gene models, 3018 of which are protein coding. The two-way average nucleotide identity between the chromosomes of Nitrosospira multiformis ATCC 25196 and Nitrosospira briensis C-128 was found to be 77.2 %. Multiple copies of modules encoding chemolithotrophic metabolism were identified in their genomic context. The gene inventory supports chemolithotrophic metabolism with implications for function in soil environments.


July 7, 2019

Challenges, solutions, and quality metrics of personal genome assembly in advancing precision medicine.

Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, until now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging “third generation sequencing” technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune-response, tailoring drug therapy and detecting tumors. We believe precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.


July 7, 2019

Genomic characterization of the Atlantic cod sex-locus.

A variety of sex determination mechanisms can be observed in evolutionary divergent teleosts. Sex determination is genetic in Atlantic cod (Gadus morhua); however, the genomic location or size of its sex-locus is unknown. Here, we characterize the sex-locus of Atlantic cod using whole genome sequence (WGS) data of 227 wild-caught specimens. Analyzing more than 55 million polymorphic loci, we identify 166 loci that are associated with sex. These loci are located in six distinct regions on five different linkage groups (LG) in the genome. The largest of these regions, an approximately 55 kb region on LG11, contains the majority of genotypes that segregate closely according to an XX-XY system. Genotypes in this region can be used to genetically determine sex, whereas those in the other regions are inconsistently sex-linked. The identified region on LG11 and its surrounding genes have no clear sequence homology with genes or regulatory elements associated with sex-determination or differentiation in other species. The functionality of this sex-locus therefore remains unknown. The WGS strategy used here proved adequate for detecting the small regions associated with sex in this species. Our results highlight the evolutionary flexibility in genomic architecture underlying teleost sex-determination and allow practical applications to genetically sex Atlantic cod.


July 7, 2019

DBG2OLC: Efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies.

The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
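Design principle (i) can be illustrated with a minimal sketch, assuming each long read has already been mapped to preassembled NGS contigs. It is not the DBG2OLC implementation: the read is reduced to the ordered list of contig identifiers it hits, and overlaps are then found by comparing these short lists rather than the raw bases. Orientation, mapping errors, and the structural-error checks of principle (ii) are ignored, and the names `compress` and `suffix_prefix_overlap` and the contig IDs are hypothetical.

```python
def compress(read_hits):
    """Reduce a long read to the ordered list of contig IDs it aligns to.

    read_hits: list of (start_position_on_read, contig_id) pairs.
    """
    return [contig for _, contig in sorted(read_hits)]

def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of list a that equals a prefix of list b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

# Two long reads, each represented only by the contigs it touches.
read1 = compress([(0, "c1"), (5200, "c7"), (9100, "c3")])
read2 = compress([(300, "c7"), (4000, "c3"), (8800, "c9")])

# The reads share the contig run c7, c3, so they overlap by two identifiers.
print(suffix_prefix_overlap(read1, read2))
```

Comparing a handful of identifiers per read is far cheaper than aligning thousands of error-prone bases, which is the efficiency the abstract attributes to the compact representation.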


July 7, 2019

Improved hybrid de novo genome assembly of domesticated apple (Malus x domestica).

Domesticated apple (Malus × domestica Borkh) is a popular temperate fruit with high nutrient levels and diverse flavors. In 2012, global apple production accounted for at least one tenth of all harvested fruits. A high-quality apple genome assembly is crucial for the selection and breeding of new cultivars. Currently, a single reference genome is available for apple, assembled from 16.9× genome coverage short reads via Sanger and 454 sequencing technologies. Although a useful resource, this assembly covers only ~89 % of the non-repetitive portion of the genome, and has a relatively short (16.7 kb) contig N50 length. These downsides make it difficult to apply this reference in transcriptive or whole-genome re-sequencing analyses. Here we present an improved hybrid de novo genomic assembly of apple (Golden Delicious), which was obtained from 76 Gb (~102× genome coverage) Illumina HiSeq data and 21.7 Gb (~29× genome coverage) PacBio data. The final draft genome is approximately 632.4 Mb, representing ~90 % of the estimated genome. The contig N50 size is 111,619 bp, representing a 7-fold improvement. Further annotation analyses predicted 53,922 protein-coding genes and 2,765 non-coding RNA genes. The new apple genome assembly will serve as a valuable resource for investigating complex apple traits at the genomic level. It is not only suitable for genome editing and gene cloning, but also for RNA-seq and whole-genome re-sequencing studies.
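For readers unfamiliar with the contig N50 statistic quoted above, the following sketch computes it under the standard definition: the largest length L such that contigs of length at least L together contain at least half of the assembled bases. The function name `n50` and the example lengths are illustrative and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

# Total is 20; the two longest contigs (6 + 5 = 11) already cover half.
print(n50([2, 3, 4, 5, 6]))  # 5
```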


July 7, 2019

Draft genome sequences of Armillaria fuscipes, Ceratocystiopsis minuta, Ceratocystis adiposa, Endoconidiophora laricicola, E. polonica and Penicillium freii DAOMC 242723.

The genomes of Armillaria fuscipes, Ceratocystiopsis minuta, Ceratocystis adiposa, Endoconidiophora laricicola, E. polonica, and Penicillium freii DAOMC 242723 are presented in this genome announcement. These six genomes are from plant pathogens and otherwise economically important fungal species. The genome sizes range from 21 Mb in the case of Ceratocystiopsis minuta to 58 Mb for the basidiomycete Armillaria fuscipes. These genomes include the first reports of genomes for the genus Endoconidiophora. The availability of these genome data will provide opportunities to resolve longstanding questions regarding the taxonomy of species in these genera. In addition, these genome sequences, through comparative studies with closely related organisms, will increase our understanding of how these pathogens cause disease.


July 7, 2019

Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree.

Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from high-throughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the difference between the insert size inferred from aligned paired-end reads and the actual insert size. Although the accuracy of repeat numbers estimated with the former approaches is high, the size of target STR regions is limited to the length of sequence reads. On the other hand, the latter approaches can handle STR regions longer than the length of sequence reads. However, repeat numbers estimated with the latter approaches are less accurate than those with the former approaches. We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by handling coalescent trees as hidden variables, and the summation of the hidden variables is taken on coalescent trees sampled based on phased genotypes located around a target STR region with Markov chain Monte Carlo. In the sampled coalescent trees, repeat number information from insert size data is propagated, and more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads. For finding the repeat numbers maximizing the likelihood of the model on the estimation of repeat numbers, we proposed a state-of-the-art belief propagation algorithm on sampled coalescent trees. We verified the effectiveness of the proposed approach from the comparison with existing methods by using simulation datasets and real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.
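The insert-size family of approaches described above can be sketched as a simple per-individual point estimate. This is not the coalescentSTR model: it only compares the library's mean insert size with the mean distance observed when pairs spanning the STR are aligned to the reference, and converts the shortfall into repeat units. All parameter names are hypothetical, and a single homozygous allele is assumed.

```python
from statistics import mean

def estimate_repeat_number(ref_repeats, motif_len, library_insert_mean, observed_inserts):
    """Naive repeat-number estimate from paired-end insert sizes.

    If the sample carries more repeat units than the reference, pairs that
    span the STR appear closer together on the reference than the library's
    true insert size, and the shortfall measures the extra repeat sequence.
    """
    shortfall = library_insert_mean - mean(observed_inserts)
    return ref_repeats + round(shortfall / motif_len)

# Reference has 10 copies of a 4 bp motif; spanning pairs map about 20 bp
# closer than the 500 bp library mean, suggesting 5 additional copies.
print(estimate_repeat_number(10, 4, 500, [478, 482, 480]))  # 15
```

Because insert sizes vary from fragment to fragment, an estimate of this kind is noisy for any one individual, which is the weakness the abstract sets out to address by sharing information across individuals.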


July 7, 2019

Genomic and transcriptomic analyses of the tangerine pathotype of Alternaria alternata in response to oxidative stress.

The tangerine pathotype of Alternaria alternata produces the A. citri toxin (ACT) and is the causal agent of citrus brown spot that results in significant yield losses worldwide. Both the production of ACT and the ability to detoxify reactive oxygen species (ROS) are required for A. alternata pathogenicity in citrus. In this study, we report the 34.41 Mb genome sequence of strain Z7 of the tangerine pathotype of A. alternata. The host-selective ACT gene cluster in strain Z7 was identified, which included 25 genes with 19 of them not reported previously. Of these, 10 genes were present only in the tangerine pathotype, representing the most likely candidate genes for this pathotype specialization. A transcriptome analysis of the global effects of H2O2 on gene expression revealed 1108 up-regulated and 498 down-regulated genes. Expressions of those genes encoding catalase, peroxiredoxin, thioredoxin and glutathione were highly induced. Genes encoding several protein families including kinases, transcription factors, transporters, cytochrome P450, ubiquitin and heat shock proteins were found associated with adaptation to oxidative stress. Our data not only revealed the molecular basis of ACT biosynthesis but also provided new insights into the potential pathways by which the phytopathogen A. alternata copes with oxidative stress.


July 7, 2019

Building two indica rice reference genomes with PacBio long-read and Illumina paired-end sequencing data.

Over the past 30 years, we have performed many fundamental studies on two Oryza sativa subsp. indica varieties, Zhenshan 97 (ZS97) and Minghui 63 (MH63). To improve the resolution of many of these investigations, we generated two reference-quality genome assemblies using the most advanced sequencing technologies. Using PacBio SMRT technology, we produced over 108 (ZS97) and 174 (MH63) Gb of raw sequence data from 166 (ZS97) and 209 (MH63) pools of BAC clones, and generated ~97 (ZS97) and ~74 (MH63) Gb of paired-end whole-genome shotgun (WGS) sequence data with Illumina sequencing technology. With these data, we successfully assembled two platinum standard reference genomes that have been publicly released. Here we provide the full sets of raw data used to generate these two reference genome assemblies. These data sets can be used to test new programs for better genome assembly and annotation, aid in the discovery of new insights into genome structure, function, and evolution, and help to provide essential support to biological research in general.


July 7, 2019

The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection.

The Brassica genus encompasses three diploid and three allopolyploid genomes, but a clear understanding of the evolution of agriculturally important traits via polyploidy is lacking. We assembled an allopolyploid Brassica juncea genome by shotgun and single-molecule reads integrated to genomic and genetic maps. We discovered that the A subgenomes of B. juncea and Brassica napus each had independent origins. Results suggested that A subgenomes of B. juncea were of monophyletic origin and evolved into vegetable-use and oil-use subvarieties. Homoeolog expression dominance occurs between subgenomes of allopolyploid B. juncea, in which differentially expressed genes display more selection potential than neutral genes. Homoeolog expression dominance in B. juncea has facilitated selection of glucosinolate and lipid metabolism genes in subvarieties used as vegetables and for oil production. These homoeolog expression dominance relationships among Brassicaceae genomes have contributed to selection response, predicting the directional effects of selection in a polyploid crop genome.

