Menu
September 22, 2019

Using XCAVATOR and EXCAVATOR2 to Identify CNVs from WGS, WES, and TS Data.

Copy Number Variants (CNVs) are structural rearrangements contributing to phenotypic variation but also associated with many disease states. In recent years, the identification of CNVs from high-throughput sequencing experiments has become a common practice for both research and clinical purposes. Several computational methods have been developed so far. In this unit, we describe and give instructions on how to run two read count-based tools, XCAVATOR and EXCAVATOR2, which are tailored for the detection of both germline and somatic CNVs from different sequencing experiments (whole-genome, whole-exome, and targeted) in various disease contexts and population genetic studies. © 2018 by John Wiley & Sons, Inc.© 2018 John Wiley & Sons, Inc.


September 22, 2019

Genome-based population structure analysis of the strawberry plant pathogen Xanthomonas fragariae reveals two distinct groups that evolved independently before its species description.

Xanthomonas fragariae is a quarantine organism in Europe, causing angular leaf spots on strawberry plants. It is spreading worldwide in strawberry-producing regions due to import of plant material through trade and human activities. In order to resolve the population structure at the strain level, we have employed high-resolution molecular typing tools on a comprehensive strain collection representing global and temporal distribution of the pathogen. Clustered regularly interspaced short palindromic repeat regions (CRISPRs) and variable number of tandem repeats (VNTRs) were identified within the reference genome of X. fragariae LMG 25863 as a potential source of variation. Strains from our collection were whole-genome sequenced and used in order to identify variable spacers and repeats for discriminative purpose. CRISPR spacer analysis and multiple-locus VNTR analysis (MLVA) displayed a congruent population structure, in which two major groups and a total of four subgroups were revealed. The two main groups were genetically separated before the first X. fragariae isolate was described and are potentially responsible for the worldwide expansion of the bacterial disease. Three primer sets were designed for discriminating CRISPR-associated markers in order to streamline group determination of novel isolates. Overall, this study describes typing methods to discriminate strains and monitor the pathogen population structure, more especially in the view of a new outbreak of the pathogen.


September 22, 2019

MultiMotifMaker: a multi-thread tool for identifying DNA methylation motifs from Pacbio reads.

The methylation of DNA is important mechanism to control biological processes. Recently, the Pacbio SMRT technology provides a new way to identify base methylation in the genome. MotifMaker is a tool developed by Pacbio for discovering DNA methylation motifs from methylated DNA sequences. However, MotifMaker is single-threaded and computational expensive for identifying methylation motifs from large genomes. Here, we present an efficient motif finding algorithm (MultiMotifMaker) by implementing multi threads of the MotifMaker. The MultiMotifMaker, speeds up the motif search about 8-9 times on a 32 core computer comparing to MotifMaker. MultiMotifMaker makes it possible to identify methylation motifs from Pacbio reads for large genomes.


September 22, 2019

Nine draft genome sequences of Claviceps purpurea s.lat., including C. arundinis, C. humidiphila, and C. cf. spartinae, pseudomolecules for the pitch canker pathogen Fusarium circinatum, draft genome of Davidsoniella eucalypti, Grosmannia galeiformis, Quambalaria eucalypti, and Teratosphaeria destructans.

This genome announcement includes draft genomes from Claviceps purpurea s.lat., including C. arundinis, C. humidiphila and C. cf. spartinae. The draft genomes of Davidsoniella eucalypti, Quambalaria eucalypti and Teratosphaeria destructans, all three important eucalyptus pathogens, are presented. The insect associate Grosmannia galeiformis is also described. The pine pathogen genome of Fusarium circinatum has been assembled into pseudomolecules, based on additional sequence data and by harnessing the known synteny within the Fusarium fujikuroi species complex. This new assembly of the F. circinatum genome provides 12 pseudomolecules that correspond to the haploid chromosome number of F. circinatum. These are comparable to other chromosomal assemblies within the FFSC and will enable more robust genomic comparisons within this species complex.


September 22, 2019

Validation of Genomic Structural Variants Through Long Sequencing Technologies.

Although numerous algorithms have been developed to identify large chromosomal rearrangements (i.e., genomic structural variants, SVs), there remains a dearth of approaches to evaluate their results. This is significant, as the accurate identification of SVs is still an outstanding problem whereby no single algorithm has been shown to be able to achieve high sensitivity and specificity across different classes of SVs. The method introduced in this chapter, VaPoR, is specifically designed to evaluate the accuracy of SV predictions using third-generation long sequences. This method uses a recurrence approach and collects direct evidence from raw reads thus avoiding computationally costly whole genome assembly. This chapter would describe in detail as how to apply this tool onto different data types.


September 22, 2019

PBHoover and CigarRoller: a method for confident haploid variant calling on Pacific Biosciences data and its application to heterogeneous population analysis

Motivation: Single Molecule Real-Time (SMRT) sequencing has important and underutilized advantages that amplification-based platforms lack. Lack of systematic error (e.g. GC-bias), complete de novo assembly (including large repetitive regions) without scaffolding, can be mentioned. SMRT sequencing, however suffers from high random error rate and low sequencing depth (older chemistries). Here, we introduce PBHoover, software that uses a heuristic calling algorithm in order to make base calls with high certainty in low coverage regions. This software is also capable of mixed population detection with high sensitivity. PBHoovertextquoterights CigarRoller attachment improves sequencing depth in low-coverage regions through CIGAR-string correction. Results: We tested both modules on 348 M.tuberculosis clinical isolates sequenced on C1 or C2 chemistries. On average, CigarRoller improved percentage of usable read count from 68.9% to 99.98% in C1 runs and from 50% to 99% in C2 runs. Using the greater depth provided by CigarRoller, PBHoover was able to make base and variant calls 99.95% concordant with Sanger calls (QV33). PBHoover also detected antibiotic-resistant subpopulations that went undetected by Sanger. Using C1 chemistry, subpopulations as small as 9% of the total colony can be detected by PBHoover. This provides the most sensitive amplification-free molecular method for heterogeneity analysis and is in line with phenotypic methodstextquoteright sensitivity. This sensitivity significantly improves with the greater depth and lower error rate of the newer chemistries. Availability and Implementation: Executables are freely available under GNU GPL v3+ at http://www.gitlab.com/LPCDRP/pbhoover and http://www.gitlab.com/LPCDRP/CigarRoller. PBHoover is also available on bioconda: https://anaconda.org/bioconda/pbhoover.


September 22, 2019

Potential survival and pathogenesis of a novel strain, Vibrio parahaemolyticus FORC_022, isolated from a soy sauce marinated crab by genome and transcriptome analyses.

Vibrio parahaemolyticus can cause gastrointestinal illness through consumption of seafood. Despite frequent food-borne outbreaks of V. parahaemolyticus, only 19 strains have subjected to complete whole-genome analysis. In this study, a novel strain of V. parahaemolyticus, designated FORC_022 (Food-borne pathogen Omics Research Center_022), was isolated from soy sauce marinated crabs, and its genome and transcriptome were analyzed to elucidate the pathogenic mechanisms. FORC_022 did not include major virulence factors of thermostable direct hemolysin (tdh) and TDH-related hemolysin (trh). However, FORC_022 showed high cytotoxicity and had several V. parahaemolyticus islands (VPaIs) and other virulence factors, such as various secretion systems (types I, II, III, IV, and VI), in comparative genome analysis with CDC_K4557 (the most similar strain) and RIMD2210633 (genome island marker strain). FORC_022 harbored additional virulence genes, including accessory cholera enterotoxin, zona occludens toxin, and tight adhesion (tad) locus, compared with CDC_K4557. In addition, O3 serotype specific gene and the marker gene of pandemic O3:K6 serotype (toxRS) were detected in FORC_022. The expressions levels of genes involved in adherence and carbohydrate transporter were high, whereas those of genes involved in motility, arginine biosynthesis, and proline metabolism were low after exposure to crabs. Moreover, the virulence factors of the type III secretion system, tad locus, and thermolabile hemolysin were overexpressed. Therefore, the risk of foodborne-illness may be high following consumption of FORC_022 contaminated crab. These results provided molecular information regarding the survival and pathogenesis of V. parahaemolyticus FORC_022 strain in contaminated crab and may have applications in food safety.


September 22, 2019

Sequencing of Panax notoginseng genome reveals genes involved in disease resistance and ginsenoside biosynthesis

Background: Panax notoginseng is a traditional Chinese herb with high medicinal and economic value. There has been considerable research on the pharmacological activities of ginsenosides contained in Panax spp.; however, very little is known about the ginsenoside biosynthetic pathway. Results: We reported the first de novo genome of 2.36 Gb of sequences from P. notoginseng with 35,451 protein-encoding genes. Compared to other plants, we found notable gene family contraction of disease-resistance genes in P. notoginseng, but notable expansion for several ATP-binding cassette (ABC) transporter subfamilies, such as the Gpdr subfamily, indicating that ABCs might be an additional mechanism for the plant to cope with biotic stress. Combining eight transcriptomes of roots and aerial parts, we identified several key genes, their transcription factor binding sites and all their family members involved in the synthesis pathway of ginsenosides in P. notoginseng, including dammarenediol synthase, CYP716 and UGT71. Conclusions: The complete genome analysis of P. notoginseng, the first in genus Panax, will serve as an important reference sequence for improving breeding and cultivation of this important nutraceutical and medicinal but vulnerable plant species.


September 22, 2019

A chromosome scale assembly of the model desiccation tolerant grass Oropetium thomaeum

Oropetium thomaeum is an emerging model for desiccation tolerance and genome size evolution in grasses. A high-quality draft genome of Oropetium was recently sequenced, but the lack of a chromosome scale assembly has hindered comparative analyses and downstream functional genomics. Here, we reassembled Oropetium, and anchored the genome into ten chromosomes using Hi-C based chromatin interactions. A combination of high-resolution RNAseq data and homology-based gene prediction identified thousands of new, conserved gene models that were absent from the V1 assembly. This includes thousands of new genes with high expression across a desiccation timecourse. The sorghum and Oropetium genomes have a surprising degree of chromosome-level collinearity, and several chromosome pairs have near perfect synteny. Other chromosomes are collinear in the gene rich chromosome arms but have experienced pericentric translocations. Together, these resources will be useful for the grass comparative genomic community and further establish Oropetium as a model resurrection plant.


September 22, 2019

Integrating long-range connectivity information into de Bruijn graphs.

The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data.We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes.Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex.Supplementary data are available at Bioinformatics online.


September 22, 2019

Periodic variation of mutation rates in bacterial genomes associated with replication timing

The causes and consequences of spatiotemporal variation in mutation rates remain to be explored in nearly all organisms. Here we examine relationships between local mutation rates and replication timing in three bacterial species whose genomes have multiple chromosomes: Vibrio fischeri, Vibrio cholerae, and Burkholderia cenocepacia Following five mutation accumulation experiments with these bacteria conducted in the near absence of natural selection, the genomes of clones from each lineage were sequenced and analyzed to identify variation in mutation rates and spectra. In lineages lacking mismatch repair, base substitution mutation rates vary in a mirrored wave-like pattern on opposing replichores of the large chromosomes of V. fischeri and V. cholerae, where concurrently replicated regions experience similar base substitution mutation rates. The base substitution mutation rates on the small chromosome are less variable in both species but occur at similar rates to those in the concurrently replicated regions of the large chromosome. Neither nucleotide composition nor frequency of nucleotide motifs differed among regions experiencing high and low base substitution rates, which along with the inferred ~800-kb wave period suggests that the source of the periodicity is not sequence specific but rather a systematic process related to the cell cycle. These results support the notion that base substitution mutation rates are likely to vary systematically across many bacterial genomes, which exposes certain genes to elevated deleterious mutational load.IMPORTANCE That mutation rates vary within bacterial genomes is well known, but the detailed study of these biases has been made possible only recently with contemporary sequencing methods. We applied these methods to understand how bacterial genomes with multiple chromosomes, like those of Vibrio and Burkholderia, might experience heterogeneous mutation rates because of their unusual replication and the greater genetic diversity found on smaller chromosomes. This study captured thousands of mutations and revealed wave-like rate variation that is synchronized with replication timing and not explained by sequence context. The scale of this rate variation over hundreds of kilobases of DNA strongly suggests that a temporally regulated cellular process may generate wave-like variation in mutation risk. These findings add to our understanding of how mutation risk is distributed across bacterial and likely also eukaryotic genomes, owing to their highly conserved replication and repair machinery. Copyright © 2018 Dillon et al.


September 22, 2019

A reference genome of the Chinese hamster based on a hybrid assembly strategy.

Accurate and complete genome sequences are essential in biotechnology to facilitate genome-based cell engineering efforts. The current genome assemblies for Cricetulus griseus, the Chinese hamster, are fragmented and replete with gap sequences and misassemblies, consistent with most short-read-based assemblies. Here, we completely resequenced C. griseus using single molecule real time sequencing and merged this with Illumina-based assemblies. This generated a more contiguous and complete genome assembly than either technology alone, reducing the number of scaffolds by >28-fold, with 90% of the sequence in the 122 longest scaffolds. Most genes are now found in single scaffolds, including up- and downstream regulatory elements, enabling improved study of noncoding regions. With >95% of the gap sequence filled, important Chinese hamster ovary cell mutations have been detected in draft assembly gaps. This new assembly will be an invaluable resource for continued basic and pharmaceutical research.© 2018 The Authors. Biotechnology and Bioengineering Published by Wiley Periodicals, Inc.


September 22, 2019

Analysis of the draft genome of the red seaweed Gracilariopsis chorda provides insights into genome size evolution in Rhodophyta.

Red algae (Rhodophyta) underwent two phases of large-scale genome reduction during their early evolution. The red seaweeds did not attain genome sizes or gene inventories typical of other multicellular eukaryotes. We generated a high-quality 92.1 Mb draft genome assembly from the red seaweed Gracilariopsis chorda, including methylation and small (s)RNA data. We analyzed these and other Archaeplastida genomes to address three questions: 1) What is the role of repeats and transposable elements (TEs) in explaining Rhodophyta genome size variation, 2) what is the history of genome duplication and gene family expansion/reduction in these taxa, and 3) is there evidence for TE suppression in red algae? We find that the number of predicted genes in red algae is relatively small (4,803-13,125 genes), particularly when compared with land plants, with no evidence of polyploidization. Genome size variation is primarily explained by TE expansion with the red seaweeds having the largest genomes. Long terminal repeat elements and DNA repeats are the major contributors to genome size growth. About 8.3% of the G. chorda genome undergoes cytosine methylation among gene bodies, promoters, and TEs, and 71.5% of TEs contain methylated-DNA with 57% of these regions associated with sRNAs. These latter results suggest a role for TE-associated sRNAs in RNA-dependent DNA methylation to facilitate silencing. We postulate that the evolution of genome size in red algae is the result of the combined action of TE spread and the concomitant emergence of its epigenetic suppression, together with other important factors such as changes in population size.


September 22, 2019

Transcriptome analysis of Neisseria gonorrhoeae during natural infection reveals differential expression of antibiotic resistance determinants between men and women.

Neisseria gonorrhoeae is a bacterial pathogen responsible for the sexually transmitted infection gonorrhea. Emergence of antimicrobial resistance (AMR) of N. gonorrhoeae worldwide has resulted in limited therapeutic choices for this infection. Men who seek treatment often have symptomatic urethritis; in contrast, gonococcal cervicitis in women is usually minimally symptomatic, but may progress to pelvic inflammatory disease. Previously, we reported the first analysis of gonococcal transcriptome expression determined in secretions from women with cervical infection. Here, we defined gonococcal global transcriptional responses in urethral specimens from men with symptomatic urethritis and compared these with transcriptional responses in specimens obtained from women with cervical infections and in vitro-grown N. gonorrhoeae isolates. This is the first comprehensive comparison of gonococcal gene expression in infected men and women. RNA sequencing analysis revealed that 9.4% of gonococcal genes showed increased expression exclusively in men and included genes involved in host immune cell interactions, while 4.3% showed increased expression exclusively in women and included phage-associated genes. Infected men and women displayed comparable antibiotic-resistant genotypes and in vitro phenotypes, but a 4-fold higher expression of the Mtr efflux pump-related genes was observed in men. These results suggest that expression of AMR genes is programed genotypically and also driven by sex-specific environments. Collectively, our results indicate that distinct N. gonorrhoeae gene expression signatures are detected during genital infection in men and women. We propose that therapeutic strategies could target sex-specific differences in expression of antibiotic resistance genes.IMPORTANCE Recent emergence of antimicrobial resistance of Neisseria gonorrhoeae worldwide has resulted in limited therapeutic choices for treatment of infections caused by this organism. We performed global transcriptomic analysis of N. gonorrhoeae in subjects with gonorrhea who attended a Nanjing, China, sexually transmitted infection (STI) clinic, where antimicrobial resistance of N. gonorrhoeae is high and increasing. We found that N. gonorrhoeae transcriptional responses to infection differed in genital specimens taken from men and women, particularly antibiotic resistance gene expression, which was increased in men. These sex-specific findings may provide a new approach to guide therapeutic interventions and preventive measures that are also sex specific while providing additional insight to address antimicrobial resistance of N. gonorrhoeae. Copyright © 2018 Nudel et al.


September 22, 2019

Analysis of the Gli-D2 locus identifies a genetic target for simultaneously improving the breadmaking and health-related traits of common wheat.

Gliadins are a major component of wheat seed proteins. However, the complex homoeologous Gli-2 loci (Gli-A2, -B2 and -D2) that encode the a-gliadins in commercial wheat are still poorly understood. Here we analyzed the Gli-D2 locus of Xiaoyan 81 (Xy81), a winter wheat cultivar. A total of 421.091 kb of the Gli-D2 sequence was assembled from sequencing multiple bacterial artificial clones, and 10 a-gliadin genes were annotated. Comparative genomic analysis showed that Xy81 carried only eight of the a-gliadin genes of the D genome donor Aegilops tauschii, with two of them each experiencing a tandem duplication. A mutant line lacking Gli-D2 (DLGliD2) consistently exhibited better breadmaking quality and dough functionalities than its progenitor Xy81, but without penalties in other agronomic traits. It also had an elevated lysine content in the grains. Transcriptome analysis verified the lack of Gli-D2 a-gliadin gene expression in DLGliD2. Furthermore, the transcript and protein levels of protein disulfide isomerase were both upregulated in DLGliD2 grains. Consistent with this finding, DLGliD2 had increased disulfide content in the flour. Our work sheds light on the structure and function of Gli-D2 in commercial wheat, and suggests that the removal of Gli-D2 and the gliadins specified by it is likely to be useful for simultaneously enhancing the end-use and health-related traits of common wheat. Because gliadins and homologous proteins are widely present in grass species, the strategy and information reported here may be broadly useful for improving the quality traits of diverse cereal crops.© 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.