September 22, 2019

A graph-based approach to diploid genome assembly.

Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. https://github.com/whatshap/whatshap. Supplementary data are available at Bioinformatics online.


September 22, 2019

Large-scale gene losses underlie the genome evolution of parasitic plant Cuscuta australis.

Dodders (Cuscuta spp., Convolvulaceae) are root- and leafless parasitic plants. The physiology, ecology, and evolution of these obligate parasites are poorly understood. A high-quality reference genome of Cuscuta australis was assembled. Our analyses reveal that Cuscuta experienced accelerated molecular evolution, and Cuscuta and the convolvulaceous morning glory (Ipomoea) shared a common whole-genome triplication event before their divergence. The C. australis genome harbors 19,671 protein-coding genes, and importantly, 11.7% of the conserved orthologs in autotrophic plants are lost in C. australis. Many of these gene loss events likely result from its parasitic lifestyle and the massive changes of its body plan. Moreover, comparison of the gene expression patterns in Cuscuta prehaustoria/haustoria and various tissues of closely related autotrophic plants suggests that Cuscuta haustorium formation requires mostly genes normally involved in root development. The C. australis genome provides important resources for studying the evolution of parasitism, regressive evolution, and evo-devo in plant parasites.


September 22, 2019

Genotype-Corrector: improved genotype calls for genetic mapping in F2 and RIL populations.

F2 and recombinant inbred line (RIL) populations are very commonly used in plant genetic mapping studies. Although genome-wide genetic markers like single nucleotide polymorphisms (SNPs) can be readily identified by a wide array of methods, accurate genotype calling remains challenging, especially for heterozygous loci and missing data due to low sequencing coverage per individual. Therefore, we developed Genotype-Corrector, a program that corrects genotype calls and imputes missing data to improve the accuracy of genetic mapping. Genotype-Corrector can be applied in a wide variety of genetic mapping studies that are based on low coverage whole genome sequencing (WGS) or Genotyping-by-Sequencing (GBS) related techniques. Our results show that Genotype-Corrector achieves high accuracy when applied to both synthetic and real genotype data. Compared with using raw or only imputed genotype calls, the linkage groups built by corrected genotype data show much less noise and significant distortions can be corrected. Additionally, Genotype-Corrector compares favorably to the popular imputation software LinkImpute and Beagle in both F2 and RIL populations. Genotype-Corrector is publicly available on GitHub at https://github.com/freemao/Genotype-Corrector.


September 22, 2019

npInv: accurate detection and genotyping of inversions using long read sub-alignment.

Detection of genomic inversions remains challenging. Many existing methods primarily target inversions with a non-repetitive breakpoint, leaving inverted repeat (IR) mediated non-allelic homologous recombination (NAHR) inversions largely unexplored. We present npInv, a novel tool specifically for detecting and genotyping NAHR inversions using long read sub-alignment of long read sequencing data. We benchmark npInv with other tools in both simulation and real data. We use npInv to generate a whole-genome inversion map for NA12878 consisting of 30 NAHR inversions (of which 15 are novel), including all previously known NAHR mediated inversions in NA12878 with flanking IR less than 7kb. Our genotyping accuracy on this dataset was 94%. We used PCR to confirm the presence of two of these novel inversions. We show that there is a near linear relationship between the length of flanking IR and the minimum inversion size, without inverted repeats. The application of npInv shows high accuracy in both simulation and real data. The results give deeper insight into understanding inversion.


September 22, 2019

Heterogeneous and flexible transmission of mcr-1 in hospital-associated Escherichia coli.

The recent emergence of a transferable colistin resistance mechanism, MCR-1, has gained global attention because of its threat to clinical treatment of infections caused by multidrug-resistant Gram-negative bacteria. However, the possible transmission route of mcr-1 among Enterobacteriaceae species in clinical settings is largely unknown. Here, we present a comprehensive genomic analysis of Escherichia coli isolates collected in a hospital in Hangzhou, China. We found that mcr-1-carrying isolates from clinical infections and feces of inpatients and healthy volunteers were genetically diverse and were not closely related phylogenetically, suggesting that clonal expansion is not involved in the spread of mcr-1. The mcr-1 gene was found on either chromosomes or plasmids, but in most of the E. coli isolates, mcr-1 was carried on plasmids. The genetic context of the plasmids showed considerable diversity as evidenced by the different functional insertion sequence (IS) elements, toxin-antitoxin (TA) systems, heavy metal resistance determinants, and Rep proteins of broad-host-range plasmids. Additionally, the genomic analysis revealed nosocomial transmission of mcr-1 and the coexistence of mcr-1 with other genes encoding β-lactamases and fluoroquinolone resistance in the E. coli isolates. These findings indicate that mcr-1 is heterogeneously disseminated in both commensal and pathogenic strains of E. coli, suggest the high flexibility of this gene in its association with diverse genetic backgrounds of the hosts, and provide new insights into the genome epidemiology of mcr-1 among hospital-associated E. coli strains. IMPORTANCE Colistin represents one of the very few available drugs for treating infections caused by extensively multidrug-resistant Gram-negative bacteria. The recently emergent mcr-1 colistin resistance gene threatens the clinical utility of colistin and has gained global attention. How mcr-1 spreads in hospital settings remains unknown and was investigated by whole-genome sequencing of mcr-1-carrying Escherichia coli in this study. The findings revealed extraordinary flexibility of mcr-1 in its spread among genetically diverse E. coli hosts and plasmids, nosocomial transmission of mcr-1-carrying E. coli, and the continuous emergence of novel Inc types of plasmids carrying mcr-1 and new mcr-1 variants. Additionally, mcr-1 was found to be frequently associated with other genes encoding β-lactam and fluoroquinolone resistance. These findings provide important information on the transmission and epidemiology of mcr-1 and are of significant public health importance as the information is expected to facilitate the control of this significant antibiotic resistance threat. Copyright © 2018 Shen et al.


September 22, 2019

Using XCAVATOR and EXCAVATOR2 to Identify CNVs from WGS, WES, and TS Data.

Copy Number Variants (CNVs) are structural rearrangements contributing to phenotypic variation but also associated with many disease states. In recent years, the identification of CNVs from high-throughput sequencing experiments has become a common practice for both research and clinical purposes. Several computational methods have been developed so far. In this unit, we describe and give instructions on how to run two read count-based tools, XCAVATOR and EXCAVATOR2, which are tailored for the detection of both germline and somatic CNVs from different sequencing experiments (whole-genome, whole-exome, and targeted) in various disease contexts and population genetic studies. © 2018 by John Wiley & Sons, Inc.
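The general idea behind read count-based CNV detection can be illustrated with a minimal Python sketch. This is a generic illustration only, not the algorithm implemented in XCAVATOR or EXCAVATOR2: the function name `log2_ratios`, the fixed windows, the library-size normalization, and the pseudocount are all assumptions made for this example. It compares per-window read counts in a test sample against a control and reports a log2 ratio for each window.

```python
import math

def log2_ratios(test_counts, control_counts, pseudocount=0.5):
    """Per-window log2 ratio of library-size-normalized read counts.

    Hypothetical helper for illustration; real tools also correct for
    GC content, mappability, and other biases before segmentation.
    """
    test_total = sum(test_counts)
    control_total = sum(control_counts)
    ratios = []
    for t, c in zip(test_counts, control_counts):
        # Normalize each window by the total reads in its sample.
        t_norm = (t + pseudocount) / test_total
        c_norm = (c + pseudocount) / control_total
        ratios.append(math.log2(t_norm / c_norm))
    return ratios

# Toy data: the third window has twice the reads in the test sample.
test_counts = [100, 100, 200, 100]
control_counts = [100, 100, 100, 100]
ratios = log2_ratios(test_counts, control_counts)
```

In this toy input, the third window's ratio sits roughly one log2 unit above the others, which is the kind of signal a downstream segmentation step would flag as a candidate copy number gain.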


September 22, 2019

Nine draft genome sequences of Claviceps purpurea s.lat., including C. arundinis, C. humidiphila, and C. cf. spartinae, pseudomolecules for the pitch canker pathogen Fusarium circinatum, draft genome of Davidsoniella eucalypti, Grosmannia galeiformis, Quambalaria eucalypti, and Teratosphaeria destructans.

This genome announcement includes draft genomes from Claviceps purpurea s.lat., including C. arundinis, C. humidiphila and C. cf. spartinae. The draft genomes of Davidsoniella eucalypti, Quambalaria eucalypti and Teratosphaeria destructans, all three important eucalyptus pathogens, are presented. The insect associate Grosmannia galeiformis is also described. The pine pathogen genome of Fusarium circinatum has been assembled into pseudomolecules, based on additional sequence data and by harnessing the known synteny within the Fusarium fujikuroi species complex. This new assembly of the F. circinatum genome provides 12 pseudomolecules that correspond to the haploid chromosome number of F. circinatum. These are comparable to other chromosomal assemblies within the FFSC and will enable more robust genomic comparisons within this species complex.


September 22, 2019

Validation of Genomic Structural Variants Through Long Sequencing Technologies.

Although numerous algorithms have been developed to identify large chromosomal rearrangements (i.e., genomic structural variants, SVs), there remains a dearth of approaches to evaluate their results. This is significant, as the accurate identification of SVs is still an outstanding problem whereby no single algorithm has been shown to be able to achieve high sensitivity and specificity across different classes of SVs. The method introduced in this chapter, VaPoR, is specifically designed to evaluate the accuracy of SV predictions using third-generation long sequences. This method uses a recurrence approach and collects direct evidence from raw reads, thus avoiding computationally costly whole genome assembly. This chapter describes in detail how to apply this tool to different data types.


September 22, 2019

The integrative conjugative element clc (ICEclc) of Pseudomonas aeruginosa JB2.

Integrative conjugative elements (ICE) are a diverse group of chromosomally integrated, self-transmissible mobile genetic elements (MGE) that are active in shaping the functions of bacteria and bacterial communities. Each type of ICE carries a characteristic set of core genes encoding functions essential for maintenance and self-transmission, and cargo genes that endow hosts with phenotypes beneficial for niche adaptation. An important area to which ICE can contribute beneficial functions is the biodegradation of xenobiotic compounds. In the biodegradation realm, the best-characterized ICE is ICEclc, which carries cargo genes encoding ortho-cleavage of chlorocatechols (clc genes) and aminophenol metabolism (amn genes). The element was originally identified in the 3-chlorobenzoate-degrader Pseudomonas knackmussii B13, and the closest relative is a nearly identical element in Burkholderia xenovorans LB400 (designated ICEclc-B13 and ICEclc-LB400, respectively). In the present report, genome sequencing of the o-chlorobenzoate degrader Pseudomonas aeruginosa JB2 was used to identify a new member of the ICEclc family, ICEclc-JB2. The cargo of ICEclc-JB2 differs from that of ICEclc-B13 and ICEclc-LB400 in consisting of a unique combination of genes that encode the utilization of o-halobenzoates and o-hydroxybenzoate as growth substrates (ohb genes and hyb genes, respectively) and which are duplicated in a tandem repeat. Also, ICEclc-JB2 lacks an operon of regulatory genes (tciR-marR-mfsR) that is present in the other two ICEclc, and which controls excision from the host. Thus, the mechanisms regulating intracellular behavior of ICEclc-JB2 may differ from those of its close relatives. The entire tandem repeat in ICEclc-JB2 can excise independently from the element in a process apparently involving transposases/insertion sequences associated with the repeats. Excision of the repeats removes important niche adaptation genes from ICEclc-JB2, rendering it less beneficial to the host. However, the reduced version of ICEclc-JB2 could now acquire new genes that might be beneficial to a future host and, consequently, to the survival of ICEclc-JB2. Collectively, the present identification and characterization of ICEclc-JB2 provides insights into roles of MGE in bacterial niche adaptation and the evolution of catabolic pathways for biodegradation of xenobiotic compounds.


September 22, 2019

A chromosome scale assembly of the model desiccation tolerant grass Oropetium thomaeum.

Oropetium thomaeum is an emerging model for desiccation tolerance and genome size evolution in grasses. A high-quality draft genome of Oropetium was recently sequenced, but the lack of a chromosome scale assembly has hindered comparative analyses and downstream functional genomics. Here, we reassembled Oropetium, and anchored the genome into ten chromosomes using Hi-C based chromatin interactions. A combination of high-resolution RNAseq data and homology-based gene prediction identified thousands of new, conserved gene models that were absent from the V1 assembly. This includes thousands of new genes with high expression across a desiccation timecourse. The sorghum and Oropetium genomes have a surprising degree of chromosome-level collinearity, and several chromosome pairs have near perfect synteny. Other chromosomes are collinear in the gene rich chromosome arms but have experienced pericentric translocations. Together, these resources will be useful for the grass comparative genomic community and further establish Oropetium as a model resurrection plant.


September 22, 2019

Integrating long-range connectivity information into de Bruijn graphs.

The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary data are available at Bioinformatics online.
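To make the plain de Bruijn graph described above concrete, here is a minimal Python sketch. It is not McCortex code and does not implement the Linked de Bruijn Graph annotations; the function name `build_de_bruijn`, the toy reads, and the choice to use (k-1)-mers as nodes (one common convention) are assumptions for illustration.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}.

    Each k-mer in a read contributes one edge from its prefix to its suffix.
    Hypothetical helper; ignores reverse complements and sequencing errors.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Two toy reads that share the 4-mer GTAC.
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)

# Nodes with more than one successor are branch points.
branches = {node for node, successors in graph.items() if len(successors) > 1}
```

With these toy reads, the node `ACG` has two successors (`CGT` and `CGG`). The graph alone no longer records which reads passed through that branch or in what order, which is the kind of long range information the abstract says a de Bruijn graph does not retain.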


September 22, 2019

Human copy number variants are enriched in regions of low mappability.

Copy number variants (CNVs) are known to affect a large portion of the human genome and have been implicated in many diseases. Although whole-genome sequencing (WGS) can help identify CNVs, most analytical methods suffer from limited sensitivity and specificity, especially in regions of low mappability. To address this, we use PopSV, a CNV caller that relies on multiple samples to control for technical variation. We demonstrate that our calls are stable across different types of repeat-rich regions and validate the accuracy of our predictions using orthogonal approaches. Applying PopSV to 640 human genomes, we find that low-mappability regions are approximately 5 times more likely to harbor germline CNVs, in stark contrast to the nearly uniform distribution observed for somatic CNVs in 95 cancer genomes. In addition to known enrichments in segmental duplication and near centromeres and telomeres, we also report that CNVs are enriched in specific types of satellite and in some of the most recent families of transposable elements. Finally, using this comprehensive approach, we identify 3455 regions with recurrent CNVs that were missing from existing catalogs. In particular, we identify 347 genes with a novel exonic CNV in low-mappability regions, including 29 genes previously associated with disease.


September 22, 2019

A synthetic-diploid benchmark for accurate variant-calling evaluation.

Existing benchmark datasets for use in evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers, and they are thus biased toward easy regions that are accessible by these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a relatively more accurate and less biased estimate of small-variant-calling error rates in a realistic context.


September 22, 2019

Conservation genomics of the declining North American bumblebee Bombus terricola reveals inbreeding and selection on immune genes.

The yellow-banded bumblebee Bombus terricola was common in North America but has recently declined and is now on the IUCN Red List of threatened species. The causes of B. terricola’s decline are not well understood. Our objectives were to create a partial genome and then use this to estimate population data of conservation interest, and to determine whether genes showing signs of recent selection suggest a specific cause of decline. First, we generated a draft partial genome (contig set) for B. terricola, sequenced using Pacific Biosciences RS II at an average depth of 35×. Second, we sequenced the individual genomes of 22 bumblebee gynes from Ontario and Quebec using Illumina HiSeq 2500, each at an average depth of 20×, which were used to improve the PacBio genome calls and for population genetic analyses. The latter revealed that several samples had long runs of homozygosity, and individuals had high inbreeding coefficient F, consistent with low effective population size. Our data suggest that B. terricola’s effective population size has decreased orders of magnitude from pre-Holocene levels. We carried out tests of selection to identify genes that may have played a role in ameliorating environmental stressors underlying B. terricola’s decline. Several immune-related genes have signatures of recent positive selection, which is consistent with the pathogen-spillover hypothesis for B. terricola’s decline. The new B. terricola contig set can help solve the mystery of bumblebee decline by enabling functional genomics research to directly assess the health of pollinators and identify the stressors causing declines.


September 22, 2019

Novel Enterobacter lineage as leading cause of nosocomial outbreak involving carbapenemase-producing strains.

We investigated unusual carbapenemase-producing Enterobacter cloacae complex isolates (n = 8) in the novel sequence type (ST) 873, which caused nosocomial infections in 2 hospitals in France. Whole-genome sequence typing showed the 1-year persistence of the epidemic strain, which harbored a blaVIM-4 ST1-IncHI2 plasmid, in 1 health institution and 2 closely related strains harboring blaCTX-M-15 in the other. These isolates formed a new subgroup in the E. hormaechei metacluster, according to their hsp60 sequences and phylogenomic analysis. The average nucleotide identities, specific biochemical properties, and pangenomic and functional investigations of isolates suggested isolates of a novel species that had acquired genes associated with adhesion and mobility. The emergence of this novel Enterobacter phylogenetic lineage within hospitals should be closely monitored because of its ability to persist and spread.

