July 7, 2019

Information-optimal genome assembly via sparse read-overlap graphs.

In the context of third-generation long-read sequencing technologies, read-overlap-based approaches are expected to play a central role in the assembly step. A fundamental challenge in assembling from a read-overlap graph is that the true sequence corresponds to a Hamiltonian path on the graph, and, under most formulations, the assembly problem becomes NP-hard, restricting practical approaches to heuristics. In this work, we avoid this seemingly fundamental barrier by first setting the computational complexity issue aside, and seeking an algorithm that targets information limits. In particular, we consider a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the true sequence? Based on insights from this information feasibility question, we present an algorithm, the Not-So-Greedy algorithm, to construct a sparse read-overlap graph. Unlike most other assembly algorithms, Not-So-Greedy comes with a performance guarantee: whenever information feasibility conditions are satisfied, the algorithm reduces the assembly problem to an Eulerian path problem on the resulting graph, which can thus be solved in linear time. In practice, this theoretical guarantee translates into assemblies of higher quality. Evaluations on both simulated reads from real genomes and a PacBio Escherichia coli K12 dataset demonstrate that Not-So-Greedy compares favorably with standard string graph approaches in terms of accuracy of the resulting read-overlap graph and contig N50. Available at github.com/samhykim/nsg. Contact: courtade@eecs.berkeley.edu or dntse@stanford.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
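
The linear-time claim rests on the fact that an Eulerian path (one that uses every edge exactly once) can be found with Hierholzer's algorithm. The sketch below is a generic illustration on a small directed graph, not the Not-So-Greedy code itself; the function name `eulerian_path` and the node labels are assumptions made for the example.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return an Eulerian path through a directed multigraph as a list of
    nodes, or None if no such path exists. Hierholzer's algorithm; the
    running time is linear in the number of edges."""
    if not edges:
        return []
    graph = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for u, v in edges:
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)

    # Degree conditions: at most one node with one extra outgoing edge
    # (the start), at most one with one extra incoming edge (the end),
    # and every other node balanced.
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    balanced = all(abs(out_deg[n] - in_deg[n]) <= 1 for n in nodes)
    if not balanced or len(starts) > 1 or len(starts) != len(ends):
        return None

    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            # Follow an unused edge out of u.
            stack.append(graph[u].pop())
        else:
            # u has no unused edges left: it is final in the path.
            path.append(stack.pop())
    path.reverse()

    # If some edges were never reached, the graph is disconnected.
    if len(path) != len(edges) + 1:
        return None
    return path


# Hypothetical toy graph: a cycle A -> B -> C -> A plus a tail A -> D.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
toy_path = eulerian_path(toy_edges)
```

Each edge is pushed and popped at most once, which is where the linear bound comes from; a Hamiltonian path search on the same graph has no comparable guarantee.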


July 7, 2019

TeloPCR-seq: a high-throughput sequencing approach for telomeres.

We have developed a high-throughput sequencing approach that enables us to determine terminal telomere sequences from tens of thousands of individual Schizosaccharomyces pombe telomeres. This method provides unprecedented coverage of telomeric sequence complexity in fission yeast. S. pombe telomeres are composed of modular degenerate repeats that can be explained by variation in usage of the TER1 RNA template during reverse transcription. Taking advantage of this deep sequencing approach, we find that ‘like’ repeat modules are highly correlated within individual telomeres. Moreover, repeat module preference varies with telomere length, suggesting that existing repeats promote the incorporation of like repeats and/or that specific conformations of the telomerase holoenzyme efficiently and/or processively add repeats of like nature. After the loss of telomerase activity, this sequencing and analysis pipeline defines a population of telomeres with altered sequence content. This approach will be adaptable to study telomeric repeats in other organisms and also to interrogate repetitive sequences throughout the genome that are inaccessible to other sequencing methods. © 2016 Federation of European Biochemical Societies.


July 7, 2019

CoLoRMap: Correcting Long Reads by Mapping short reads.

Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, long reads have so far been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap. Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
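
The shortest-path idea can be illustrated with a small sketch. It assumes each short read mapped onto a long read is summarized as a hypothetical `(start, end, edit_distance)` tuple, links two reads when the second overlaps and extends the first, and uses Dijkstra's algorithm to find the chain of reads with the lowest summed edit distance. The function name, the tuple layout and the edge weighting are illustrative assumptions, not CoLoRMap's actual graph construction.

```python
import heapq


def best_read_chain(alignments, region_start, region_end):
    """Find the chain of overlapping short-read alignments that spans
    [region_start, region_end] with the lowest total edit distance.

    alignments: list of (start, end, edit_distance) on the long read.
    Returns (total_edit_distance, [alignment indexes]) or None.
    Simplification: each read contributes its full edit distance, so
    errors inside overlaps may be counted twice.
    """
    # Any read that starts at or before the region start can open a chain.
    heap = [
        (edit, i, (i,))
        for i, (start, _end, edit) in enumerate(alignments)
        if start <= region_start
    ]
    heapq.heapify(heap)
    settled = set()
    while heap:
        cost, i, chain = heapq.heappop(heap)
        if i in settled:
            continue
        settled.add(i)
        start_i, end_i, _ = alignments[i]
        if end_i >= region_end:
            # Reads are popped in order of cost, so this chain is optimal.
            return cost, list(chain)
        for j, (start_j, end_j, edit_j) in enumerate(alignments):
            overlaps = start_i < start_j <= end_i
            extends = end_j > end_i
            if j not in settled and overlaps and extends:
                heapq.heappush(heap, (cost + edit_j, j, chain + (j,)))
    return None


# Hypothetical alignments on a 300 bp stretch of a long read.
toy_alignments = [
    (0, 100, 5),
    (50, 150, 1),
    (80, 200, 4),
    (140, 250, 2),
    (190, 300, 1),
]
toy_chain = best_read_chain(toy_alignments, 0, 300)
```

In this toy input the chain through reads 0, 1, 3 and 4 costs 9, while the shorter chain through reads 0, 2 and 4 costs 10, so the search prefers more reads with fewer edits.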


July 7, 2019

Epigenetic mechanisms in microbial members of the human microbiota: current knowledge and perspectives.

The human microbiota and epigenetic processes have both been shown to play a crucial role in health and disease. However, information on epigenetic modulation in members of the microbiota is extremely scarce, except for a few pathogens. DNA adenine methylation in particular has been described extensively as modulating the virulence of pathogenic bacteria. It would thus appear likely that such mechanisms are widespread among bacterial members of the microbiota. This review briefly presents the current knowledge on epigenetic processes in bacteria, gives examples of known methylation processes in microbial members of the human microbiota and summarizes the knowledge on regulation of host epigenetic processes by the human microbiota.


July 7, 2019

Efficient, cost-effective, high-throughput, Multilocus Sequencing Typing (MLST) method, NGMLST, and the analytical software program MLSTEZ.

Multilocus sequence typing (MLST) has become the preferred method for genotyping many biological species. It can be used to identify major phylogenetic clades, molecular groups, or subpopulations of a species, as well as individual strains or clones. However, conventional MLST is costly and time consuming, which limits its power for genotyping large numbers of samples. Here, we describe a new MLST method that uses next-generation sequencing, a multiplexing protocol, and appropriate analytical software to provide accurate, rapid, and economical MLST genotyping of 96 or more isolates in a single assay.
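
The typing step in MLST amounts to a lookup: each isolate's allele numbers at a fixed set of loci form a profile, and the profile maps to a sequence type (ST). The sketch below shows that lookup only. The locus names, allele numbers and ST labels are invented for illustration and do not come from any real scheme or from the MLSTEZ software.

```python
def assign_sequence_type(isolate_alleles, loci, st_table):
    """Return the sequence type for an isolate, or None if its allelic
    profile is not in the table (a candidate novel ST).

    isolate_alleles: dict mapping locus name -> allele number.
    loci: ordered list of locus names that define the scheme.
    st_table: dict mapping a profile tuple -> sequence type label.
    """
    profile = tuple(isolate_alleles[locus] for locus in loci)
    return st_table.get(profile)


# Hypothetical three-locus scheme and profile table.
toy_loci = ["locusA", "locusB", "locusC"]
toy_st_table = {
    (1, 1, 1): "ST1",
    (1, 2, 1): "ST2",
    (3, 2, 5): "ST3",
}

# Hypothetical isolates, e.g. after demultiplexing a pooled run.
toy_isolates = {
    "isolate_01": {"locusA": 1, "locusB": 2, "locusC": 1},
    "isolate_02": {"locusA": 3, "locusB": 2, "locusC": 5},
    "isolate_03": {"locusA": 9, "locusB": 9, "locusC": 9},
}

toy_types = {
    name: assign_sequence_type(alleles, toy_loci, toy_st_table)
    for name, alleles in toy_isolates.items()
}
```

An isolate whose profile is missing from the table comes back as `None`, which is the point where a real workflow would check for a new allele or a new ST.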


July 7, 2019

Transfer of the potato plant isolates of Pectobacterium wasabiae to Pectobacterium parmentieri sp. nov.

Pectobacterium wasabiae was originally isolated from Japanese horseradish (Eutrema wasabi), but recently some Pectobacterium isolates collected from potato plants and tubers displaying blackleg and soft rot symptoms were also assigned to P. wasabiae. Here, combining genomic and phenotypical data, we re-evaluated their taxonomic position. PacBio and Illumina technologies were used to complete the genome sequences of P. wasabiae CFBP 3304T and RNS 08-42-1A. Multi-locus sequence analysis showed that the P. wasabiae strains RNS 08-42-1A, SCC3193, CFIA1002 and WPP163, which were collected from the potato plant environment, constituted a separate clade from the original Japanese horseradish P. wasabiae. The taxonomic position of these strains was also supported by calculation of the in-silico DNA-DNA hybridization, genome average nucleotide identity, alignment fraction and average nucleotide identity values. In addition, they were phenotypically distinguished from P. wasabiae strains by producing acids from (+)-raffinose, α-d(+)-α-lactose, d(+)-galactose and (+)-melibiose but not from methyl α-d-glycopyranoside, (+)-maltose or malonic acid. The name Pectobacterium parmentieri sp. nov. is proposed for this taxon; the type strain is RNS 08-42-1AT (=CFBP 8475T=LMG 29774T).


July 7, 2019

Genomic insights into a sustained national outbreak of Yersinia pseudotuberculosis.

In 2014, a sustained outbreak of yersiniosis due to Yersinia pseudotuberculosis occurred across all major cities in New Zealand (NZ), with a total of 220 laboratory-confirmed cases, representing one of the largest ever reported outbreaks of Y. pseudotuberculosis. Here, we performed whole genome sequencing of outbreak-associated isolates to produce the largest population analysis to date of Y. pseudotuberculosis, giving us unprecedented capacity to understand the emergence and evolution of the outbreak clone. Multivariate analysis incorporating our genomic and clinical epidemiological data strongly suggested a single point-source contamination of the food chain, with subsequent nationwide distribution of contaminated produce. We additionally uncovered significant diversity in key determinants of virulence, which we speculate may help explain the high morbidity linked to this outbreak.


July 7, 2019

svclassify: a method to establish benchmark structural variant calls.

The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). High-quality benchmark small variant calls for the pilot National Institute of Standards and Technology (NIST) Reference Material (NA12878) have been developed by the Genome in a Bottle Consortium, but no similar high-quality benchmark SV calls exist for this genome. Since SV callers output highly discordant results, we developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (svclassify) calculates annotations from one or more aligned bam files from many high-throughput sequencing technologies, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives. We first used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions. We then used svclassify to cluster and classify these deletions as well as a set of high-confidence deletions from the 1000 Genomes Project and a set of breakpoint-resolved complex insertions from Spiral Genetics. We find that likely SVs cluster separately from likely non-SVs based on our annotations, and that the SVs cluster into different types of deletions. We then developed a supervised one-class classification method that uses a training set of random non-SV regions to determine whether candidate SVs have abnormal annotations different from most of the genome. To test this classification method, we use our pedigree-based breakpoint-resolved SVs, SVs validated by the 1000 Genomes Project, and assembly-based breakpoint-resolved insertions, along with semi-automated visualization using svviz. We find that candidate SVs with high scores from multiple technologies have high concordance with PCR validation and an orthogonal consensus method MetaSV (99.7% concordant), and candidate SVs with low scores are questionable. We distribute a set of 2676 high-confidence deletions and 68 high-confidence insertions with high svclassify scores from these call sets for benchmarking SV callers. We expect these methods to be particularly useful for establishing high-confidence SV calls for benchmark samples that have been characterized by multiple technologies.
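
The one-class idea, training only on random non-SV regions and flagging candidates whose annotations look abnormal, can be shown with a deliberately simple stand-in. The sketch below fits a per-annotation mean and standard deviation on non-SV regions and scores a candidate by its largest absolute z-score. This is an assumed toy model, not the svclassify model; the two annotations (relative coverage and fraction of discordant read pairs), the function names and the threshold of 3 are all hypothetical.

```python
from statistics import mean, stdev


def fit_one_class(training_rows):
    """Fit a per-annotation (mean, standard deviation) model on rows
    drawn from random non-SV regions. Each annotation must vary in the
    training set, otherwise its standard deviation is zero."""
    columns = list(zip(*training_rows))
    return [(mean(column), stdev(column)) for column in columns]


def abnormality_score(model, row):
    """Largest absolute z-score across annotations for one candidate."""
    return max(abs(value - mu) / sd for (mu, sd), value in zip(model, row))


def is_likely_sv(model, row, threshold=3.0):
    """Call a candidate a likely SV when it looks unlike the non-SV
    background. The threshold is an arbitrary illustrative choice."""
    return abnormality_score(model, row) > threshold


# Hypothetical annotations per region:
# (relative coverage, fraction of discordant read pairs).
non_sv_regions = [
    (1.0, 0.01),
    (0.9, 0.02),
    (1.1, 0.00),
    (1.0, 0.01),
]
toy_model = fit_one_class(non_sv_regions)

# A deletion-like candidate: half coverage, many discordant pairs.
deletion_like = is_likely_sv(toy_model, (0.5, 0.30))
# A candidate that looks like the background.
background_like = is_likely_sv(toy_model, (1.0, 0.01))
```

Training on the background alone avoids the need for a large set of validated SVs, which is the resource the benchmark is trying to create in the first place.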


July 7, 2019

Assembly of the draft genome of buckwheat and its applications in identifying agronomically useful genes.

Buckwheat (Fagopyrum esculentum Moench; 2n = 2x = 16) is a nutritionally dense annual crop widely grown in temperate zones. To accelerate molecular breeding programmes of this important crop, we generated a draft assembly of the buckwheat genome using short reads obtained by next-generation sequencing (NGS), and constructed the Buckwheat Genome DataBase. After assembling short reads, we determined 387,594 scaffolds as the draft genome sequence (FES_r1.0). The total length of FES_r1.0 was 1,177,687,305 bp, and the N50 of the scaffolds was 25,109 bp. Gene prediction analysis revealed 286,768 coding sequences (CDSs; FES_r1.0_cds) including those related to transposable elements. The total length of FES_r1.0_cds was 212,917,911 bp, and the N50 was 1,101 bp. Of these, the functions of 35,816 CDSs excluding those for transposable elements were annotated by BLAST analysis. To demonstrate the utility of the database, we conducted several test analyses using BLAST and keyword searches. Furthermore, we used the draft genome as a reference sequence for NGS-based markers, and successfully identified novel candidate genes controlling heteromorphic self-incompatibility of buckwheat. The database and draft genome sequence provide a valuable resource that can be used in efforts to develop buckwheat cultivars with superior agronomic traits. © The Author 2016. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
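
For readers unfamiliar with the N50 statistic quoted above: it is the length of the sequence at which, going from longest to shortest, the running total first reaches half of the overall assembly length. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative and unrelated to the buckwheat data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths: the length L
    such that sequences of length >= L together contain at least half
    of the total bases."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy scaffold lengths in bp (total 54, half 27): 10 + 9 + 8 reaches 27.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends only on the length distribution, a high value indicates contiguity but says nothing about whether the scaffolds are correctly assembled.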


July 7, 2019

Clonal dissemination of Pseudomonas aeruginosa sequence type 235 isolates carrying blaIMP-6 and emergence of blaGES-24 and blaIMP-10 on novel genomic islands PAGI-15 and -16 in South Korea.

A total of 431 Pseudomonas aeruginosa clinical isolates were collected from 29 general hospitals in South Korea in 2015. Antimicrobial susceptibility was tested by the disk diffusion method, and MICs of carbapenems were determined by the agar dilution method. Carbapenemase genes were amplified by PCR and sequenced, and the structures of class 1 integrons surrounding the carbapenemase gene cassettes were analyzed by PCR mapping. Multilocus sequence typing (MLST) and pulsed-field gel electrophoresis (PFGE) were performed for strain typing. Whole-genome sequencing was carried out to analyze P. aeruginosa genomic islands (PAGIs) carrying the blaIMP-6, blaIMP-10, and blaGES-24 genes. The rates of carbapenem-nonsusceptible and carbapenemase-producing P. aeruginosa isolates were 34.3% (148/431) and 9.5% (41/431), respectively. IMP-6 was the most prevalent carbapenemase type, followed by VIM-2, IMP-10, and GES-24. All carbapenemase genes were located on class 1 integrons of 6 different types on the chromosome. All isolates harboring carbapenemase genes exhibited genetic relatedness by PFGE (similarity > 80%); moreover, all isolates were identified as sequence type 235 (ST235), with the exception of two ST244 isolates by MLST. The blaIMP-6, blaIMP-10, and blaGES-24 genes were found to be located on two novel PAGIs, designated PAGI-15 and PAGI-16. Our data support the clonal spread of an IMP-6-producing P. aeruginosa ST235 strain, and the emergence of IMP-10 and GES-24 demonstrates the diversification of carbapenemases in P. aeruginosa in Korea. Copyright © 2016, American Society for Microbiology. All Rights Reserved.


July 7, 2019

Complete genome sequence of a psychotrophic Pseudarthrobacter sulfonivorans strain Ar51 (CGMCC 4.7316), a novel crude oil and multi benzene compounds degradation strain.

Pseudarthrobacter sulfonivorans strain Ar51, a psychotrophic bacterium isolated from the Tibet permafrost of China, can degrade crude oil and multi benzene compounds efficiently at low temperatures. Here we report the complete genome sequence of this bacterium. The genome consists of a circular chromosome of 5.04 Mbp and a circular plasmid of 12.39 kbp. The availability of this genome sequence allows us to investigate the genetic basis of crude oil degradation and adaptation to growth in a nutrient-poor permafrost environment. Copyright © 2016 Elsevier B.V. All rights reserved.


July 7, 2019

Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.


July 7, 2019

Collection and storage of HLA NGS genotyping data for the 17th International HLA and Immunogenetics Workshop.

For over 50 years, the International HLA and Immunogenetics Workshops (IHIW) have advanced the fields of histocompatibility and immunogenetics (H&I) via community sharing of technology, experience and reagents, and the establishment of ongoing collaborative projects. Held in the fall of 2017, the 17th IHIW focused on the application of next generation sequencing (NGS) technologies for clinical and research goals in the H&I fields. NGS technologies have the potential to allow dramatic insights and advances in these fields, but the scope and sheer quantity of data associated with NGS raise challenges for their analysis, collection, exchange and storage. The 17th IHIW adopted a centralized approach to these issues, and we developed the tools, services and systems to create an effective system for capturing and managing these NGS data. We worked with NGS platform and software developers to define a set of distinct but equivalent NGS typing reports that record NGS data in a uniform fashion. The 17th IHIW database applied our standards, tools and services to collect, validate and store those structured, multi-platform data in an automated fashion. We have created community resources to enable exploration of the vast store of curated sequence and allele-name data in the IPD-IMGT/HLA Database, with the goal of creating a long-term community resource that integrates these curated data with new NGS sequence and polymorphism data, for advanced analyses and applications. Copyright © 2017 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.


July 7, 2019

Microbial sequence typing in the genomic era.

Next-generation sequencing (NGS), also known as high-throughput sequencing, is changing the field of microbial genomics research. NGS allows for a more comprehensive analysis of the diversity, structure and composition of microbial genes and genomes compared to the traditional automated Sanger capillary sequencing at a lower cost. NGS strategies have expanded the versatility of standard and widely used typing approaches based on nucleotide variation in several hundred DNA sequences and a few gene fragments (MLST, MLVA, rMLST and cgMLST). NGS can now accommodate variation in thousands or millions of sequences from selected amplicons to full genomes (WGS, NGMLST and HiMLST). To extract signals from high-dimensional NGS data and make valid statistical inferences, novel analytic and statistical techniques are needed. In this review, we describe standard and new approaches for microbial sequence typing at gene and genome levels and guidelines for subsequent analysis, including methods and computational frameworks. We also present several applications of these approaches to some disciplines, namely genotyping, phylogenetics and molecular epidemiology. Copyright © 2017 Elsevier B.V. All rights reserved.


July 7, 2019

De novo mutations resolve disease transmission pathways in clonal malaria

Detecting de novo mutations in viral and bacterial pathogens enables researchers to reconstruct detailed networks of disease transmission and is a key technique in genomic epidemiology. However, these techniques have not yet been applied to the malaria parasite, Plasmodium falciparum, in which a larger genome, slower generation times, and a complex life cycle make them difficult to implement. Here, we demonstrate the viability of de novo mutation studies in P. falciparum for the first time. Using a combination of sequencing, library preparation, and genotyping methods that have been optimized for accuracy in low-complexity genomic regions, we have detected de novo mutations that distinguish nominally identical parasites from clonal lineages. Despite its slower evolutionary rate compared with bacterial or viral species, de novo mutation can be detected in P. falciparum across timescales of just 1-2 years and evolutionary rates in low-complexity regions of the genome can be up to twice that detected in the rest of the genome. The increased mutation rate allows the identification of separate clade expansions that cannot be found using previous genomic epidemiology approaches and could be a crucial tool for mapping residual transmission patterns in disease elimination campaigns and reintroduction scenarios.

