July 7, 2019

TriPoly: haplotype estimation for polyploids using sequencing data of related individuals.

Knowledge of haplotypes, i.e. phased and ordered marker alleles on a chromosome, is essential to answer many questions in genetics and genomics. By generating short pieces of DNA sequence, high-throughput modern sequencing technologies make estimation of haplotypes possible for single individuals. In polyploids, however, haplotype estimation methods usually require deep coverage to achieve sufficient accuracy. This often renders sequencing-based approaches too costly to be applied to large populations needed in studies of Quantitative Trait Loci. We propose a novel haplotype estimation method for polyploids, TriPoly, that combines sequencing data with Mendelian inheritance rules to infer haplotypes in parent-offspring trios. Using realistic simulations of both short and long-read sequencing data for banana (Musa acuminata) and potato (Solanum tuberosum) trios, we show that TriPoly yields more accurate progeny haplotypes at low coverages compared to existing methods that work on single individuals. We also apply TriPoly to phase Single Nucleotide Polymorphisms on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring sequenced with an RNA capture approach. We show that TriPoly haplotype estimates differ from those of the other methods mainly in regions with imperfect sequencing or mapping difficulties, as it does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring. TriPoly has been implemented in Python 3.5.2 (also compatible with Python 2.7.3 and higher) and can be freely downloaded at https://github.com/EhsanMotazedi/TriPoly. Supplementary data are available at Bioinformatics online.


July 7, 2019

Fast-SG: an alignment-free algorithm for hybrid assembly.

Long-read sequencing technologies are the ultimate solution for genome repeats, allowing near reference-level reconstructions of large genomes. However, long-read de novo assembly pipelines are computationally intense and require a considerable amount of coverage, thereby hindering their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short- and long-read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes. Here, we propose a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short-read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short-read aligners when building the scaffolding graph and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long-read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878). Fast-SG opens a door to achieve accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.


July 7, 2019

Clustering of circular consensus sequences: accurate error correction and assembly of single molecule real-time reads from multiplexed amplicon libraries.

Targeted resequencing with high-throughput sequencing (HTS) platforms can be used to efficiently interrogate the genomes of large numbers of individuals. A critical issue for research and applications using HTS data, especially from long-read platforms, is error in base calling arising from technological limits and bioinformatic algorithms. We found that the community standard long amplicon analysis (LAA) module from Pacific Biosciences is prone to substantial bioinformatic errors that raise concerns about findings based on this pipeline, prompting the need for a new method. A single molecule real-time (SMRT) sequencing-error correction and assembly pipeline, C3S-LAA, was developed for libraries of pooled amplicons. By uniquely leveraging the structure of SMRT sequence data (comprised of multiple low quality subreads from which higher quality circular consensus sequences are formed) to cluster raw reads, C3S-LAA produced accurate consensus sequences and assemblies of overlapping amplicons from single sample and multiplexed libraries. In contrast, despite read depths in excess of 100X per amplicon, the standard long amplicon analysis module from Pacific Biosciences generated unexpected numbers of amplicon sequences with substantial inaccuracies in the consensus sequences. A bootstrap analysis showed that the C3S-LAA pipeline per se was effective at removing bioinformatic sources of error, but in rare cases a read depth of nearly 400X was not sufficient to overcome minor but systematic errors inherent to amplification or sequencing. C3S-LAA uses a divide and conquer processing algorithm for SMRT amplicon-sequence data that generates accurate consensus sequences and local sequence assemblies. Solving the confounding bioinformatic source of error in LAA allowed for the identification of limited instances of errors due to DNA amplification or sequencing of homopolymeric nucleotide tracts.
For research and development in genomics, C3S-LAA allows meaningful conclusions and biological inferences to be made from accurately polished sequence output.


July 7, 2019

BMScan: using whole genome similarity to rapidly and accurately identify bacterial meningitis causing species.

Bacterial meningitis is a life-threatening infection that remains a public health concern. Bacterial meningitis is commonly caused by the following species: Neisseria meningitidis, Streptococcus pneumoniae, Listeria monocytogenes, Haemophilus influenzae and Escherichia coli. Here, we describe BMScan (Bacterial Meningitis Scan), a whole-genome analysis tool for the species identification of bacterial meningitis-causing and closely-related pathogens, an essential step for case management and disease surveillance. BMScan relies on a reference collection that contains genomes for 17 focal species to scan against to identify a given species. We established this reference collection by supplementing publicly available genomes from RefSeq with genomes from the isolate collections of the Centers for Disease Control Bacterial Meningitis Laboratory and the Minnesota Department of Health Public Health Laboratory, and then filtered them down to a representative set of genomes which capture the diversity for each species. Using this reference collection, we evaluated two genomic comparison algorithms, Mash and Average Nucleotide Identity (ANI), for their ability to accurately and rapidly identify our focal species. We found that the results of Mash were strongly correlated with the results of ANI for species identification, while providing a significant reduction in run-time. This drastic difference in run-time enabled the rapid scanning of large reference genome collections, which, when combined with species-specific threshold values, facilitated the development of BMScan. Using a validation set of 15,503 genomes of our species of interest, BMScan accurately identified 99.97% of the species within 16 min 47 s. Identification of the bacterial meningitis pathogenic species is a critical step for case confirmation and further strain characterization.
BMScan employs species-specific thresholds for previously-validated, genome-wide similarity statistics compiled from a curated reference genome collection to rapidly and accurately identify the species of uncharacterized bacterial meningitis pathogens and closely related pathogens. BMScan will facilitate the transition in public health laboratories from traditional phenotypic detection methods to whole genome sequencing based methods for species identification.
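
The idea of combining a genome-wide distance with species-specific thresholds can be illustrated with a minimal sketch. This is not BMScan's code: the function name `identify_species`, the input layout, and every distance and threshold value shown are hypothetical, and the distances are assumed to have been computed beforehand (for example with Mash) against a labeled reference collection.

```python
def identify_species(distances, thresholds):
    """Pick the closest reference and accept it only if it passes its species threshold.

    distances  -- list of (species, distance) pairs for one query genome
                  (hypothetical, precomputed values; lower means more similar)
    thresholds -- dict mapping species name to the largest accepted distance
                  (hypothetical values, not taken from BMScan)
    Returns the species name, or None when no confident call can be made.
    """
    if not distances:
        return None
    # Closest reference genome across the whole collection.
    species, best = min(distances, key=lambda pair: pair[1])
    limit = thresholds.get(species)
    # Accept the call only when the best hit is within that species' own cutoff.
    if limit is not None and best <= limit:
        return species
    return None


# Illustrative input only; these numbers are made up.
example_thresholds = {"Neisseria meningitidis": 0.05, "Haemophilus influenzae": 0.05}
example_hits = [("Neisseria meningitidis", 0.01), ("Haemophilus influenzae", 0.20)]
print(identify_species(example_hits, example_thresholds))
```

Using a per-species cutoff rather than one global value lets closely related species be separated with a stricter limit than distant ones.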


July 7, 2019

STRetch: detecting and discovering pathogenic short tandem repeat expansions.

Short tandem repeat (STR) expansions have been identified as the causal DNA mutation in dozens of Mendelian diseases. Most existing tools for detecting STR variation with short reads do so within the read length and so are unable to detect the majority of pathogenic expansions. Here we present STRetch, a new genome-wide method to scan for STR expansions at all loci across the human genome. We demonstrate the use of STRetch for detecting STR expansions using short-read whole-genome sequencing data at known pathogenic loci as well as novel STR loci. STRetch is open source software, available from github.com/Oshlack/STRetch.


July 7, 2019

Near-complete genome sequences of Streptomyces sp. strains AC1-42T and AC1-42W, isolated from bat guano from Cabalyorisa Cave, Mabini, Pangasinan, Philippines.

Streptomyces sp. strains AC1-42T and AC1-42W, isolated from bat guano from Cabalyorisa Cave, Mabini, Pangasinan, Philippines, are active against Bacillus subtilis subsp. subtilis KCTC 3135T. The near-complete genome sequences reported here represent a possible source of ribosomally synthesized, posttranslationally modified peptides, such as lantipeptides, bacteriocins, linaridin, and a lasso peptide.


July 7, 2019

Genome sequence of Halomonas hydrothermalis Y2, an efficient ectoine-producer isolated from pulp mill wastewater.

Halophilic microorganisms have great potential for biotechnological applications. Halomonas hydrothermalis Y2 is a halotolerant and alkaliphilic strain that was isolated from Na+-rich pulp mill wastewater. The strain is dominant in the bacterial community of pulp mill wastewater and exhibits metabolic diversity in utilizing various substrates. Here we present the genome sequence of this strain, which comprises a circular chromosome 3,933,432 bp in size with a GC content of 60.2%. Diverse genes encoding proteins for compatible solute synthesis and transport were identified in the genome. With a complete pathway for ectoine synthesis, the strain could produce ectoine from monosodium glutamate and partially secrete it into the medium. In addition, deleting the ectoine hydroxylase (EctD) increased ectoine production by around 20%. The genome sequence we report here will provide genetic information regarding adaptive mechanisms of strain Y2 to its harsh habitat, as well as facilitate exploration of metabolic strategies for diverse compatible solutes, e.g., ectoine production. Copyright © 2018 Elsevier B.V. All rights reserved.


July 7, 2019

Complete genome sequence of Salmonella enterica subsp. enterica serotype Derby, associated with the pork sector in France.

In the European Union, Salmonella enterica subsp. enterica serovar Derby is the most abundant serotype isolated from pork. Recent studies have shown that this serotype is polyphyletic. However, one main genomic lineage, characterized by sequence type 40 (ST40), the presence of the Salmonella pathogenicity island 23, and showing resistance to streptomycin, sulphonamides, and tetracycline (STR-SSS-TET), is pork associated. Here, we describe the complete genome sequence of a strain from this lineage isolated in France.


July 7, 2019

Measuring the mappability spectrum of reference genome assemblies

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject’s genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools we computed the “mappability spectrum” for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
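
To make the notion of a minimum uniquely mappable read length concrete, here is a toy sketch. It is not NEAT-Repeat's algorithm: it assumes a zero error rate, exact matching, and the forward strand only, and it counts substrings by brute force, which is only practical for short sequences. The function name `min_unique_lengths` is hypothetical.

```python
from collections import Counter


def min_unique_lengths(seq, max_len=None):
    """For each start position, find the shortest read length that is unique.

    A read starting at position i with length k is "unique" when the
    substring seq[i:i+k] occurs exactly once in seq (exact matching,
    forward strand only). Positions with no unique read that fits
    before the end of the sequence are reported as None.
    """
    n = len(seq)
    max_len = n if max_len is None else min(max_len, n)
    result = [None] * n
    unresolved = set(range(n))
    for k in range(1, max_len + 1):
        # Count every substring of length k once per occurrence.
        counts = Counter(seq[i:i + k] for i in range(n - k + 1))
        for i in list(unresolved):
            if i + k > n:
                # A read of this length would run past the end of the sequence.
                unresolved.discard(i)
            elif counts[seq[i:i + k]] == 1:
                result[i] = k
                unresolved.discard(i)
        if not unresolved:
            break
    return result


print(min_unique_lengths("ACGTACGA"))  # [4, 3, 2, 1, 4, 3, 2, None]
```

In this toy sequence the repeated `ACG` forces positions 0 and 4 to need reads of length 4, while the single `T` is unique at length 1. Tolerating a nonzero error rate would require comparing substrings by edit distance rather than exact equality.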


July 7, 2019

Complete genome sequence of the Arcobacter mytili type strain LMG 24559

Multiple Arcobacter species have been recovered from fresh and contaminated waters, marine environments, and shellfish. Arcobacter mytili was recovered in 2006 from mussels collected from the Ebro River delta in Catalonia, Spain. This study describes the complete whole-genome sequence of the A. mytili type strain LMG 24559 (=F2075T=CECT 7386T).

