September 22, 2019

Reproducible integration of multiple sequencing datasets to form high-confidence SNP, indel, and reference calls for five human genome reference materials

Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent with few restrictions on availability of samples and data is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and elucidate strengths and weaknesses of a method.


September 22, 2019

Long-read genome sequence and assembly of Leptopilina boulardi: a specialist Drosophila parasitoid

Background: Leptopilina boulardi is a specialist parasitoid belonging to the order Hymenoptera, which attacks the larval stages of Drosophila. The Leptopilina genus has enormous value in the biological control of pests as well as in understanding several aspects of host-parasitoid biology. However, no member of the Figitidae family has had its genome sequenced. In order to improve the understanding of the parasitoid wasps by generating genomic resources, we sequenced the whole genome of L. boulardi. Findings: Here, we report a high quality genome of L. boulardi, assembled from 70Gb of Illumina reads and 10.5Gb of PacBio reads, forming a total coverage of 230X. The 375Mb draft genome has an N50 of 275Kb with 6315 scaffolds >500bp, and encompasses >95% complete BUSCOs. The GC% of the genome is 28.26%, and RepeatMasker identified 868105 repeat elements covering 43.9% of the assembly. A total of 25259 protein-coding genes were predicted using a combination of ab-initio and RNA-Seq based methods, with an average gene size of 3.9Kb. 78.11% of the predicted genes could be annotated with at least one function. Conclusion: Our study provides a highly reliable assembly of this parasitoid wasp, which will be a valuable resource to researchers studying parasitoids. In particular, it can help delineate the host-parasitoid mechanisms that are part of the Drosophila-Leptopilina model system.
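
For readers unfamiliar with the N50 statistic quoted in the abstract above, the following is a minimal sketch of how it is commonly computed from a list of scaffold lengths. The function name `n50` and the example lengths are illustrative assumptions and are not taken from the study or its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (in bp), for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

In this toy example the two longest scaffolds already cover half of the total length, so the N50 equals the length of the second of them.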


September 22, 2019

Stress-adaptive responses associated with high-level carbapenem resistance in KPC-producing Klebsiella pneumoniae.

Carbapenem-resistant Enterobacteriaceae (CRE) organisms have emerged to become a major global public health threat among antimicrobial resistant bacterial human pathogens. Little is known about how CREs emerge. One characteristic phenotype of CREs is heteroresistance, which is clinically associated with treatment failure in patients given a carbapenem. Through in vitro whole-transcriptome analysis we tracked gene expression over time in two different strains (BR7, BR21) of heteroresistant KPC-producing Klebsiella pneumoniae, first exposed to a bactericidal concentration of imipenem followed by growth in drug-free medium. In both strains, the immediate response was dominated by a shift in expression of genes involved in glycolysis toward those involved in catabolic pathways. This response was followed by global dampening of transcriptional changes involving protein translation, folding and transport, and decreased expression of genes encoding critical junctures of lipopolysaccharide biosynthesis. The emerged high-level carbapenem-resistant BR21 subpopulation had a prophage (IS1) disrupting ompK36 associated with irreversible OmpK36 porin loss. On the other hand, OmpK36 loss in BR7 was reversible. The acquisition of high-level carbapenem resistance by the two heteroresistant strains was associated with distinct and shared stepwise transcriptional programs. Carbapenem heteroresistance may emerge from the most adaptive subpopulation among a population of cells undergoing a complex set of stress-adaptive responses.


September 22, 2019

The genome sequence of a new strain of Mycobacterium ulcerans ecovar Liflandii, emerging as a sturgeon pathogen

Mycobacterium ulcerans ecovar Liflandii (MuLiflandii) is emerging as a non-mycobacterial pathogen in amphibians. Here, we make the first report on the prevalence of a new strain of MuLiflandii infection in Chinese sturgeon. All the diseased fish showed the classic clinical symptoms of ascites and/or muscle ulceration. A new slow-growing and acid-fast bacillus ASM001 strain was obtained from the ascites of infected fish; this strain demonstrated pathogenicity when tested in hybrid sturgeon. The complete genome sequence of MuLiflandii ASM001 is a circular chromosome of 6,167,296 bp, with a G + C content of 65.57%, containing 4518 predicted coding DNA sequences and 999 pseudo-genes, 3 rRNA operons, and 47 transfer RNA sequences. In addition, we found 245 copies of IS2404, 34 microsatellites, and 36 CRISPR sequences in the whole MuLiflandii ASM001 genome. Among the predicted genes of MuLiflandii ASM001, we found orthologs of 203 virulence factors of clinical MuLiflandii 128FXT operating in host cell invasion, modulation of phagocyte function, and survival inside the macrophages. These virulence factor candidates provide a key basis for understanding their pathogenic mechanisms at the molecular level. A comparative analysis that used complete, existing genomes showed that MuLiflandii ASM001 has high synteny with MuLiflandii 128FXT. We anticipate the availability of the complete MuLiflandii ASM001 genome sequence will provide a valuable resource for comparative genomic studies of MuLiflandii isolates, as well as provide new insights into the host, ecological, and functional diversity of the genus Mycobacterium.


September 22, 2019

The global distribution and spread of the mobilized colistin resistance gene mcr-1.

Colistin represents one of the few available drugs for treating infections caused by carbapenem-resistant Enterobacteriaceae. As such, the recent plasmid-mediated spread of the colistin resistance gene mcr-1 poses a significant public health threat, requiring global monitoring and surveillance. Here, we characterize the global distribution of mcr-1 using a data set of 457 mcr-1-positive sequenced isolates. We find mcr-1 in various plasmid types but identify an immediate background common to all mcr-1 sequences. Our analyses establish that all mcr-1 elements in circulation descend from the same initial mobilization of mcr-1 by an ISApl1 transposon in the mid 2000s (2002-2008; 95% highest posterior density), followed by a marked demographic expansion, which led to its current global distribution. Our results provide the first systematic phylogenetic analysis of the origin and spread of mcr-1, and emphasize the importance of understanding the movement of antibiotic resistance genes across multiple levels of genomic organization.


September 22, 2019

Ploidy variation in Kluyveromyces marxianus separates dairy and non-dairy isolates.

Kluyveromyces marxianus is traditionally associated with fermented dairy products, but can also be isolated from diverse non-dairy environments. Because of thermotolerance, rapid growth and other traits, many different strains are being developed for food and industrial applications but there is, as yet, little understanding of the genetic diversity or population genetics of this species. K. marxianus shows a high level of phenotypic variation but the only phenotype that has been clearly linked to a genetic polymorphism is lactose utilisation, which is controlled by variation in the LAC12 gene. The genomes of several strains have been sequenced in recent years and, in this study, we sequenced a further nine strains from different origins. Analysis of the Single Nucleotide Polymorphisms (SNPs) in 14 strains was carried out to examine genome structure and genetic diversity. SNP diversity in K. marxianus is relatively high, with up to 3% DNA sequence divergence between alleles. It was found that the isolates include haploid, diploid, and triploid strains, as shown by both SNP analysis and flow cytometry. Diploids and triploids contain long genomic tracts showing loss of heterozygosity (LOH). All six isolates from dairy environments were diploid or triploid, whereas 6 out of 7 isolates from non-dairy environments were haploid. This also correlated with the presence of functional LAC12 alleles only in dairy haplotypes. The diploids were hybrids between a non-dairy and a dairy haplotype, whereas triploids included three copies of a dairy haplotype.


September 22, 2019

CliqueSNV: Scalable reconstruction of intra-host viral populations from NGS reads

Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants with few mutations with respect to major strains may result in an immune escape, emergence of drug resistance, and an increase of virulence and infectivity. Next-generation sequencing technologies permit detection of sample intra-host viral population at extremely great depth, thus providing an opportunity to access low-frequency variants. Long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants. In this article, we present CliqueSNV, a novel reference-based method for reconstruction of viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single nucleotide variations and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The new method outperforms existing methods in both accuracy and running time on experimental and simulated NGS data for titrated levels of known viral variants. For PacBio reads, it accurately reconstructs variants with frequency as low as 0.1%. For Illumina reads, it fully reconstructs main variants. The open source implementation of CliqueSNV is freely available for download at https://github.com/vyacheslav-tsivina/CliqueSNV


September 22, 2019

Genomic diversity of Taylorella equigenitalis introduced into the United States from 1978 to 2012.

Contagious equine metritis is a disease of worldwide concern in equids. The United States is considered to be free of the disease although sporadic outbreaks have occurred over the last few decades that were thought to be associated with the importation of horses. The objective of this study was to create finished, reference quality genomes that characterize the diversity of Taylorella equigenitalis isolates introduced into the USA, and identify their differences. Five isolates of T. equigenitalis associated with introductions into the USA from unique sources were sequenced using both short and long read chemistries allowing for complete assembly and annotation. These sequences were compared to previously published genomes as well as the short read sequences of the 200 isolates in the National Veterinary Services Laboratories’ diagnostic repository to identify unique regions and genes, potential virulence factors, and characterize diversity. The 5 genomes varied in size by up to 100,000 base pairs, but averaged 1.68 megabases. The majority of that diversity in size can be explained by repeat regions and 4 main regions of difference, which ranged in size from 15,000 to 45,000 base pairs. The first region of difference contained mostly hypothetical proteins, the second contained the CRISPR, the third contained primarily hemagglutinin proteins, and the fourth contained primarily segments of a type IV secretion system. As expected and previously reported, little evidence of recombination was found within these genomes. Several additional areas of interest were also observed including a mechanism for streptomycin resistance and other virulence factors. A SNP distance comparison of the T. equigenitalis isolates and Mycobacterium tuberculosis complex (MTBC) showed that relatively, T. equigenitalis was a more diverse species than the entirety of MTBC.


September 22, 2019

Targeted sequencing by gene synteny, a new strategy for polyploid species: sequencing and physical structure of a complex sugarcane region.

Sugarcane exhibits a complex genome mainly due to its aneuploid nature and high ploidy level, and sequencing of its genome poses a great challenge. Closely related species with well-assembled and annotated genomes can be used to help assemble complex genomes. Here, a stable quantitative trait locus (QTL) related to sugar accumulation in sorghum was successfully transferred to the sugarcane genome. Gene sequences related to this QTL were identified in silico from sugarcane transcriptome data, and molecular markers based on these sequences were developed to select bacterial artificial chromosome (BAC) clones from the sugarcane variety SP80-3280. Sixty-eight BAC clones containing at least two gene sequences associated with the sorghum QTL were sequenced using Pacific Biosciences (PacBio) technology. Twenty BAC sequences were found to be related to the syntenic region, of which nine were sufficient to represent this region. The strategy we propose is called “targeted sequencing by gene synteny,” which is a simpler approach to understanding the genome structure of complex genomic regions associated with traits of interest.


September 22, 2019

Microsatellite polymorphism in the endangered snail kite reveals a panmictic, low diversity population

Genetic structure and genetic diversity are key population characteristics that can inform conservation decisions, such as delineating management units or assessing potential risks for inbreeding depression. Evidence of genetic structuring or low genetic diversity in the critically endangered snail kite (Rostrhamus sociabilis plumbeus) would have implications for monitoring and planning decisions. Recent work on understanding connectivity across the snail kite range indicated that there is less dispersal between northern and southern parts of the current range, and that dispersal is shaped by individual habitat preference. We examine whether there is neutral genetic structure and the amount of genetic variation in the population by non-lethally sampling 235 nestlings from unique nests across the entire breeding range between 2013 and 2014. Data on 15 microsatellites revealed low diversity (e.g., Na = 2.54, He = 0.37) and range-wide panmixia based on AMOVA, Bayesian clustering, spatial autocorrelation, isolation by distance, and spatially explicit ordination analyses. Our results emphasize that long-term recovery goals and management strategies should be based on viewing snail kites as a single genetic population, despite evidence for non-random dispersal between wetlands over ecological time scales. These results also highlight the need to understand potential effects of low genetic diversity on population dynamics and viability of snail kites. More broadly, these results add to the growing evidence for potential discrepancies between dispersal and genetic patterns, emphasizing that care should be taken if using one to interpret the other, particularly for widely-ranging species.


September 22, 2019

Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy.

Epilepsy is a common neurological disorder, and mutations in genes encoding ion channels or neurotransmitter receptors are frequent causes of monogenic forms of epilepsy. Here we show that abnormal expansions of TTTCA and TTTTA repeats in intron 4 of SAMD12 cause benign adult familial myoclonic epilepsy (BAFME). Single-molecule, real-time sequencing of BAC clones and nanopore sequencing of genomic DNA identified two repeat configurations in SAMD12. Intriguingly, in two families with a clinical diagnosis of BAFME in which no repeat expansions in SAMD12 were observed, we identified similar expansions of TTTCA and TTTTA repeats in introns of TNRC6A and RAPGEF2, indicating that expansions of the same repeat motifs are involved in the pathogenesis of BAFME regardless of the genes in which the expanded repeats are located. This discovery that expansions of noncoding repeats lead to neuronal dysfunction responsible for myoclonic tremor and epilepsy extends the understanding of diseases with such repeat expansion.


September 22, 2019

Challenges of Francisella classification exemplified by an atypical clinical isolate.

The accumulation of sequenced Francisella strains has made it increasingly apparent that the 16S rRNA gene alone is not enough to stratify the Francisella genus into precise and clinically useful classifications. Continued whole-genome sequencing of isolates will provide a larger base of knowledge for targeted approaches with broad applicability. Additionally, examination of genomic information on a case-by-case basis will help resolve outstanding questions regarding strain stratification. We report the complete genome sequence of a clinical isolate, designated here as F. novicida-like strain TCH2015, acquired from the lymph node of a 6-year-old male. Two features were atypical for F. novicida: exhibition of functional oxidase activity and additional gene content, including proposed virulence determinants. These differences, which could potentially impact virulence and clinical diagnosis, emphasize the need for more comprehensive methods to profile Francisella isolates. This study highlights the value of whole-genome sequencing, which will lead to a more robust database of environmental and clinical genomes and inform strategies to improve detection and classification of Francisella strains. Copyright © 2017 Elsevier Inc. All rights reserved.


September 22, 2019

Genomics of habitat choice and adaptive evolution in a deep-sea fish.

Intraspecific diversity promotes evolutionary change, and when partitioned among geographic regions or habitats can form the basis for speciation. Marine species live in an environment that can provide as much scope for diversification in the vertical as in the horizontal dimension. Understanding the relevant mechanisms will contribute significantly to our understanding of eco-evolutionary processes and effective biodiversity conservation. Here, we provide an annotated genome assembly for the deep-sea fish Coryphaenoides rupestris and re-sequencing data to show that differentiation at non-synonymous sites in functional loci distinguishes individuals living at different depths, independent of horizontal spatial distance. Our data indicate disruptive selection at these loci; however, we find no clear evidence for differentiation at neutral loci that may indicate assortative mating. We propose that individuals with distinct genotypes at relevant loci segregate by depth as they mature (supported by survey data), which may be associated with ecotype differentiation linked to distinct phenotypic requirements at different depths.


September 22, 2019

Genome analysis of Fimbriiglobus ruber SP5T, a planctomycete with confirmed chitinolytic capability.

Members of the bacterial order Planctomycetales have often been observed in associations with Crustacea. The ability to degrade chitin, however, has never been reported for any of the cultured planctomycetes although utilization of N-acetylglucosamine (GlcNAc) as a sole carbon and nitrogen source is well recognized for these bacteria. Here, we demonstrate the chitinolytic capability of a member of the family Gemmataceae, Fimbriiglobus ruber SP5T, which was isolated from a peat bog. As revealed by metatranscriptomic analysis of chitin-amended peat, the pool of 16S rRNA reads from F. ruber increased in response to chitin availability. Strain SP5T displayed only weak growth on amorphous chitin as a sole source of carbon but grew well with chitin as a source of nitrogen. The genome of F. ruber SP5T is 12.364 Mb in size and is the largest among all currently determined planctomycete genomes. It encodes several enzymes putatively involved in chitin degradation, including two chitinases affiliated with the glycoside hydrolase (GH) family GH18, GH20 family β-N-acetylglucosaminidase, and the complete set of enzymes required for utilization of GlcNAc. The gene encoding one of the predicted chitinases was expressed in Escherichia coli, and the endochitinase activity of the recombinant enzyme was confirmed. The genome also contains genes required for the assembly of type IV pili, which may be used to adhere to chitin and possibly other biopolymers. The ability to use chitin as a source of nitrogen is of special importance for planctomycetes that inhabit N-depleted ombrotrophic wetlands. IMPORTANCE Planctomycetes represent an important part of the microbial community in Sphagnum-dominated peatlands, but their potential functions in these ecosystems remain poorly understood. This study reports the presence of chitinolytic potential in one of the recently described peat-inhabiting members of the family Gemmataceae, Fimbriiglobus ruber SP5T. This planctomycete uses chitin, a major constituent of fungal cell walls and exoskeletons of peat-inhabiting arthropods, as a source of nitrogen in N-depleted ombrotrophic Sphagnum-dominated peatlands. This study reports the chitin-degrading capability of representatives of the order Planctomycetales. Copyright © 2018 American Society for Microbiology.


September 22, 2019

Comparative genome analysis reveals a complex population structure of Legionella pneumophila subspecies.

The majority of Legionnaires’ disease (LD) cases are caused by Legionella pneumophila, a genetically heterogeneous species composed of at least 17 serogroups. Previously, it was demonstrated that L. pneumophila consists of three subspecies: pneumophila, fraseri and pascullei. During an LD outbreak investigation in 2012, we detected that representatives of both subspecies fraseri and pascullei colonized the same water system and that the outbreak-causing strain was a new member of the least represented subspecies pascullei. We used partial sequence based typing consensus patterns to mine an international database for additional representatives of fraseri and pascullei subspecies. As a result, we identified 46 sequence types (STs) belonging to subspecies fraseri and two STs belonging to subspecies pascullei. Moreover, a recent retrospective whole genome sequencing analysis of isolates from New York State LD clusters revealed the presence of a fourth L. pneumophila subspecies that we have termed raphaeli. This subspecies consists of 15 STs. Comparative analysis was conducted using the genomes of multiple members of all four L. pneumophila subspecies. Whereas each subspecies forms a distinct phylogenetic clade within the L. pneumophila species, they share more average nucleotide identity with each other than with other Legionella species. Unique genes for each subspecies were identified and could be used for rapid subspecies detection. Improved taxonomic classification of L. pneumophila strains may help identify environmental niches and virulence attributes associated with these genetically distinct subspecies. Published by Elsevier B.V.

