September 22, 2019

Reproducible integration of multiple sequencing datasets to form high-confidence SNP, indel, and reference calls for five human genome reference materials

Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent with few restrictions on availability of samples and data is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and elucidate strengths and weaknesses of a method.


September 22, 2019

Comparative genome analysis reveals a complex population structure of Legionella pneumophila subspecies.

The majority of Legionnaires’ disease (LD) cases are caused by Legionella pneumophila, a genetically heterogeneous species composed of at least 17 serogroups. Previously, it was demonstrated that L. pneumophila consists of three subspecies: pneumophila, fraseri and pascullei. During an LD outbreak investigation in 2012, we detected that representatives of both subspecies fraseri and pascullei colonized the same water system and that the outbreak-causing strain was a new member of the least represented subspecies pascullei. We used partial sequence based typing consensus patterns to mine an international database for additional representatives of fraseri and pascullei subspecies. As a result, we identified 46 sequence types (STs) belonging to subspecies fraseri and two STs belonging to subspecies pascullei. Moreover, a recent retrospective whole genome sequencing analysis of isolates from New York State LD clusters revealed the presence of a fourth L. pneumophila subspecies that we have termed raphaeli. This subspecies consists of 15 STs. Comparative analysis was conducted using the genomes of multiple members of all four L. pneumophila subspecies. Whereas each subspecies forms a distinct phylogenetic clade within the L. pneumophila species, they share more average nucleotide identity with each other than with other Legionella species. Unique genes for each subspecies were identified and could be used for rapid subspecies detection. Improved taxonomic classification of L. pneumophila strains may help identify environmental niches and virulence attributes associated with these genetically distinct subspecies. Published by Elsevier B.V.


September 22, 2019

Comparative genomics of smut pathogens: Insights from orphans and positively selected genes into host specialization.

Host specialization is a key evolutionary process for the diversification and emergence of new pathogens. However, the molecular determinants of host range are poorly understood. Smut fungi are biotrophic pathogens that have distinct and narrow host ranges based on largely unknown genetic determinants. Hence, we aimed to expand comparative genomics analyses of smut fungi by including more species infecting different hosts and to define orphans and positively selected genes to gain further insights into the genetic basis of host specialization. We analyzed nine lineages of smut fungi isolated from eight crop and non-crop hosts: maize, barley, sugarcane, wheat, oats, Zizania latifolia (Manchurian rice), Echinochloa colona (a wild grass), and Persicaria sp. (a wild dicot plant). We assembled two new genomes: Ustilago hordei (strain Uhor01) isolated from oats and U. tritici (strain CBS 119.19) isolated from wheat. The smut genomes were small, ranging from 18.38 to 24.63 Mb. U. hordei species experienced genome expansions due to the proliferation of transposable elements, and the amount of these elements varied between the two strains. Phylogenetic analysis confirmed that Ustilago is not a monophyletic genus and, furthermore, detected misclassification of the U. tritici specimen. The comparison between smut pathogens of crop and non-crop hosts did not reveal distinct signatures, suggesting that host domestication did not play a dominant role in shaping the evolution of smuts. We found that host specialization in smut fungi likely has a complex genetic basis: different functional categories were enriched in orphans and lineage-specific selected genes. The diversification and gain/loss of effector genes are probably the most important determinants of host specificity.


September 22, 2019

Comparative genomics of the wheat fungal pathogen Pyrenophora tritici-repentis reveals chromosomal variations and genome plasticity.

Pyrenophora tritici-repentis (Ptr) is a necrotrophic fungal pathogen that causes the major wheat disease, tan spot. We set out to provide essential genomics-based resources in order to better understand the pathogenicity mechanisms of this important pathogen. Here, we present eight new Ptr isolate genomes, assembled and annotated, representing races 1, 2 and 5, and a new race. We report a high quality Ptr reference genome, sequenced by PacBio technology with Illumina paired-end data support and optical mapping. An estimated 98% of the genome coverage was mapped to 10 chromosomal groups, using a two-enzyme hybrid approach. The final reference genome was 40.9 Mb and contained a total of 13,797 annotated genes, supported by transcriptomic and proteogenomics data sets. Whole genome comparative analysis revealed major chromosomal segmental rearrangements and fusions, highlighting intraspecific genome plasticity in this species. Furthermore, the Ptr race classification was not supported at the whole genome level, as phylogenetic analysis did not cluster the ToxA producing isolates. This expansion of available Ptr genomics resources will directly facilitate research aimed at controlling tan spot disease.


September 22, 2019

Transposable element genomic fissuring in Pyrenophora teres is associated with genome expansion and dynamics of host-pathogen genetic interactions.

Pyrenophora teres, P. teres f. teres (PTT) and P. teres f. maculata (PTM) cause significant diseases in barley, but little is known about the large-scale genomic differences that may distinguish the two forms. Comprehensive genome assemblies were constructed from long DNA reads, optical and genetic maps. As repeat masking in fungal genomes influences the final gene annotations, an accurate and reproducible pipeline was developed to ensure comparability between isolates. The genomes of the two forms are highly collinear, each composed of 12 chromosomes. Genome evolution in P. teres is characterized by genome fissuring through the insertion and expansion of transposable elements (TEs), a process that isolates blocks of genic sequence. The phenomenon is particularly pronounced in PTT, which has a larger, more repetitive genome than PTM and more recent transposon activity measured by the frequency and size of genome fissures. PTT has a longer cultivated host association and, notably, a greater range of host-pathogen genetic interactions compared to other Pyrenophora spp., a property which associates better with genome size than pathogen lifestyle. The two forms possess similar complements of TE families with Tc1/Mariner and LINE-like Tad-1 elements more abundant in PTT. Tad-1 was only detectable as vestigial fragments in PTM and, within the forms, differences in genome sizes and the presence and absence of several TE families indicated recent lineage invasions. Gene differences between P. teres forms are mainly associated with gene-sparse regions near or within TE-rich regions, with many genes possessing characteristics of fungal effectors. Instances of gene interruption by transposons resulting in pseudogenization were detected in PTT. In addition, both forms have a large complement of secondary metabolite gene clusters indicating significant capacity to produce an array of different molecules. 
This study provides genomic resources for functional genetics to help dissect factors underlying the host-pathogen interactions.


September 22, 2019

Draft genome of the Peruvian scallop Argopecten purpuratus.

The Peruvian scallop, Argopecten purpuratus, is mainly cultured in southern Chile and Peru and was introduced into China in the last century. Unlike other Argopecten scallops, the Peruvian scallop normally has a long life span of up to 7 to 10 years. Therefore, researchers have been using it to develop hybrid vigor. Here, we performed whole genome sequencing, assembly, and gene annotation of the Peruvian scallop, with an important aim to develop genomic resources for genetic breeding in scallops. A total of 463.19 Gb of raw DNA reads were sequenced. A draft genome assembly of 724.78 Mb was generated (accounting for 81.87% of the estimated genome size of 885.29 Mb), with a contig N50 size of 80.11 kb and a scaffold N50 size of 1.02 Mb. Repeat sequences were calculated to reach 33.74% of the whole genome, and 26,256 protein-coding genes and 3,057 noncoding RNAs were predicted from the assembly. We generated a high-quality draft genome assembly of the Peruvian scallop, which will provide a solid resource for further genetic breeding and for the analysis of the evolutionary history of this economically important scallop.
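The N50 and assembled-fraction figures quoted in this abstract follow standard definitions, which can be sketched as below. This is a minimal illustration rather than the authors' pipeline: the function names and the toy contig lengths are hypothetical, and only the 724.78 Mb and 885.29 Mb values come from the abstract.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


def assembled_fraction(assembly_mb, estimated_genome_mb):
    """Percentage of the estimated genome size covered by the assembly."""
    return 100.0 * assembly_mb / estimated_genome_mb


if __name__ == "__main__":
    # Hypothetical toy contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
    # Values reported in the abstract: 724.78 Mb assembled of 885.29 Mb estimated.
    print(round(assembled_fraction(724.78, 885.29), 2))
```

Applied to a real assembly, the lengths would be read from the contig or scaffold FASTA; contig N50 and scaffold N50 differ only in which set of sequence lengths is supplied.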


September 22, 2019

IMSindel: An accurate intermediate-size indel detection tool incorporating de novo assembly and gapped global-local alignment with split read analysis.

Insertions and deletions (indels) have been implicated in dozens of human diseases through the radical alteration of gene function by short frameshift indels as well as long indels. However, the accurate detection of these indels from next-generation sequencing data is still challenging. This is particularly true for intermediate-size indels (≥50 bp), due to the short DNA sequencing reads. Here, we developed a new method that predicts intermediate-size indels using BWA soft-clipped fragments (unmatched fragments in partially mapped reads) and unmapped reads. We report the performance comparison of our method, GATK, PINDEL and ScanIndel, using whole exome sequencing data from the same samples. False positive and false negative counts were determined through Sanger sequencing of all predicted indels across these four methods. The harmonic mean of the recall and precision, F-measure, was used to measure the performance of each method. Our method achieved the highest F-measure of 0.84 in one sample, compared to 0.56 for GATK, 0.52 for PINDEL and 0.46 for ScanIndel. Similar results were obtained in additional samples, demonstrating that our method was superior to the other methods for detecting intermediate-size indels. We believe that this methodology will contribute to the discovery of intermediate-size indels associated with human disease.
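The F-measure used here is the harmonic mean of precision and recall. A minimal sketch of that calculation follows; the function name and the example counts are hypothetical and are not taken from the study, which reports only the resulting scores.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall from validated call counts."""
    if true_pos == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 validated indels, 2 false calls, 2 missed indels.
    print(f_measure(8, 2, 2))
```

Because the harmonic mean is dominated by the smaller of the two values, a caller cannot score well by maximizing recall at the expense of precision, or the reverse.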


September 22, 2019

Genomic analysis of oral Campylobacter concisus strains identified a potential bacterial molecular marker associated with active Crohn’s disease.

Campylobacter concisus is an oral bacterium that is associated with inflammatory bowel disease (IBD) including Crohn’s disease (CD) and ulcerative colitis (UC). C. concisus consists of two genomospecies (GS) and diverse strains. This study aimed to identify molecular markers to differentiate commensal and IBD-associated C. concisus strains. The genomes of 63 oral C. concisus strains isolated from patients with IBD and healthy controls were examined, of which 38 genomes were sequenced in this study. We identified a novel secreted enterotoxin B homologue, Csep1. The csep1 gene was found in 56% of GS2 C. concisus strains, present in the plasmid pICON or the chromosome. A six-nucleotide insertion at the position 654-659 bp in csep1 (csep1-6bpi) was found. The presence of csep1-6bpi in oral C. concisus strains isolated from patients with active CD (47%, 7/15) was significantly higher than that in strains from healthy controls (0/29, P = 0.0002), and the prevalence of csep1-6bpi positive C. concisus strains was significantly higher in patients with active CD (67%, 4/6) as compared to healthy controls (0/23, P = 0.0006). Proteomics analysis detected the Csep1 protein. A csep1 gene hot spot in the chromosome of different C. concisus strains was found. The pICON plasmid was only found in GS2 strains isolated from the two relapsed CD patients with small bowel complications. This study reports a C. concisus molecular marker (csep1-6bpi) that is associated with active CD.


September 22, 2019

SvABA: genome-wide detection of structural variants and indels by local assembly.

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50-300 bp) SVs. © 2018 Wala et al.; Published by Cold Spring Harbor Laboratory Press.


September 22, 2019

Multi-omics approach identifies novel pathogen-derived prognostic biomarkers in patients with Pseudomonas aeruginosa bloodstream infection

Pseudomonas aeruginosa is a human pathogen that causes health-care associated blood stream infections (BSI). Although P. aeruginosa BSI are associated with high mortality rates, the clinical relevance of pathogen-derived prognostic biomarkers to identify patients at risk for unfavorable outcome remains largely unexplored. We found novel pathogen-derived prognostic biomarker candidates by applying a multi-omics approach on a multicenter sepsis patient cohort. Multi-level Cox regression was used to investigate the relation between patient characteristics and pathogen features (2298 accessory genes, 1078 core protein levels, 107 parsimony-informative variations in reported virulence factors) with 30-day mortality. Our analysis revealed that presence of the helP gene encoding a putative DEAD-box helicase was independently associated with a fatal outcome (hazard ratio 2.01, p = 0.05). helP is located within a region related to the pathogenicity island PAPI-1 in close proximity to a pil gene cluster, which has been associated with horizontal gene transfer. Besides helP, elevated protein levels of the bacterial flagellum protein FliL (hazard ratio 3.44, p < 0.001) and of a bacterioferritin-like protein (hazard ratio 1.74, p = 0.003) increased the risk of death, while high protein levels of a putative aminotransferase were associated with an improved outcome (hazard ratio 0.12, p < 0.001). The prognostic potential of biomarker candidates and clinical factors was confirmed with different machine learning approaches using training and hold-out datasets. The helP genotype appeared to be the most attractive biomarker for clinical risk stratification due to its relevant predictive power and ease of detection.


September 22, 2019

Genomic analysis of a pan-resistant isolate of Klebsiella pneumoniae, United States 2016.

Antimicrobial resistance is a threat to public health globally and leads to an estimated 23,000 deaths annually in the United States alone. Here, we report the genomic characterization of an unusual Klebsiella pneumoniae, nonsusceptible to all 26 antibiotics tested, that was isolated from a U.S. patient. The isolate harbored four known beta-lactamase genes, including plasmid-mediated blaNDM-1 and blaCMY-6, as well as chromosomal blaCTX-M-15 and blaSHV-28, which accounted for resistance to all beta-lactams tested. In addition, sequence analysis identified mechanisms that could explain all other reported nonsusceptibility results, including nonsusceptibility to colistin, tigecycline, and chloramphenicol. Two plasmids, IncA/C2 and IncFIB, were closely related to mobile elements described previously and isolated from Gram-negative bacteria from China, Nepal, India, the United States, and Kenya, suggesting possible origins of the isolate and plasmids. This is one of the first K. pneumoniae isolates in the United States to have been reported to the Centers for Disease Control and Prevention (CDC) as nonsusceptible to all drugs tested, including all beta-lactams, colistin, and tigecycline. IMPORTANCE Antimicrobial resistance is a major public health threat worldwide. Bacteria that are nonsusceptible or resistant to all antimicrobials available are of major concern to patients and the public because of lack of treatment options and potential for spread. A Klebsiella pneumoniae strain that was nonsusceptible to all tested antibiotics was isolated from a U.S. patient. Mechanisms that could explain all observed antimicrobial resistance phenotypes, including resistance to colistin and beta-lactams, were identified through whole-genome sequencing. The large variety of resistance determinants identified demonstrates the usefulness of whole-genome sequencing for detecting these genes in an outbreak response. 
Sequencing of isolates with rare and unusual phenotypes can provide information on how these extremely resistant isolates develop, including whether resistance is acquired on mobile elements or accumulated through chromosomal mutations. Moreover, this provides further insight into not only detecting these highly resistant organisms but also preventing their spread.


September 22, 2019

Genomic architecture of haddock (Melanogrammus aeglefinus) shows expansions of innate immune genes and short tandem repeats.

Increased availability of genome assemblies for non-model organisms has resulted in invaluable biological and genomic insight into numerous vertebrates, including teleosts. Sequencing of the Atlantic cod (Gadus morhua) genome and the genomes of many of its relatives (Gadiformes) demonstrated a shared loss of the major histocompatibility complex (MHC) II genes 100 million years ago. An improved version of the Atlantic cod genome assembly shows an extreme density of tandem repeats compared to other vertebrate genome assemblies. Highly contiguous assemblies are therefore needed to further investigate the unusual immune system of the Gadiformes, and whether the high density of tandem repeats found in Atlantic cod is a shared trait in this group. Here, we have sequenced and assembled the genome of haddock (Melanogrammus aeglefinus) – a relative of Atlantic cod – using a combination of PacBio and Illumina reads. Comparative analyses reveal that the haddock genome contains an even higher density of tandem repeats outside and within protein coding sequences than Atlantic cod. Further, both species show an elevated number of tandem repeats in genes mainly involved in signal transduction compared to other teleosts. A characterization of the immune gene repertoire demonstrates a substantial expansion of MHCI in Atlantic cod compared to haddock. In contrast, the Toll-like receptors show a similar pattern of gene losses and expansions. For the NOD-like receptors (NLRs), another gene family associated with the innate immune system, we find a large expansion common to all teleosts, with possible lineage-specific expansions in zebrafish, stickleback and the codfishes. The generation of a highly contiguous genome assembly of haddock revealed that the high density of short tandem repeats as well as expanded immune gene families is not unique to Atlantic cod – but possibly a feature common to all, or most, codfishes. 
A shared expansion of NLR genes in teleosts suggests that the NLRs have a more substantial role in the innate immunity of teleosts than in other vertebrates. Moreover, we find that high copy number genes combined with variable genome assembly qualities may impede complete characterization of these genes, i.e., the numbers of NLRs in different teleost species might be underestimates.


September 22, 2019

The complete replicons of 16 Ensifer meliloti strains offer insights into intra- and inter-replicon gene transfer, transposon-associated loci, and repeat elements.

Ensifer meliloti (formerly Rhizobium meliloti and Sinorhizobium meliloti) is a model bacterium for understanding legume-rhizobial symbioses. The tripartite genome of E. meliloti consists of a chromosome, pSymA and pSymB, and in some instances strain-specific accessory plasmids. The majority of previous sequencing studies have relied on the use of assemblies generated from short read sequencing, which leads to gaps and assembly errors. Here we used PacBio-based, long-read assemblies and were able to assemble, de novo, complete circular replicons. In this study, we sequenced, de novo-assembled and analysed 10 E. meliloti strains. Sequence comparisons were also done with data from six previously published genomes. We identified genome differences between the replicons, including mol% G+C and gene content, nucleotide repeats, and transposon-associated loci. Additionally, genomic rearrangements both within and between replicons were identified, providing insight into evolutionary processes at the structural level. There were few cases of inter-replicon gene transfer of core genes between the main replicons. Accessory plasmids were more similar to pSymA than to either pSymB or the chromosome, with respect to gene content, transposon content and G+C content. In our population, the accessory plasmids appeared to share an open genome with pSymA, which contains many nodulation- and nitrogen fixation-related genes. This may explain previous observations that horizontal gene transfer has a greater effect on the content of pSymA than pSymB, or the chromosome, and why some rhizobia show unstable nodulation phenotypes on legume hosts.


September 22, 2019

The Egyptian rousette genome reveals unexpected features of bat antiviral immunity.

Bats harbor many viruses asymptomatically, including several notorious for causing extreme virulence in humans. To identify differences between antiviral mechanisms in humans and bats, we sequenced, assembled, and analyzed the genome of Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus. We found an expanded and diversified KLRC/KLRD family of natural killer cell receptors, MHC class I genes, and type I interferons, which dramatically differ from their functional counterparts in other mammals. Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense. An evaluation of the theoretical function of these genes suggests that an inhibitory immune state may exist in bats. Based on our findings, we hypothesize that tolerance of viral infection, rather than enhanced potency of antiviral defenses, may be a key mechanism by which bats asymptomatically host viruses that are pathogenic in humans. Copyright © 2018 Elsevier Inc. All rights reserved.


September 22, 2019

De novo genome assembly of the red silk cotton tree (Bombax ceiba).

Bombax ceiba L. (the red silk cotton tree) is a large deciduous tree that is distributed in tropical and sub-tropical Asia as well as northern Australia. It has great economic and ecological importance, with several applications in industry and traditional medicine in many Asian countries. To facilitate further utilization of this plant resource, we present here the draft genome sequence for B. ceiba. We assembled a relatively intact genome of B. ceiba by using PacBio single-molecule sequencing and BioNano optical mapping technologies. The final draft genome is approximately 895 Mb long, with contig and scaffold N50 sizes of 1.0 Mb and 2.06 Mb, respectively. The high-quality draft genome assembly of B. ceiba will be a valuable resource enabling further genetic improvement and more effective use of this tree species.

