Preprint Archives

April 21, 2020

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.

April 21, 2020

A robust benchmark for germline structural variant detection

New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.

April 21, 2020

The Genome of the Zebra Mussel, Dreissena polymorpha: A Resource for Invasive Species Research

The zebra mussel, Dreissena polymorpha, continues to spread from its native range in Eurasia to Europe and North America, causing billions of dollars in damage and dramatically altering invaded aquatic ecosystems. Despite these impacts, there are few genomic resources for Dreissena or related bivalves, with nearly 450 million years of divergence between zebra mussels and their closest sequenced relative. Although the D. polymorpha genome is highly repetitive, we have used a combination of long-read sequencing and Hi-C-based scaffolding to generate the highest quality molluscan assembly to date. Through comparative analysis and transcriptomics experiments we have gained insights into processes that likely control the invasive success of zebra mussels, including shell formation, synthesis of byssal threads, and thermal tolerance. We identified multiple intact Steamer-Like Elements, a retrotransposon that has been linked to transmissible cancer in marine clams. We also found that D. polymorpha has an unusual 67 kb mitochondrial genome containing numerous tandem repeats, making it the largest observed in Eumetazoa. Together these findings create a rich resource for invasive species research and control efforts.

April 21, 2020

Pseudo-chromosome length genome assembly of a double haploid ‘Bartlett’ pear (Pyrus communis L.)

We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII long-read sequencing (PacBio), Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. A total of 496.9 million bases (Mb) corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 chromosomes of the pear genome. About 50% (247 Mb) of the genome consists of repetitive sequences. Comparison with previous assemblies of Pyrus communis and Pyrus x bretschneideri confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted.

April 21, 2020

Complete genome sequence and annotation of the laboratory reference strain Shigella flexneri serovar 5a M90T and genome-wide transcription start site determination

Background Shigella is a Gram-negative facultative intracellular bacterium that causes bacillary dysentery in humans. Shigella invades cells of the colonic mucosa owing to its virulence plasmid-encoded Type 3 Secretion System (T3SS), and multiplies in the target cell cytosol. Although the laboratory reference strain S. flexneri serotype 5a M90T has been extensively used to understand the molecular mechanisms of pathogenesis, its complete genome sequence is not available, thereby greatly limiting studies employing high-throughput sequencing and systems biology approaches. Results We have sequenced, assembled, annotated and manually curated the full genome of S. flexneri 5a M90T. This yielded two complete circular contigs, the chromosome and the virulence plasmid (pWR100). To obtain the genome sequence, we have employed long-read PacBio DNA sequencing followed by polishing with Illumina RNA-seq data. This provides a new pipeline to prepare gapless, highly accurate genome sequences. Furthermore, we have performed genome-wide analysis of transcriptional start sites and determined the length of 5’ untranslated regions (5’-UTRs) at typical culture conditions for the inoculum of in vitro infection experiments. We identified 6,723 primary TSS (pTSS) and 7,328 secondary TSS (sTSS). The S. flexneri 5a M90T annotated genome sequence and the transcriptional start sites are integrated into RegulonDB (http://regulondb.ccg.unam.mx) and RSAT (http://embnet.ccg.unam.mx/rsat/) so that their analysis tools can be used on the S. flexneri 5a M90T genome. Conclusions We provide the first complete genome for S. flexneri serotype 5a, specifically the laboratory reference strain M90T. Our work opens the possibility of employing S. flexneri M90T in high-quality systems biology studies such as transcriptomic and differential expression analyses or in genome evolution studies. Moreover, the catalogue of TSS that we report here can be used in molecular pathogenesis studies as a resource to know which genes are transcribed before infection of host cells. The genome sequence, together with the analysis of transcriptional start sites, is also a valuable tool for precise genetic manipulation of S. flexneri 5a M90T. The hybrid pipeline that we report here, combining genome sequencing with long-read technology and polishing with RNA-seq data, defines a powerful strategy for genome assembly, polishing and annotation in any type of organism.

April 21, 2020

Soil Probiotic Utilizes Plant and Pollinator Transport for Territorial Expansion

Microbe-plant interactions are linked with the core microbiota, and both the plant and the microbial partners depend on one another to thrive in nature. However, why and how the below-ground core microbiota become established aboveground is poorly understood. We tracked the movement of a probiotic Streptomyces endophyte throughout a managed strawberry ecosystem. Probiotics in the rhizosphere and anthosphere were genetically identical, yet these niches were segregated in space and time. The probiotic in the rhizosphere moved upward via the vascular bundle, relocated to aboveground plant parts, and protected against Botrytis cinerea. It also moved from flowers to roots, and among flowers via pollinators that were protected against pollinator pathogens. Our results provide solid evidence of a tripartite interaction in which Streptomyces exploits plant and pollinator partners.

April 21, 2020

Integrating multiple genomic technologies to investigate an outbreak of carbapenemase-producing Enterobacter hormaechei

Carbapenem-resistant Enterobacteriaceae (CRE) represent one of the most urgent threats to human health posed by antibiotic resistant bacteria. Enterobacter hormaechei and other members of the Enterobacter cloacae complex are the most commonly encountered Enterobacter spp. within clinical settings, responsible for numerous outbreaks and ultimately poorer patient outcomes. Here we applied three complementary whole genome sequencing (WGS) technologies to characterise a hospital cluster of blaIMP-4 carbapenemase-producing E. hormaechei. In response to a suspected CRE outbreak in 2015 within an Intensive Care Unit (ICU)/Burns Unit in a Brisbane tertiary referral hospital we used Illumina sequencing to determine that all outbreak isolates were sequence type (ST)90 and near-identical at the core genome level. Comparison to publicly available data unequivocally linked all 10 isolates to a 2013 isolate from the same ward, confirming the hospital environment as the most likely original source of infection in the 2015 cases. No clonal relationship was found to IMP-4-producing isolates identified from other local hospitals. However, using Pacific Biosciences long-read sequencing we were able to resolve the complete context of the blaIMP-4 gene, which was found to be on a large IncHI2 plasmid carried by all IMP-4-producing isolates. Continued surveillance of the hospital environment was carried out using Oxford Nanopore long-read sequencing, which was able to rapidly resolve the true relationship of subsequent isolates to the initial outbreak. Shotgun metagenomic sequencing of environmental samples also found evidence of ST90 E. hormaechei and the IncHI2 plasmid within the hospital plumbing. Overall, our strategic application of three WGS technologies provided an in-depth analysis of the outbreak, including the transmission dynamics of a carbapenemase-producing E. hormaechei cluster, identification of possible hospital reservoirs and the full context of blaIMP-4 on a multidrug resistant IncHI2 plasmid that appears to be widely distributed in Australia.

April 21, 2020

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Long-read RNA sequencing (RNA-seq) is promising for transcriptomics studies; however, the alignment of the reads is still a fundamental but non-trivial task due to sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult issues, such as small exons, serious sequencing errors and consensus spliced alignment. Benchmarks demonstrate that this approach has a better ability to produce high-quality full-length alignments, which has enormous potential for transcriptomics studies.

April 21, 2020

Extended haplotype phasing of de novo genome assemblies with FALCON-Phase

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.

April 21, 2020

Biogeography and Microscale Diversity Shape the Biosynthetic Potential of Fungus-growing Ant-associated Pseudonocardia

The geographic and phylogenetic scale of ecologically relevant microbial diversity is still poorly understood. Using a model mutualism, fungus-growing ants and their defensive bacterial associate Pseudonocardia, we analyzed genetic diversity and biosynthetic potential in 46 strains isolated from ant colonies in a 20 km transect near Barro Colorado Island in Panama. Despite an average pairwise core genome similarity of greater than 99%, population genomic analysis revealed several distinct bacterial populations matching ant host geographic distribution. We identified both genetic diversity signatures and divergent genes distinct to each lineage. We also identified natural product biosynthesis clusters specific to isolation locations. These geographic patterns were observable despite the populations living in close proximity to each other and provide evidence of ongoing genetic exchange. Our results add to the growing body of literature suggesting that variation in traits of interest can be found at extremely fine phylogenetic scales.

April 21, 2020

Variation in genome content and predatory phenotypes between Bdellovibrio sp. NC01 isolated from soil and B. bacteriovorus type strain HD100

The range of naturally occurring variation in the ability of Bdellovibrio strains to attack and kill Gram-negative bacteria is not well understood. Defining phenotypic and associated genotypic variation among Bdellovibrio may further our understanding of how this genus impacts microbial communities. In addition, comparisons of the predatory phenotypes of divergent strains may inform the development of Bdellovibrio as biocontrol agents to combat bacterial infections. We isolated Bdellovibrio sp. NC01 from soil and compared its genome and predatory phenotypes to B. bacteriovorus type strain HD100. Based on analysis of 16S rRNA gene sequences and average amino acid identity, NC01 belongs to a different species than HD100. Genome-wide comparisons and individual gene analyses indicated that eight NC01 genome regions were likely acquired by horizontal gene transfer (HGT), further supporting an important role for HGT in Bdellovibrio genome evolution. Within these regions, multiple protein-coding sequences were assigned predicted functions related to transcriptional regulation and transport; however, most were annotated as hypothetical proteins. Compared to HD100, NC01 has a limited prey range and kills E. coli ML35 less efficiently. Whereas HD100 drastically reduces the ML35 population and then maintains low prey population density, NC01 causes a smaller reduction in ML35, after which the prey population recovers, accompanied by a decrease in NC01. In addition, NC01 forms turbid plaques on lawns of E. coli ML35, in contrast to clear plaques formed by HD100. Characterizing variation in interactions between Bdellovibrio and Gram-negative bacteria, such as observed with NC01 and HD100, is valuable for understanding the ecological significance of predatory bacteria and evaluating their effectiveness in clinical applications.

April 21, 2020

Divergent selection following speciation in two ectoparasitic honey bee mites

Multispecies host-parasite evolution is common, but how parasites evolve after speciating remains poorly understood. Shared evolutionary history and physiology may propel species along similar evolutionary trajectories whereas pursuing different strategies can reduce competition. We test these scenarios in the economically important association between honey bees and ectoparasitic mites by sequencing the genomes of the sister mite species Varroa destructor and Varroa jacobsoni. These genomes were closely related, with 99.7% sequence identity. Among the 9,628 orthologous genes, 4.8% showed signs of positive selection in at least one species. Divergent selective trajectories were discovered in conserved chemosensory gene families (IGR, SNMP), and Halloween genes (CYP) involved in moulting and reproduction. However, there was little overlap in these gene sets and associated GO terms, indicating different selective regimes operating on each of the parasites. Based on our findings, we suggest that species-specific strategies may be needed to combat evolving parasite communities.

April 21, 2020

SyRI: identification of syntenic and rearranged regions from whole-genome assemblies

We present SyRI, an efficient tool for genome-wide identification of structural rearrangements (SR) from genome graphs, which are built up from pair-wise whole-genome alignments. Instead of searching for differences, SyRI starts by finding all co-linear regions between the genomes. As all remaining regions are SRs by definition, they can be classified as inversions, translocations, or duplications based on their positions in convoluted networks of repetitive alignments. Finally, SyRI reports local variations like SNPs and indels within syntenic and rearranged regions. We show SyRI's broad applicability to multiple species and genetically validate the presence of ~100 translocations identified in Arabidopsis.

April 21, 2020

Draft Genome Assembly and Annotation of Red Raspberry Rubus Idaeus

The red raspberry, Rubus idaeus, is widely distributed in all temperate regions of Europe, Asia, and North America and is a major commercial fruit valued for its taste and its high antioxidant and vitamin content. However, Rubus breeding is a long and slow process hampered by limited genomic and molecular resources. Genomic resources such as a complete genome sequence and transcriptome will be of exceptional value to improve research and breeding of this high value crop. Using a hybrid sequence assembly approach including data from both long and short sequence reads, we present the first assembly of the Rubus idaeus genome (Joan J. variety). The de novo assembled genome consists of 2,145 scaffolds with a genome completeness of 95.3% and an N50 score of 638 KB. Leveraging a linkage map, we anchored 80.1% of the genome onto seven chromosomes. Using over 1 billion paired-end RNAseq reads, we annotated 35,566 protein coding genes with a transcriptome completeness score of 97.2%. The Rubus idaeus genome provides an important new resource for researchers and breeders.

April 21, 2020

Virus-host coexistence in phytoplankton through the genomic lens

Phytoplankton-virus interactions are major determinants of geochemical cycles in the oceans. Viruses are responsible for the redirection of carbon and nutrients away from larger organisms back towards microorganisms via the lysis of microalgae in a process coined the “viral shunt”. Virus-host interactions are generally expected to follow “boom and bust” dynamics, whereby a numerically dominant strain is lysed and replaced by a virus resistant strain. Here, we isolated a microalga and its infective nucleo-cytoplasmic large DNA virus (NCLDV) concomitantly from the environment in the surface NW Mediterranean Sea, Ostreococcus mediterraneus, and show continuous growth in culture of both the microalga and the virus. Evolution experiments through single cell bottlenecks demonstrate that, in the absence of the virus, susceptible cells evolve from one ancestral resistant single cell, and vice versa; that is, resistant cells evolve from one ancestral susceptible cell. This provides evidence that the observed sustained viral production is the consequence of a minority of virus-susceptible cells. The emergence of these cells is explained by low-level phase switching between virus-resistant and virus-susceptible phenotypes, akin to a bet hedging strategy. Whole genome sequencing and analysis of the ~14 Mb microalga and the ~200 kb virus point towards ancient speciation of the microalga within the Ostreococcus species complex and frequent gene exchanges between prasinoviruses infecting Ostreococcus species. Re-sequencing of one susceptible strain demonstrated that the phase switch involved a large 60 kb deletion of one chromosome. This chromosome is an outlier chromosome compared to the streamlined, gene dense, GC-rich standard chromosomes, as it contains many repeats and few orthologous genes. While this chromosome has been described in three different genera, its size increments have been previously associated with antiviral immunity and resistance in another species from the same genus. Mathematical modelling of this mechanism predicts microalga-virus population dynamics consistent with the observation of continuous growth of both virus and microalga. Altogether, our results suggest a previously overlooked strategy in phytoplankton-virus interactions.
