Structural variation Archives - Page 23 of 31

July 7, 2019

Plasmid composition in Aeromonas salmonicida subsp. salmonicida 01-B526 unravels unsuspected type three secretion system loss patterns.

Aeromonas salmonicida subsp. salmonicida is a ubiquitous psychrophilic waterborne bacterium and a fish pathogen. The numerous mobile elements, especially insertion sequences (IS), in its genome promote rearrangements that impact its phenotype. One of the main virulence factors of this bacterium, its type three secretion system (TTSS), is affected by these rearrangements. In Aeromonas salmonicida subsp. salmonicida most of the TTSS genes are encoded in a single locus on a large plasmid called pAsa5, and may be lost when the bacterium is cultivated at a higher temperature (25 °C), producing non-virulent mutants. In a previous study, pAsa5-rearranged strains that lacked the TTSS locus on pAsa5 were produced using parental strains, including 01-B526. Some of the generated deletions were explained by homologous recombination between ISs found on pAsa5, whereas the others remained unresolved. To investigate those rearrangements, short- and long-read high-throughput sequencing technologies were used on the A. salmonicida subsp. salmonicida 01-B526 whole genome. Whole genome sequencing of the 01-B526 strain revealed that its pAsa5 has an additional IS copy, an ISAS5, compared to the reference strain (A449) sequence, which allowed for a previously unknown rearrangement to occur. It also appeared that 01-B526 bears a second large plasmid, named pAsa9, which shares 40 kbp of highly similar sequences with pAsa5. Following these discoveries, previously unexplained deletions were elucidated by genotyping. Furthermore, in one of the derived strains a fusion of pAsa5 and pAsa9, involving the newly discovered ISAS5 copy, was observed. The loss of TTSS and hence virulence is explained by one consistent mechanism: IS-driven homologous recombination. The similarities between pAsa9 and pAsa5 also provide another example of genetic diversity driven by ISs.

July 7, 2019

Discovery and genotyping of novel sequence insertions in many sequenced individuals

Motivation: Despite recent advances in algorithm design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. Result: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. Availability and implementation: Pamir is available at https://github.com/vpc-ccg/pamir.
Contact: fhach@sfu.ca, prostatecentre.com or calkan@cs.bilkent.edu.tr. Supplementary information: Supplementary data are available at Bioinformatics online.

July 7, 2019

Evidence for contemporary switching of the O-antigen gene cluster between Shiga toxin-producing Escherichia coli strains colonizing cattle.

Shiga toxin-producing Escherichia coli (STEC) comprise a group of zoonotic enteric pathogens with ruminants, especially cattle, as the main reservoir. O-antigens are instrumental for host colonization and bacterial niche adaptation. They are highly immunogenic and, therefore, targeted by the adaptive immune system. The O-antigen is one of the most diverse bacterial cell constituents and variation not only exists between different bacterial species, but also between individual isolates/strains within a single species. We recently identified STEC persistently infecting cattle and belonging to the different serotypes O156:H25 (n = 21) and O182:H25 (n = 15) that were of the MLST sequence types ST300 or ST688. These STs differ by a single nucleotide in purA only. Fitness- and virulence-associated genome regions and CRISPR/CAS (clustered regularly interspaced short palindromic repeats/CRISPR associated sequence) arrays of these STEC O156:H25 and O182:H25 isolates were highly similar, and identical genomic integration sites for the stx converting bacteriophages and the core LEE, identical Shiga toxin converting bacteriophage genes for stx1a, identical complete LEE loci, and identical sets of chemotaxis and flagellar genes were identified. In contrast to this genomic similarity, the nucleotide sequences of the O-antigen gene cluster (O-AGC) regions between galF and gnd and very few flanking genes differed fundamentally and were specific for the respective serotype. Sporadic aEPEC O156:H8 isolates (n = 5) were isolated in temporal and spatial proximity. While the O-AGC and the corresponding 5′ and 3′ flanking regions of these aEPEC isolates were identical to the respective region in the STEC O156:H25 isolates, the core genome, the virulence associated genome regions and the CRISPR/CAS elements differed profoundly. Our cumulative epidemiological and molecular data suggest a recent switch of the O-AGC between isolates with O156:H8 strains having served as DNA donors.
Such O-antigen switches can affect the evaluation of a strain’s pathogenic and virulence potential, suggesting that NGS methods might lead to a more reliable risk assessment.

July 7, 2019

Whole-genome restriction mapping by “subhaploid”-based RAD sequencing: An efficient and flexible approach for physical mapping and genome scaffolding.

Assembly of complex genomes using short reads remains a major challenge, which usually yields highly fragmented assemblies. Generation of ultradense linkage maps is promising for anchoring such assemblies, but traditional linkage mapping methods are hindered by the infrequency and unevenness of meiotic recombination that limit attainable map resolution. Here we develop a sequencing-based “in vitro” linkage mapping approach (called RadMap), where chromosome breakage and segregation are realized by generating hundreds of “subhaploid” fosmid/bacterial-artificial-chromosome clone pools, and by restriction site-associated DNA sequencing of these clone pools to produce an ultradense whole-genome restriction map to facilitate genome scaffolding. A bootstrap-based minimum spanning tree algorithm is developed for grouping and ordering of genome-wide markers and is implemented in a user-friendly, integrated software package (AMMO). We perform extensive analyses to validate the power and accuracy of our approach in the model plant Arabidopsis thaliana and human. We also demonstrate the utility of RadMap for enhancing the contiguity of a variety of whole-genome shotgun assemblies generated using either short Illumina reads (300 bp) or long PacBio reads (6-14 kb), with up to 15-fold improvement of N50 (~816 kb-3.7 Mb) and high scaffolding accuracy (98.1-98.5%). RadMap outperforms BioNano and Hi-C when input assembly is highly fragmented (contig N50 = 54 kb). RadMap can capture wide-range contiguity information and provide an efficient and flexible tool for high-resolution physical mapping and scaffolding of highly fragmented assemblies. Copyright © 2017 Dou et al.
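The contiguity gains above are reported as N50, so a short worked example of that metric may help. The sketch below uses the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the RadMap or AMMO software.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum past half the
        # total assembly size is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy lengths in kb (made up for illustration only).
contigs = [50, 40, 30, 20, 10]
# The same 150 kb after hypothetical scaffolding joins.
scaffolds = [90, 60]

fold_improvement = n50(scaffolds) / n50(contigs)
```

Joining contigs leaves the total length unchanged but moves more of it into long sequences, which is why scaffolding raises N50.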

July 7, 2019

Hybrid assembly with long and short reads improves discovery of gene family expansions.

Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. We developed a hybrid assembly pipeline called “Alpaca” that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.

July 7, 2019

CLOVE: classification of genomic fusions into structural variation events.

A precise understanding of structural variants (SVs) in DNA is important in the study of cancer and population diversity. Many methods have been designed to identify SVs from DNA sequencing data. However, the problem remains challenging because existing approaches suffer from low sensitivity, precision, and positional accuracy. Furthermore, many existing tools only identify breakpoints, and so do not collect related breakpoints and classify them as a particular type of SV. Due to the rapidly increasing usage of high throughput sequencing technologies in this area, there is an urgent need for algorithms that can accurately classify complex genomic rearrangements (involving more than one breakpoint or fusion). We present CLOVE, an algorithm for integrating the results of multiple breakpoint or SV callers and classifying the results as a particular SV. CLOVE is based on a graph data structure that is created from the breakpoint information. The algorithm looks for patterns in the graph that are characteristic of more complex rearrangement types. CLOVE is able to integrate the results of multiple callers, producing a consensus call. We demonstrate using simulated and real data that re-classified SV calls produced by CLOVE improve on the raw call set of existing SV algorithms, particularly in terms of accuracy. CLOVE is freely available from http://www.github.com/PapenfussLab .
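To make the idea of collecting related breakpoints into one event concrete, here is a minimal sketch. It is not CLOVE's data structure or rule set: the `Fusion` record, the orientation convention ("+" keeps the sequence to the left of a position, "-" keeps the sequence to the right, as in common paired-breakend formats), the `slack` tolerance, and all function names are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record for one fusion (a pair of joined breakpoints).
Fusion = namedtuple("Fusion", "chrom1 pos1 ori1 chrom2 pos2 ori2")


def junction_type(f):
    """Label a single fusion from its chromosomes and orientations alone."""
    if f.chrom1 != f.chrom2:
        return "interchromosomal"
    (_, ori_lo), (_, ori_hi) = sorted([(f.pos1, f.ori1), (f.pos2, f.ori2)])
    simple = {"+-": "deletion", "-+": "tandem_duplication"}
    return simple.get(ori_lo + ori_hi, "inversion_junction")


def span(f):
    return f.chrom1, min(f.pos1, f.pos2), max(f.pos1, f.pos2)


def classify(fusions, slack=50):
    """Merge paired inversion junctions into one event; pass others through."""
    events, used = [], set()
    for i, a in enumerate(fusions):
        if i in used:
            continue
        used.add(i)
        kind = junction_type(a)
        if kind == "inversion_junction":
            # An inversion leaves two junctions ("++" and "--") over
            # roughly the same interval; look for the partner.
            for j, b in enumerate(fusions):
                if j in used or junction_type(b) != "inversion_junction":
                    continue
                (ca, sa, ea), (cb, sb, eb) = span(a), span(b)
                near = abs(sa - sb) <= slack and abs(ea - eb) <= slack
                if ca == cb and a.ori1 != b.ori1 and near:
                    used.add(j)
                    kind = "inversion"
                    break
        events.append((kind, span(a)))
    return events


calls = [
    Fusion("chr1", 1000, "+", "chr1", 5000, "+"),
    Fusion("chr1", 1010, "-", "chr1", 5005, "-"),
    Fusion("chr2", 200, "+", "chr2", 900, "-"),
    Fusion("chr3", 700, "-", "chr3", 1500, "+"),
]
events = classify(calls)
```

Four raw fusions collapse into three events here, because the two chr1 junctions are reported together as one inversion rather than as two unrelated breakpoints.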

July 7, 2019

Genome graphs

There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.
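A toy example may make the graph idea more tangible. The sketch below encodes a single-nucleotide variant as a two-branch "bubble" in a tiny directed graph and shows a read that matches a graph path perfectly but does not match the linear reference path. The node sequences, node identifiers, and function names are invented for illustration, and brute-force path enumeration is only workable at this toy scale; it is not how a real graph mapper operates.

```python
# Node id -> sequence; nodes 2 and 3 are the two alleles of one SNP.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
# Node id -> successor node ids.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}
# The path that corresponds to the linear reference in this toy example.
reference_path = [1, 2, 4]


def paths(node, edges):
    """Yield every path from `node` to a node with no successors."""
    if not edges[node]:
        yield [node]
    for nxt in edges[node]:
        for rest in paths(nxt, edges):
            yield [node] + rest


def spell(path, nodes):
    """Concatenate the node sequences along a path."""
    return "".join(nodes[n] for n in path)


def perfect_hits(read, nodes, edges, source=1):
    """Return the paths whose spelled sequence contains the read exactly."""
    return [p for p in paths(source, edges) if read in spell(p, nodes)]


read = "GTGTT"  # carries the non-reference G allele
hits = perfect_hits(read, nodes, edges)
maps_to_linear_reference = read in spell(reference_path, nodes)
```

Against the linear path the read would need a mismatch, whereas in the graph the variant is already present and is simply re-identified by the path the read follows.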

July 7, 2019

Tandem duplications lead to novel expression patterns through exon shuffling in Drosophila yakuba.

One common hypothesis to explain the impacts of tandem duplications is that whole gene duplications commonly produce additive changes in gene expression due to copy number changes. Here, we use genome wide RNA-seq data from a population sample of Drosophila yakuba to test this ‘gene dosage’ hypothesis. We observe little evidence of expression changes in response to whole transcript duplication capturing 5′ and 3′ UTRs. Among whole gene duplications, we observe evidence that dosage sharing across copies is likely to be common. The lack of expression changes after whole gene duplication suggests that the majority of genes are subject to tight regulatory control and therefore not sensitive to changes in gene copy number. Rather, we observe changes in expression level due to both shuffling of regulatory elements and the creation of chimeric structures via tandem duplication. Additionally, we observe 30 de novo gene structures arising from tandem duplications, 23 of which form with expression in the testes. Thus, the value of tandem duplications is likely to be more intricate than simple changes in gene dosage. The common regulatory effects from chimeric gene formation after tandem duplication may explain their contribution to genome evolution.

July 7, 2019

Whole genome sequencing predicts novel human disease models in rhesus macaques.

Rhesus macaques are an important pre-clinical model of human disease. To advance our understanding of genomic variation that may influence disease, we surveyed genome-wide variation in 21 rhesus macaques. We employed best-practice variant calling, validated with Mendelian inheritance. Next, we used alignment data from our cohort to detect genomic regions likely to produce inaccurate genotypes, potentially due to either gene duplication or structural variation between individuals. We generated a final dataset of >16 million high confidence variants, including 13 million in Chinese-origin rhesus macaques, an increasingly important disease model. We detected an average of 131 mutations predicted to severely alter protein coding per animal, and identified 45 such variants that coincide with known pathogenic human variants. These data suggest that expanded screening of existing breeding colonies will identify novel models of human disease, and that increased genomic characterization can help inform research studies in macaques. Copyright © 2017 Elsevier Inc. All rights reserved.
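The abstract does not describe how the Mendelian validation was carried out. As a hedged illustration, one common check asks whether each offspring genotype can be formed by taking one allele from each parent. The sketch below assumes diploid autosomal genotypes written as allele pairs and ignores true de novo mutations; the function names and example trios are hypothetical.

```python
def mendelian_consistent(child, mother, father):
    """True if the child's two alleles can be drawn one from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def violation_rate(trios):
    """Fraction of (child, mother, father) genotype trios that are inconsistent."""
    checked = [mendelian_consistent(*trio) for trio in trios]
    return checked.count(False) / len(checked)


# Genotypes as allele pairs: 0 = reference allele, 1 = alternate allele.
trios = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: mother cannot pass on a 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 1), (0, 0)),  # consistent
]
rate = violation_rate(trios)
```

A high violation rate at a site or region is usually read as a sign of genotyping error rather than biology, which is why this kind of check is useful for flagging unreliable calls.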

July 7, 2019

The genetic basis of resistance and matching-allele interactions of a host-parasite system: The Daphnia magna-Pasteuria ramosa model.

Negative frequency-dependent selection (NFDS) is an evolutionary mechanism suggested to govern host-parasite coevolution and the maintenance of genetic diversity at host resistance loci, such as the vertebrate MHC and R-genes in plants. Matching-allele interactions of hosts and parasites that prevent the emergence of host and parasite genotypes that are universally resistant and infective are a genetic mechanism predicted to underpin NFDS. The underlying genetics of matching-allele interactions are unknown even in host-parasite systems with empirical support for coevolution by NFDS, as is the case for the planktonic crustacean Daphnia magna and the bacterial pathogen Pasteuria ramosa. We fine-map one locus associated with D. magna resistance to P. ramosa and genetically characterize two haplotypes of the Pasteuria resistance (PR-) locus using de novo genome and transcriptome sequencing. Sequence comparison finds dramatic structural polymorphisms between PR-locus haplotypes, including a large portion of each haplotype being composed of non-homologous sequences, resulting in haplotypes differing in size by 66 kb. The high divergence of PR-locus haplotypes suggests a history of multiple, diverse and repeated instances of structural mutation events and restricted recombination. Annotation of the haplotypes reveals striking differences in gene content. In particular, a group of glycosyltransferase genes is present in the susceptible but absent in the resistant haplotype. Moreover, in natural populations, we find that the PR-locus polymorphism is associated with variation in resistance to different P. ramosa genotypes, pointing to the PR-locus polymorphism as being responsible for the matching-allele interactions that have been previously described for this system. Our results conclusively identify a genetic basis for the matching-allele interaction observed in a coevolving host-parasite system and provide a first insight into its molecular basis.

July 7, 2019

Critical points for an accurate human genome analysis.

Next-generation sequencing is radically changing how DNA diagnostic laboratories operate. What started as a single-gene profession is now developing into gene panel sequencing and whole-exome and whole-genome sequencing (WES/WGS) analyses. With further advances in sequencing technology and concomitant price reductions, WGS will soon become the standard and be routinely offered. Here, we focus on the critical steps involved in performing WGS, with a particular emphasis on points where WGS differs from WES, the important variables that should be taken into account, and the quality control measures that can be taken to monitor the process. The points discussed here, combined with recent publications on guidelines for reporting variants, will facilitate the routine implementation of WGS into a diagnostic setting. © 2017 Wiley Periodicals, Inc.

July 7, 2019

Automated structural variant verification in human genomes using single-molecule electronic DNA mapping.

The importance of structural variation in human disease and the difficulty of detecting structural variants larger than 50 base pairs have led to the development of several long-read sequencing technologies and optical mapping platforms. Frequently, multiple technologies and ad hoc methods are required to obtain a consensus regarding the location, size and nature of a structural variant, with no approach able to reliably bridge the gap of variant sizes between the domain of short-read approaches and the largest rearrangements observed with optical mapping. To address this unmet need, we have developed a new software package, SV-Verify™, which utilizes data collected with the Nabsys High Definition Mapping (HD-Mapping™) system, to perform hypothesis-based verification of putative deletions. We demonstrate that whole genome maps, constructed from electronic detection of tagged DNA, hundreds of kilobases in length, can be used effectively to facilitate calling of structural variants ranging in size from 300 base pairs to hundreds of kilobase pairs. SV-Verify implements hypothesis-based verification of putative structural variants using a set of support vector machines and is capable of concurrently testing several thousand independent hypotheses. We describe support vector machine training, utilizing a well-characterized human genome, and application of the resulting classifiers to another human genome, demonstrating high sensitivity and specificity for deletions ≥ 300 base pairs.

July 7, 2019

ALUMINUM RESISTANCE TRANSCRIPTION FACTOR 1 (ART1) contributes to natural variation in aluminum resistance in diverse genetic backgrounds of rice (O. sativa)

Transcription factors (TFs) regulate the expression of other genes to indirectly mediate stress resistance mechanisms. Therefore, when studying TF-mediated stress resistance, it is important to understand how TFs interact with genes in the genetic background. Here, we fine-mapped the aluminum (Al) resistance QTL Alt12.1 to a 44-kb region containing six genes. Among them is ART1, which encodes a C2H2-type zinc finger TF required for Al resistance in rice. The mapping parents, Al-resistant cv Azucena (tropical japonica) and Al-sensitive cv IR64 (indica), have extensive sequence polymorphism within the ART1 coding region, but similar ART1 expression levels. Using reciprocal near-isogenic lines (NILs) we examined how allele-swapping the Alt12.1 locus would affect plant responses to Al. Analysis of global transcriptional responses to Al stress in roots of the NILs alongside their recurrent parents demonstrated that the presence of the Alt12.1 from Al-resistant Azucena led to greater changes in gene expression in response to Al when compared to the Alt12.1 from IR64 in both genetic backgrounds. The presence of the ART1 allele from the opposite parent affected the expression of several genes not previously implicated in rice Al tolerance. We highlight examples where putatively functional variation in cis-regulatory regions of ART1-regulated genes interacts with ART1 to determine gene expression in response to Al. This ART1–promoter interaction may be associated with transgressive variation for Al resistance in the Azucena × IR64 population. These results illustrate how ART1 interacts with the genetic background to contribute to quantitative phenotypic variation in rice Al resistance.

July 7, 2019

Strategies for optimizing BioNano and Dovetail explored through a second reference quality assembly for the legume model, Medicago truncatula.

Third generation sequencing technologies, with sequencing reads in the tens of kilobases, facilitate genome assembly by spanning ambiguous regions and improving continuity. This has been critical for plant genomes, which are difficult to assemble due to high repeat content, gene family expansions, segmental and tandem duplications, and polyploidy. Recently, high-throughput mapping and scaffolding strategies have further improved continuity. Together, these long-range technologies enable quality draft assemblies of complex genomes in a cost-effective and timely manner. Here, we present high quality genome assemblies of the model legume plant, Medicago truncatula (R108) using PacBio, Dovetail Chicago (hereafter, Dovetail) and BioNano technologies. To test these technologies for plant genome assembly, we generated five assemblies using all possible combinations and ordering of these three technologies in the R108 assembly. While the BioNano and Dovetail joins overlapped, they also showed complementary gains in continuity and join numbers. Both technologies spanned repetitive regions that PacBio alone was unable to bridge. Combining technologies, particularly Dovetail followed by BioNano, resulted in notable improvements compared to Dovetail or BioNano alone. A combination of PacBio, Dovetail, and BioNano was used to generate a high quality draft assembly of R108, a M. truncatula accession widely used in studies of functional genomics. As a test for the usefulness of the resulting genome sequence, the new R108 assembly was used to pinpoint breakpoints and characterize flanking sequence of a previously identified translocation between chromosomes 4 and 8, identifying more than 22.7 Mb of novel sequence not present in the earlier A17 reference assembly. Adding Dovetail followed by BioNano data yielded complementary improvements in continuity over the original PacBio assembly.
This strategy proved efficient and cost-effective for developing a quality draft assembly compared to traditional reference assemblies.

July 7, 2019

Resolving multicopy duplications de novo using polyploid phasing

While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both the performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct less than one copy on average.
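The abstract defines reconstruction accuracy as the fraction of reads clustered correctly. Because cluster labels are arbitrary, one reasonable reading is to score the best one-to-one relabelling of predicted clusters against the true paralogs. The sketch below implements that reading by brute force; it is an assumption about the metric rather than the authors' evaluation code, the function name and toy labels are invented, and trying every permutation is only practical for a small number of copies.

```python
from itertools import permutations


def reconstruction_accuracy(truth, predicted, k):
    """Best fraction of reads placed correctly over all cluster relabellings.

    `truth` and `predicted` hold one label in range(k) per read.
    """
    best = 0
    for perm in permutations(range(k)):
        # perm[p] is the true paralog assigned to predicted cluster p.
        correct = sum(perm[p] == t for t, p in zip(truth, predicted))
        best = max(best, correct)
    return best / len(truth)


# Six reads drawn from three paralogs (toy data).
truth = [0, 0, 1, 1, 2, 2]
# Same grouping with clusters 0 and 1 swapped, plus one misplaced read.
predicted = [1, 1, 0, 0, 2, 0]
accuracy = reconstruction_accuracy(truth, predicted, k=3)
```

Swapped cluster names cost nothing under this scoring; only the final read, which was grouped with the wrong paralog, counts as an error.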
