September 22, 2019  |  

Variation graph toolkit improves read mapping by representing genetic variation in the reference.

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual’s genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.
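As a rough illustration of the data structure described above, the following Python sketch models a tiny variation graph as nodes holding sequence plus edges between them, then enumerates the haplotype sequences the graph encodes. It is a simplification under stated assumptions: the graph is directed and acyclic rather than bidirected, and the names `VariationGraph`, `add_node`, `add_edge`, and `haplotypes` are hypothetical and not part of the vg toolkit.

```python
from collections import defaultdict


class VariationGraph:
    """Toy directed acyclic sequence graph (hypothetical; not the vg API)."""

    def __init__(self):
        self.sequences = {}                  # node id -> DNA sequence
        self.successors = defaultdict(list)  # node id -> list of next node ids

    def add_node(self, node_id, sequence):
        self.sequences[node_id] = sequence

    def add_edge(self, source, target):
        self.successors[source].append(target)

    def haplotypes(self, start, end):
        """Return the sequence spelled by every path from start to end."""
        if start == end:
            return [self.sequences[end]]
        spelled = []
        for nxt in self.successors[start]:
            for suffix in self.haplotypes(nxt, end):
                spelled.append(self.sequences[start] + suffix)
        return spelled


# A SNP (A/G) flanked by shared sequence: one locus, two alleles.
graph = VariationGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "A")   # reference allele
graph.add_node(3, "G")   # alternate allele
graph.add_node(4, "TTCA")
for source, target in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(source, target)

alleles = graph.haplotypes(1, 4)
```

A linear reference would keep only one of the two paths; the graph keeps both, so a read carrying either allele can match a path exactly.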


September 22, 2019  |  

Antiviral adaptive immunity and tolerance in the mosquito Aedes aegypti

Mosquitoes spread pathogenic arboviruses while themselves tolerating infection. We here characterize an immunity pathway providing long-term antiviral protection and define how this pathway discriminates between self and non-self. Mosquitoes use viral RNAs to create viral-derived cDNAs (vDNAs) central to the antiviral response. vDNA molecules are acquired through a process of reverse-transcription and recombination directed by endogenous retrotransposons. These vDNAs are thought to integrate in the host genome as endogenous viral elements (EVEs). Sequencing of pre-integrated vDNA revealed that the acquisition process exquisitely distinguishes viral from host RNA, providing one layer of self-nonself discrimination. Importantly, we show EVE-derived piRNAs have antiviral activity and are loaded onto Piwi4 to inhibit virus replication. In a second layer of self-nonself discrimination, Piwi4 preferentially loads EVE-derived piRNAs, discriminating against transposon-targeting piRNAs. Our findings define a fundamental virus-specific immunity pathway in mosquitoes that uses EVEs as a potent and specific antiviral transgenerational mechanism.


September 22, 2019  |  

An introduced crop plant is driving diversification of the virulent bacterial pathogen Erwinia tracheiphila.

Erwinia tracheiphila is the causal agent of bacterial wilt of cucurbits, an economically important phytopathogen affecting few cultivated Cucurbitaceae host plant species in temperate eastern North America. However, essentially nothing is known about E. tracheiphila population structure or genetic diversity. To address this shortcoming, a representative collection of 88 E. tracheiphila isolates was gathered from throughout its geographic range, and their genomes were sequenced. Phylogenomic analysis revealed three genetic clusters with distinct hrpT3SS virulence gene repertoires, host plant association patterns, and geographic distributions. Low genetic heterogeneity within each cluster suggests a recent population bottleneck followed by population expansion. We showed that in the field and greenhouse, cucumber (Cucumis sativus), which was introduced to North America by early Spanish conquistadors, is the most susceptible host plant species and the only species susceptible to isolates from all three lineages. The establishment of large agricultural populations of highly susceptible C. sativus in temperate eastern North America may have facilitated the original emergence of E. tracheiphila into cucurbit agroecosystems, and this introduced plant species may now be acting as a highly susceptible reservoir host. Our findings have broad implications for agricultural sustainability by drawing attention to how worldwide crop plant movement, agricultural intensification, and locally unique environments may affect the emergence, evolution, and epidemic persistence of virulent microbial pathogens.

IMPORTANCE: Erwinia tracheiphila is a virulent phytopathogen that infects two genera of cucurbit crop plants, Cucurbita spp. (pumpkin and squash) and Cucumis spp. (muskmelon and cucumber). One of the unusual ecological traits of this pathogen is that it is limited to temperate eastern North America. Here, we complete the first large-scale sequencing of an E. tracheiphila isolate collection. From phylogenomic, comparative genomic, and empirical analyses, we find that introduced Cucumis spp. crop plants are driving the diversification of E. tracheiphila into multiple lineages. Together, the results from this study show that locally unique biotic (plant population) and abiotic (climate) conditions can drive the evolutionary trajectories of locally endemic pathogens in unexpected ways. Copyright © 2018 Shapiro et al.


September 22, 2019  |  

Targeted genotyping of variable number tandem repeats with adVNTR.

Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and also structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping Variable Number Tandem Repeats (VNTRs), composed of inexact tandem duplications of short (6-100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets. © 2018 Bakhtiari et al.; Published by Cold Spring Harbor Laboratory Press.


September 22, 2019  |  

TranSurVeyor: an improved database-free algorithm for finding non-reference transpositions in high-throughput sequencing data.

Transpositions transfer DNA segments between different loci within a genome; in particular, when a transposition is found in a sample but not in a reference genome, it is called a non-reference transposition. They are important structural variations that have clinical impact. Transpositions can be called by analyzing second generation high-throughput sequencing datasets. Current methods follow either a database-based or a database-free approach. Database-based methods require a database of transposable elements. Some of them have good specificity; however, this approach cannot detect novel transpositions, and it requires a good database of transposable elements, which is not yet available for many species. Database-free methods perform de novo calling of transpositions, but their accuracy is low. We observe that this is due to the misalignment of the reads; since reads are short and the human genome has many repeats, false alignments create false positive predictions while missing alignments reduce the true positive rate. This paper proposes new techniques to improve database-free non-reference transposition calling: first, we propose a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats; second, we propose an SNV-aware filter that removes some incorrectly aligned reads. By combining these two techniques and other techniques like clustering and a positive-to-negative ratio filter, our proposed transposition caller TranSurVeyor shows at least a 3.1-fold improvement in terms of F1-score over existing database-free methods. More importantly, even though TranSurVeyor does not use databases of prior information, its performance is at least as good as existing database-based methods such as MELT, Mobster and Retroseq. We also illustrate that TranSurVeyor can discover transpositions that are not known in the current database.
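The abstract's point that false alignments add false positives while missing alignments lower the true positive rate can be made concrete with the standard F1-score definition. The sketch below is only illustrative: the function name `f1_score` is hypothetical and the counts are made-up numbers, not results from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the paper).
# A caller hurt by misaligned reads: many false and many missed calls.
noisy_caller = f1_score(true_positives=40, false_positives=60, false_negatives=60)
# The same truth set after removing false calls and recovering missed ones.
cleaner_caller = f1_score(true_positives=80, false_positives=20, false_negatives=20)
```

Because F1 combines precision and recall, a caller has to reduce both error types to improve it, which is why the abstract pairs a realignment step with a filtering step.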


September 22, 2019  |  

Noise-Cancelling Repeat Finder: Uncovering tandem repeats in error-prone long-read sequencing data

Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.


September 22, 2019  |  

Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools.

We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively-parallel DNA sequencing platform. The original Venter genome sequence is a very-high quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200 bp and 350 bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages respectively) as well as 2 kb, 5 kb, and 12 kb mate-pair libraries (49x, 122x, and 145x physical coverages respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
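The abstract distinguishes genomic (sequence) coverage from physical coverage. A minimal sketch of the usual definitions follows, assuming paired-end libraries with two reads per fragment; the function names and the example numbers are hypothetical and are not taken from the datasets described above.

```python
def sequence_coverage(pair_count, read_length, genome_size):
    """Average number of sequenced bases per genome position (two reads per pair)."""
    return pair_count * 2 * read_length / genome_size


def physical_coverage(pair_count, insert_size, genome_size):
    """Average number of paired fragments spanning each genome position."""
    return pair_count * insert_size / genome_size


# Illustrative toy values: a 1 Mb genome, 10,000 read pairs of 2 x 100 bp,
# drawn from fragments with a 2 kb insert size.
toy_sequence = sequence_coverage(pair_count=10_000, read_length=100, genome_size=1_000_000)
toy_physical = physical_coverage(pair_count=10_000, insert_size=2_000, genome_size=1_000_000)
```

With long inserts the physical coverage can far exceed the sequence coverage, which is consistent with mate-pair libraries being reported by physical coverage.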


September 22, 2019  |  

Report from the Killer-cell Immunoglobulin-like Receptors (KIR) component of the 17th International HLA and Immunogenetics Workshop.

The goals of the KIR component of the 17th International HLA and Immunogenetics Workshop (IHIW) were to encourage and educate researchers to begin analyzing KIR at allelic resolution, and to survey the nature and extent of KIR allelic diversity across human populations. To represent worldwide diversity, we analyzed 1269 individuals from ten populations, focusing on the most polymorphic KIR genes, which express receptors having three immunoglobulin (Ig)-like domains (KIR3DL1/S1, KIR3DL2 and KIR3DL3). We identified 13 novel alleles of KIR3DL1/S1, 13 of KIR3DL2 and 18 of KIR3DL3. Previously identified alleles, corresponding to 33 alleles of KIR3DL1/S1, 38 of KIR3DL2, and 43 of KIR3DL3, represented over 90% of the observed allele frequencies for these genes. In total we observed 37 KIR3DL1/S1 allotypes, 40 for KIR3DL2 and 44 for KIR3DL3. As KIR allotype diversity can affect NK cell function, this demonstrates potential for high functional diversity worldwide. Allelic variation further diversifies KIR haplotypes. We determined KIR3DL3 ~ KIR3DL1/S1 ~ KIR3DL2 haplotypes from five of the studied populations, and observed multiple population-specific haplotypes in each. This included 234 distinct haplotypes in European Americans, 191 in Ugandans, 35 in Papuans, 95 in Egyptians and 86 in Spanish populations. For another 35 populations, encompassing 642,105 individuals we focused on KIR3DL2 and identified another 375 novel alleles, with approximately half of them observed in more than one individual. The KIR allelic level data gathered from this project represents the most comprehensive summary of global KIR allelic diversity to date, and continued analysis will improve understanding of KIR allelic polymorphism in global populations. Further, the wealth of new data gathered in the course of this workshop component highlights the value of collaborative, community-based efforts in immunogenetics research, exemplified by the IHIW. Copyright © 2018. Published by Elsevier Inc.


September 22, 2019  |  

The genomic landscape of molecular responses to natural drought stress in Panicum hallii

Environmental stress is a major driver of ecological community dynamics and agricultural productivity. This is especially true for soil water availability, because drought is the greatest abiotic inhibitor of worldwide crop yields. Here, we test the genetic basis of drought responses in the genetic model for C4 perennial grasses, Panicum hallii, through population genomics, field-scale gene-expression (eQTL) analysis, and comparison of two complete genomes. While gene expression networks are dominated by local cis-regulatory elements, we observe three genomic hotspots of unlinked trans-regulatory loci. These regulatory hubs are four times more drought responsive than the genome-wide average. Additionally, cis- and trans-regulatory networks are more likely to have opposing effects than expected under neutral evolution, supporting a strong influence of compensatory evolution and stabilizing selection. These results implicate trans-regulatory evolution as a driver of drought responses and demonstrate the potential for crop improvement in drought-prone regions through modification of gene regulatory networks.


September 22, 2019  |  

Integrative haplotype estimation with sub-linear complexity

The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here, we present a new method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear scaling with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format on https://odelaneau.github.io/shapeit4/ and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.


September 21, 2019  |  

Divergent selection causes whole genome differentiation without physical linkage among the targets in Spodoptera frugiperda (Noctuidae)

The process of speciation involves whole genome differentiation by overcoming gene flow between diverging populations. We have ample knowledge of which evolutionary forces may cause genomic differentiation, and several speciation models have been proposed to explain the transition from genetic to genomic differentiation. However, it is still unclear which conditions are critical for enabling genomic differentiation in nature. The Fall armyworm, Spodoptera frugiperda, is observed as two sympatric strains that have different host-plant ranges, suggesting the possibility of ecological divergent selection. In our previous study, we observed that these two strains show genetic differentiation across the whole genome to an unprecedentedly low extent, suggesting the possibility that whole genome sequences have started to differentiate between the strains. In this study, we analyzed whole genome sequences from these two strains from Mississippi to identify critical evolutionary factors for genomic differentiation. The genomic Fst is low (0.017) while 91.3% of 10kb windows have Fst greater than 0, suggesting genome-wide differentiation to a low extent. We identified nearly 400 outliers of genetic differentiation between strains, and found that physical linkage among these outliers is not a primary cause of genomic differentiation. Fst is not significantly correlated with gene density, a proxy for the strength of selection, suggesting that a genomic reduction in migration rate dominates the extent of local genetic differentiation. Our analyses reveal that divergent selection alone is sufficient to generate genomic differentiation, and any following diversifying factors may increase the level of genetic differentiation between diverging strains in the process of speciation.
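To show how a windowed summary such as "the share of windows with Fst greater than 0" can be computed, here is a minimal sketch. The abstract does not say which Fst estimator the authors used, so this example swaps in Hudson's estimator with the sample-size correction and ratio-of-averages form commonly attributed to Bhatia et al. (2013); that choice, the function names, and the toy allele frequencies are all assumptions for illustration.

```python
def hudson_fst(sites):
    """Ratio-of-averages Hudson FST for one window.

    `sites` is a list of (p1, n1, p2, n2) tuples: allele frequency and number
    of sampled chromosomes in each of the two populations (assumed inputs).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, n1, p2, n2 in sites:
        numerator += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator > 0 else 0.0


def fraction_positive(window_values):
    """Share of windows whose FST estimate is greater than zero."""
    return sum(1 for value in window_values if value > 0) / len(window_values)


# Toy windows (made-up frequencies): one fixed difference, one identical site.
fixed_difference = hudson_fst([(1.0, 10, 0.0, 10)])
no_difference = hudson_fst([(0.5, 10, 0.5, 10)])
share_positive = fraction_positive([fixed_difference, no_difference])
```

An estimator with a sample-size correction can return slightly negative values when two populations do not differ, which is what makes the fraction of windows above zero informative.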


September 21, 2019  |  

Discovery and genotyping of structural variation from long-read haploid genome sequence data.

In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection. © 2017 Huddleston et al.; Published by Cold Spring Harbor Laboratory Press.


July 19, 2019  |  

Palindromic GOLGA8 core duplicons promote chromosome 15q13.3 microdeletion and evolutionary instability.

Recurrent deletions of chromosome 15q13.3 associate with intellectual disability, schizophrenia, autism and epilepsy. To gain insight into the instability of this region, we sequenced it in affected individuals, normal individuals and nonhuman primates. We discovered five structural configurations of the human chromosome 15q13.3 region ranging in size from 2 to 3 Mb. These configurations arose recently (~0.5-0.9 million years ago) as a result of human-specific expansions of segmental duplications and two independent inversion events. All inversion breakpoints map near GOLGA8 core duplicons, a ~14-kb primate-specific chromosome 15 repeat that became organized into larger palindromic structures. GOLGA8-flanked palindromes also demarcate the breakpoints of recurrent 15q13.3 microdeletions, the expansion of chromosome 15 segmental duplications in the human lineage and independent structural changes in apes. The significant clustering (P = 0.002) of breakpoints provides mechanistic evidence for the role of this core duplicon and its palindromic architecture in promoting the evolutionary and disease-related instability of chromosome 15.


July 19, 2019  |  

Genetic stabilization of the drug-resistant PMEN1 Pneumococcus lineage by its distinctive DpnIII restriction-modification system.

The human pathogen Streptococcus pneumoniae (pneumococcus) exhibits a high degree of genomic diversity and plasticity. Isolates with high genomic similarity are grouped into lineages that undergo homologous recombination at variable rates. PMEN1 is a pandemic, multidrug-resistant lineage. Heterologous gene exchange between PMEN1 and non-PMEN1 isolates is directional, with extensive gene transfer from PMEN1 strains and only modest transfer into PMEN1 strains. Restriction-modification (R-M) systems can restrict horizontal gene transfer, yet most pneumococcal strains code for either the DpnI or DpnII R-M system and neither limits homologous recombination. Our comparative genomic analysis revealed that PMEN1 isolates code for DpnIII, a third R-M system syntenic to the other Dpn systems. Characterization of DpnIII demonstrated that the endonuclease cleaves unmethylated double-stranded DNA at the tetramer sequence 5′ GATC 3′, and the cognate methylase is a C5 cytosine-specific DNA methylase. We show that DpnIII decreases the frequency of recombination under in vitro conditions, such that the number of transformants is lower for strains transformed with unmethylated DNA than in those transformed with cognately methylated DNA. Furthermore, we have identified two PMEN1 isolates where the DpnIII endonuclease is disrupted, and phylogenetic work by Croucher and colleagues suggests that these strains have accumulated genomic differences at a higher rate than other PMEN1 strains. We propose that the R-M locus is a major determinant of genetic acquisition; the resident R-M system governs the extent of genome plasticity.

Pneumococcus is one of the most important community-acquired bacterial pathogens. Pneumococcal strains can develop resistance to antibiotics and to serotype vaccines by acquiring genes from other strains or species. Thus, genomic plasticity is associated with strain adaptability and pneumococcal success. PMEN1 is a widespread and multidrug-resistant highly pathogenic pneumococcal lineage, which has evolved over the past century and displays a relatively stable genome. In this study, we characterize DpnIII, a restriction-modification (R-M) system that limits recombination. DpnIII is encountered in the PMEN1 lineage, where it replaces other R-M systems that do not decrease plasticity. Our hypothesis is that this genomic region, where different pneumococcal lineages code for variable R-M systems, plays a role in the fine-tuning of the extent of genomic plasticity. It is possible that well-adapted lineages such as PMEN1 have a mechanism to increase genomic stability, rather than foster genomic plasticity. Copyright © 2015 Eutsey et al.


July 19, 2019  |  

Whole genome?

The reference human genome assembly is remarkable in its completeness and usefulness in research. However, the range of allelic variation in the human population is not well described by a haploid assembly with a profusion of alternative loci. Homozygous regions and the use of multiple sequencing technologies increasingly have roles in strategies for identifying regulatory and trait-associated variation.

