Menu
July 7, 2019

Dissecting a hidden gene duplication: the Arabidopsis thaliana SEC10 locus.

Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.


July 7, 2019

Characterization of structural variants with single molecule and hybrid sequencing approaches.

Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent ‘third-generation’ sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates.We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019

An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella.

Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.


July 7, 2019

Pseudoautosomal region 1 length polymorphism in the human population.

The human sex chromosomes differ in sequence, except for the pseudoautosomal regions (PAR) at the terminus of the short and the long arms, denoted as PAR1 and PAR2. The boundary between PAR1 and the unique X and Y sequences was established during the divergence of the great apes. During a copy number variation screen, we noted a paternally inherited chromosome X duplication in 15 independent families. Subsequent genomic analysis demonstrated that an insertional translocation of X chromosomal sequence into theMa Y chromosome generates an extended PAR. The insertion is generated by non-allelic homologous recombination between a 548 bp LTR6B repeat within the Y chromosome PAR1 and a second LTR6B repeat located 105 kb from the PAR boundary on the X chromosome. The identification of the reciprocal deletion on the X chromosome in one family and the occurrence of the variant in different chromosome Y haplogroups demonstrate this is a recurrent genomic rearrangement in the human population. This finding represents a novel mechanism shaping sex chromosomal evolution.


July 7, 2019

Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement.

Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired end data from two Illumina libraries with small e.g., 180 bp and large e.g., 3-5 Kb inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.


July 7, 2019

ProbAlign: a re-alignment method for long sequencing reads

The incorrect alignments are a severe problem in variant calling, and remain as a challenge computational issue in Bioinformatics field. Although there have been some methods utilizing the re-alignment approach to tackle the misalignments, a standalone re-alignment tool for long sequencing reads is lacking. Hence, we present a standalone tool to correct the misalignments, called ProbAlign. It can be integrated into the pipelines of not only variant calling but also other genomic applications. We demonstrate the use of re-alignment in two diverse and important genomics fields: variant calling and viral quasispecies reconstruction. First, variant calling results in the Pacific Biosciences SMRT re-sequencing data of NA12878 show that false positives can be reduced by 43.5%, and true positives can be increased by 24.8% averagely, after re-alignment. Second, results in reconstructing a 5-virus-mix show that the viral population can be completely unraveled, and also the estimation of quasispecies frequencies has been improved, after re-alignment. ProbAlign is freely available in the PyroTools toolkit (https://github.com/homopolymer/PyroTools).


July 7, 2019

The challenges and importance of structural variation detection in livestock.

Recent studies in humans and other model organisms have demonstrated that structural variants (SVs) comprise a substantial proportion of variation among individuals of each species. Many of these variants have been linked to debilitating diseases in humans, thereby cementing the importance of refining methods for their detection. Despite progress in the field, reliable detection of SVs still remains a problem even for human subjects. Many of the underlying problems that make SVs difficult to detect in humans are amplified in livestock species, whose lower quality genome assemblies and incomplete gene annotation can often give rise to false positive SV discoveries. Regardless of the challenges, SV detection is just as important for livestock researchers as it is for human researchers, given that several productive traits and diseases have been linked to copy number variations (CNVs) in cattle, sheep, and pig. Already, there is evidence that many beneficial SVs have been artificially selected in livestock such as a duplication of the agouti signaling protein gene that causes white coat color in sheep. In this review, we will list current SV and CNV discoveries in livestock and discuss the problems that hinder routine discovery and tracking of these polymorphisms. We will also discuss the impacts of selective breeding on CNV and SV frequencies and mention how SV genotyping could be used in the future to improve genetic selection.


July 7, 2019

Evolution and diversity of copy number variation in the great ape lineage.

Copy number variation (CNV) contributes to disease and has restructured the genomes of great apes. The diversity and rate of this process, however, have not been extensively explored among great ape lineages. We analyzed 97 deeply sequenced great ape and human genomes and estimate 16% (469 Mb) of the hominid genome has been affected by recent CNV. We identify a comprehensive set of fixed gene deletions (n = 340) and duplications (n = 405) as well as >13.5 Mb of sequence that has been specifically lost on the human lineage. We compared the diversity and rates of copy number and single nucleotide variation across the hominid phylogeny. We find that CNV diversity partially correlates with single nucleotide diversity (r(2) = 0.5) and recapitulates the phylogeny of apes with few exceptions. Duplications significantly outpace deletions (2.8-fold). The load of segregating duplications remains significantly higher in bonobos, Western chimpanzees, and Sumatran orangutans-populations that have experienced recent genetic bottlenecks (P = 0.0014, 0.02, and 0.0088, respectively). The rate of fixed deletion has been more clocklike with the exception of the chimpanzee lineage, where we observe a twofold increase in the chimpanzee-bonobo ancestor (P = 4.79 × 10(-9)) and increased deletion load among Western chimpanzees (P = 0.002). The latter includes the first genomic disorder in a chimpanzee with features resembling Smith-Magenis syndrome mediated by a chimpanzee-specific increase in segmental duplication complexity. We hypothesize that demographic effects, such as bottlenecks, have contributed to larger and more gene-rich segments being deleted in the chimpanzee lineage and that this effect, more generally, may account for episodic bursts in CNV during hominid evolution.


July 7, 2019

Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers.

Herein we present the applicability of single-molecule (PacBio RS) and second-generation sequencing technology (Illumina) to the characterization of large genomic deletions. By testing samples previously characterized using a Sanger approach, our methods determined that both next-generation sequencing platforms were able to identify the position of deletion breakpoints. Our results point out various advantages of next-generation sequencing platforms when characterizing genomic deletions; however, special attention must be dedicated to identical sequences flanking the breakpoints, such as poly(N) motifs.


July 7, 2019

Mutation in the C-di-AMP cyclase dacA affects fitness and resistance of methicillin resistant Staphylococcus aureus.

Faster growing and more virulent strains of methicillin resistant Staphylococcus aureus (MRSA) are increasingly displacing highly resistant MRSA. Elevated fitness in these MRSA is often accompanied by decreased and heterogeneous levels of methicillin resistance; however, the mechanisms for this phenomenon are not yet fully understood. Whole genome sequencing was used to investigate the genetic basis of this apparent correlation, in an isogenic MRSA strain pair that differed in methicillin resistance levels and fitness, with respect to growth rate. Sequencing revealed only one single nucleotide polymorphism (SNP) in the diadenylate cyclase gene dacA in the faster growing but less resistant strain. Diadenylate cyclases were recently discovered to synthesize the new second messenger cyclic diadenosine monophosphate (c-di-AMP). Introduction of this mutation into the highly resistant but slower growing strain reduced resistance and increased its growth rate, suggesting a direct connection between the dacA mutation and the phenotypic differences of these strains. Quantification of cellular c-di-AMP revealed that the dacA mutation decreased c-di-AMP levels resulting in reduced autolysis, increased salt tolerance and a reduction in the basal expression of the cell wall stress stimulon. These results indicate that c-di-AMP affects cell envelope-related signalling in S. aureus. The influence of c-di-AMP on growth rate and methicillin resistance in MRSA indicate that altering c-di-AMP levels could be a mechanism by which MRSA strains can increase their fitness levels by reducing their methicillin resistance levels.


July 7, 2019

Use of four next-generation sequencing platforms to determine HIV-1 coreceptor tropism.

HIV-1 coreceptor tropism assays are required to rule out the presence of CXCR4-tropic (non-R5) viruses prior treatment with CCR5 antagonists. Phenotypic (e.g., Trofile™, Monogram Biosciences) and genotypic (e.g., population sequencing linked to bioinformatic algorithms) assays are the most widely used. Although several next-generation sequencing (NGS) platforms are available, to date all published deep sequencing HIV-1 tropism studies have used the 454™ Life Sciences/Roche platform. In this study, HIV-1 co-receptor usage was predicted for twelve patients scheduled to start a maraviroc-based antiretroviral regimen. The V3 region of the HIV-1 env gene was sequenced using four NGS platforms: 454™, PacBio® RS (Pacific Biosciences), Illumina®, and Ion Torrent™ (Life Technologies). Cross-platform variation was evaluated, including number of reads, read length and error rates. HIV-1 tropism was inferred using Geno2Pheno, Web PSSM, and the 11/24/25 rule and compared with Trofile™ and virologic response to antiretroviral therapy. Error rates related to insertions/deletions (indels) and nucleotide substitutions introduced by the four NGS platforms were low compared to the actual HIV-1 sequence variation. Each platform detected all major virus variants within the HIV-1 population with similar frequencies. Identification of non-R5 viruses was comparable among the four platforms, with minor differences attributable to the algorithms used to infer HIV-1 tropism. All NGS platforms showed similar concordance with virologic response to the maraviroc-based regimen (75% to 80% range depending on the algorithm used), compared to Trofile (80%) and population sequencing (70%). In conclusion, all four NGS platforms were able to detect minority non-R5 variants at comparable levels suggesting that any NGS-based method can be used to predict HIV-1 coreceptor usage.


July 7, 2019

Structure of the type IV secretion system in different strains of Anaplasma phagocytophilum.

Anaplasma phagocytophilum is an intracellular organism in the Order Rickettsiales that infects diverse animal species and is causing an emerging disease in humans, dogs and horses. Different strains have very different cell tropisms and virulence. For example, in the U.S., strains have been described that infect ruminants but not dogs or rodents. An intriguing question is how the strains of A. phagocytophilum differ and what different genome loci are involved in cell tropisms and/or virulence. Type IV secretion systems (T4SS) are responsible for translocation of substrates across the cell membrane by mechanisms that require contact with the recipient cell. They are especially important in organisms such as the Rickettsiales which require T4SS to aid colonization and survival within both mammalian and tick vector cells. We determined the structure of the T4SS in 7 strains from the U.S. and Europe and revised the sequence of the repetitive virB6 locus of the human HZ strain.Although in all strains the T4SS conforms to the previously described split loci for vir genes, there is great diversity within these loci among strains. This is particularly evident in the virB2 and virB6 which are postulated to encode the secretion channel and proteins exposed on the bacterial surface. VirB6-4 has an unusual highly repetitive structure and can have a molecular weight greater than 500,000. For many of the virs, phylogenetic trees position A. phagocytophilum strains infecting ruminants in the U.S. and Europe distant from strains infecting humans and dogs in the U.S.Our study reveals evidence of gene duplication and considerable diversity of T4SS components in strains infecting different animals. The diversity in virB2 is in both the total number of copies, which varied from 8 to 15 in the herein characterized strains, and in the sequence of each copy. The diversity in virB6 is in the sequence of each of the 4 copies in the single locus and the presence of varying numbers of repetitive units in virB6-3 and virB6-4. These data suggest that the T4SS should be investigated further for a potential role in strain virulence of A. phagocytophilum.


July 7, 2019

A hybrid approach for the automated finishing of bacterial genomes.

Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.


July 7, 2019

Cancer genomics: technology, discovery, and translation.

In recent years, the increasing awareness that somatic mutations and other genetic aberrations drive human malignancies has led us within reach of personalized cancer medicine (PCM). The implementation of PCM is based on the following premises: genetic aberrations exist in human malignancies; a subset of these aberrations drive oncogenesis and tumor biology; these aberrations are actionable (defined as having the potential to affect management recommendations based on diagnostic, prognostic, and/or predictive implications); and there are highly specific anticancer agents available that effectively modulate these targets. This article highlights the technology underlying cancer genomics and examines the early results of genome sequencing and the challenges met in the discovery of new genetic aberrations. Finally, drawing from experiences gained in a feasibility study of somatic mutation genotyping and targeted exome sequencing led by Princess Margaret Hospital-University Health Network and the Ontario Institute for Cancer Research, the processes, challenges, and issues involved in the translation of cancer genomics to the clinic are discussed.


July 7, 2019

Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011.

The degree to which molecular epidemiology reveals information about the sources and transmission patterns of an outbreak depends on the resolution of the technology used and the samples studied. Isolates of Escherichia coli O104:H4 from the outbreak centered in Germany in May-July 2011, and the much smaller outbreak in southwest France in June 2011, were indistinguishable by standard tests. We report a molecular epidemiological analysis using multiplatform whole-genome sequencing and analysis of multiple isolates from the German and French outbreaks. Isolates from the German outbreak showed remarkably little diversity, with only two single nucleotide polymorphisms (SNPs) found in isolates from four individuals. Surprisingly, we found much greater diversity (19 SNPs) in isolates from seven individuals infected in the French outbreak. The German isolates form a clade within the more diverse French outbreak strains. Moreover, five isolates derived from a single infected individual from the French outbreak had extremely limited diversity. The striking difference in diversity between the German and French outbreak samples is consistent with several hypotheses, including a bottleneck that purged diversity in the German isolates, variation in mutation rates in the two E. coli outbreak populations, or uneven distribution of diversity in the seed populations that led to each outbreak.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.