July 7, 2019  |  

Speeding up DNA sequence alignment by optical correlator

In electronic computers, the extensive computation required to search biological sequences in large databases leads to vast energy consumption for electrical processing and cooling. Optical processing, by contrast, is much faster than its electrical counterpart, owing to its parallel processing capability, at a fraction of the energy consumption and cost. In this regard, this paper proposes a correlation-based optical algorithm using metamaterials, taking advantage of optical parallel processing, to efficiently locate edits as a means of DNA sequence comparison. Specifically, the proposed algorithm partitions the read DNA sequence into multiple overlapping intervals, referred to as windows, and then extracts the peaks resulting from their cross-correlation with the reference sequence in parallel. Finally, to locate the edits, a simple algorithm utilizing the number and location of the peaks is introduced to analyze the correlation outputs obtained from window-based DNA sequence comparison. As a novel implementation approach, we adopt multiple metamaterial-based optical correlators to optically implement the proposed parallel architecture, named the Window-based Optical Correlator (WOC). This wave-based computing architecture fully controls wave transmission and phase using dielectric and plasmonic materials. Design limitations and challenges of the proposed architecture are also discussed in detail. The simulation results, comparing WOC with the well-known BLAST algorithm, demonstrate a speed-up of up to 60%, as well as high accuracy even in the presence of a large number of edits. Also, the WOC method considerably reduces power consumption as a result of implementing a metamaterial-based optical computing structure.
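The window-based comparison described in this abstract can be illustrated in software. The sketch below is only a rough emulation under stated assumptions: it is not the optical, metamaterial-based implementation, and the function names (`match_profile`, `window_peaks`, `flag_edited_windows`), the window size and the step are hypothetical choices rather than values from the paper. Each window of the read is cross-correlated with the reference by counting matching bases at every offset, and the peak of that profile is recorded. A window whose peak falls short of the window length is flagged as containing an edit.

```python
def match_profile(window, reference):
    """Cross-correlate a window with the reference.

    score[i] is the number of matching bases when the window is placed
    at offset i of the reference.
    """
    w = len(window)
    return [
        sum(1 for a, b in zip(window, reference[i:i + w]) if a == b)
        for i in range(len(reference) - w + 1)
    ]


def window_peaks(read, reference, window=8, step=4):
    """Return (read_start, best_reference_offset, peak_score) per window.

    Windows overlap whenever step < window. Ties are resolved in favour
    of the leftmost reference offset.
    """
    peaks = []
    for start in range(0, len(read) - window + 1, step):
        profile = match_profile(read[start:start + window], reference)
        best = max(range(len(profile)), key=profile.__getitem__)
        peaks.append((start, best, profile[best]))
    return peaks


def flag_edited_windows(peaks, window=8):
    """Read-start positions of windows whose peak is below a perfect match."""
    return [start for start, _, score in peaks if score < window]


reference = "ACGTTGCAAGCTTAGGCTAACGTCAGT"
read = reference[4:20]                   # exact copy of a reference segment
edited = read[:1] + "A" + read[2:]       # one substitution at read position 1

exact_peaks = window_peaks(read, reference)
edited_peaks = window_peaks(edited, reference)
edited_windows = flag_edited_windows(edited_peaks)
```

In this toy input only the first window overlaps the substituted base, so it is the only one flagged. In the paper the windows are processed in parallel by separate optical correlators; the loop here is sequential purely for readability.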


July 7, 2019  |  

A universal SNP and small-indel variant caller using deep neural networks.

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.


July 7, 2019  |  

Spalter: A meta machine learning approach to distinguish true DNA variants from sequencing artefacts

Being able to distinguish between true DNA variants and technical sequencing artefacts is a fundamental task in whole genome, exome or targeted gene analysis. Variant calling tools provide diagnostic parameters, such as strand bias or an aggregated overall quality for each called variant, to help users make an informed choice about which variants to accept or discard. Having several such quality indicators poses a problem for the users of variant callers because they need to set or adjust thresholds for each such indicator. Alternatively, machine learning methods can be used to train a classifier based on these indicators. This approach needs large sets of labeled training data, which are not easily available. The new approach presented here relies on the idea that a true DNA variant exists independently of technical features of the read in which it appears (e.g. base quality, strand, position in the read). Therefore the nucleotide separability classification problem – predicting the nucleotide state of each read in a given pileup based on technical features only – should be near impossible to solve for true variants. Nucleotide separability, i.e. achievable classification accuracy, can either be used to distinguish between true variants and technical artefacts directly, using a thresholding approach, or it can be used as a meta-feature to train a separability-based classifier. This article explores both possibilities with promising results, showing accuracies around 90%.


July 7, 2019  |  

STRetch: detecting and discovering pathogenic short tandem repeat expansions.

Short tandem repeat (STR) expansions have been identified as the causal DNA mutation in dozens of Mendelian diseases. Most existing tools for detecting STR variation with short reads do so within the read length and so are unable to detect the majority of pathogenic expansions. Here we present STRetch, a new genome-wide method to scan for STR expansions at all loci across the human genome. We demonstrate the use of STRetch for detecting STR expansions using short-read whole-genome sequencing data at known pathogenic loci as well as novel STR loci. STRetch is open source software, available from github.com/Oshlack/STRetch .


July 7, 2019  |  

Genomics, GPCRs and new targets for the control of insect pests and vectors.

The pressing need for new pest control products with novel modes of action has spawned interest in small molecules and peptides targeting arthropod GPCRs. Genome sequence data and tools for reverse genetics have enabled the prediction and characterization of GPCRs from many invertebrates. We review recent work to identify, characterize and de-orphanize arthropod GPCRs, with a focus on studies that reveal exciting new functional roles for these receptors, including the regulation of metabolic resistance. We explore the potential for insecticides targeting Class A biogenic amine-binding and peptide-binding receptors, and consider the innovation required to generate pest-selective leads for development, within the context of new PCR-targeting products to control arthropod vectors of disease. Copyright © 2018. Published by Elsevier Inc.


July 7, 2019  |  

Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations

Motivation: A long-standing limitation in comparative genomic studies is the dependency on a reference genome, which hinders the spectrum of genetic diversity that can be identified across a population of organisms. This is especially true in the microbial world where genome architectures can significantly vary. There is therefore a need for computational methods that can simultaneously analyze the architectures of multiple genomes without introducing bias from a reference. Results: In this article, we present Ptolemy: a novel method for studying the diversity of genome architectures—such as structural variation and pan-genomes—across a collection of microbial assemblies without the need for a reference. Ptolemy is a ‘top-down’ approach to compare whole genome assemblies. Genomes are represented as labeled multi-directed graphs—known as quivers—which are then merged into a single, canonical quiver by identifying ‘gene anchors’ via synteny analysis. The canonical quiver represents an approximate, structural alignment of all genomes in a given collection encoding structural variation across (sub-) populations within the collection. We highlight various applications of Ptolemy by analyzing structural variation and the pan-genomes of different datasets composed of Mycobacterium, Saccharomyces, Escherichia and Shigella species. Our results show that Ptolemy is flexible and can handle both conserved and highly dynamic genome architectures. Ptolemy is user-friendly—requires only FASTA-formatted assembly along with a corresponding GFF-formatted file—and resource-friendly—can align 24 genomes in ~10 mins with four CPUs and <2 GB of RAM.


July 7, 2019  |  

Measuring the mappability spectrum of reference genome assemblies

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject’s genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools we computed the “mappability spectrum” for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
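The quantity this abstract describes, the minimum read length needed to map each reference position uniquely, can be illustrated with a brute-force sketch. This is not NEAT-Repeat's code or algorithm: it assumes an error rate of zero (exact matching only), considers the forward strand alone, and uses hypothetical function names. A read starting at position i is treated as uniquely mappable once its sequence occurs exactly once in the reference.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def min_unique_length(reference):
    """Smallest exact-match read length that maps uniquely from each position.

    Returns one entry per start position; None means that no read starting
    there (and ending within the reference) is unique.
    """
    n = len(reference)
    result = []
    for i in range(n):
        best = None
        for length in range(1, n - i + 1):
            if count_occurrences(reference, reference[i:i + length]) == 1:
                best = length
                break
        result.append(best)
    return result


spectrum = min_unique_length("ACACGT")
```

For the toy reference `ACACGT`, positions inside the repeated `AC` need reads of length 2 or 3, while the unique `G` and `T` need only length 1. The brute-force scan is far too slow for real genomes; the abstract notes that NEAT-Repeat relies on efficient edit distance computation and parallel jobs instead.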


July 7, 2019  |  

Recombination hotspots in an extended human pseudoautosomal domain predicted from double-strand break maps and characterized by sperm-based crossover analysis.

The human X and Y chromosomes are heteromorphic but share a region of homology at the tips of their short arms, pseudoautosomal region 1 (PAR1), that supports obligate crossover in male meiosis. Although the boundary between pseudoautosomal and sex-specific DNA has traditionally been regarded as conserved among primates, it was recently discovered that the boundary position varies among human males, due to a translocation of ~110 kb from the X to the Y chromosome that creates an extended PAR1 (ePAR). This event has occurred at least twice in human evolution. So far, only limited evidence has been presented to suggest this extension is recombinationally active. Here, we sought direct proof by examining thousands of gametes from each of two ePAR-carrying men, for two subregions chosen on the basis of previously published male X-chromosomal meiotic double-strand break (DSB) maps. Crossover activity comparable to that seen at autosomal hotspots was observed between the X and the ePAR borne on the Y chromosome both at a distal and a proximal site within the 110-kb extension. Other hallmarks of classic recombination hotspots included evidence of transmission distortion and GC-biased gene conversion. We observed good correspondence between the male DSB clusters and historical recombination activity of this region in the X chromosomes of females, as ascertained from linkage disequilibrium analysis; this suggests that this region is similarly primed for crossover in both male and female germlines, although sex-specific differences may also exist. Extensive resequencing and inference of ePAR haplotypes, placed in the framework of the Y phylogeny as ascertained by both Y microsatellites and single nucleotide polymorphisms, allowed us to estimate a minimum rate of crossover over the entire ePAR region that is 6-fold greater than the genome average, comparable with pedigree estimates of PAR1 activity generally.
We conclude ePAR very likely contributes to the critical crossover function of PAR1.


July 7, 2019  |  

Picky comprehensively detects high-resolution structural variants in nanopore long reads.

Acquired genomic structural variants (SVs) are major hallmarks of cancer genomes, but they are challenging to reconstruct from short-read sequencing data. Here we exploited the long reads of the nanopore platform using our customized pipeline, Picky ( https://github.com/TheJacksonLaboratory/Picky ), to reveal SVs of diverse architecture in a breast cancer model. We identified the full spectrum of SVs with superior specificity and sensitivity relative to short-read analyses, and uncovered repetitive DNA as the major source of variation. Examination of genome-wide breakpoints at nucleotide resolution uncovered micro-insertions as the common structural features associated with SVs. Breakpoint density across the genome is associated with the propensity for interchromosomal connectivity and was found to be enriched in promoters and transcribed regions of the genome. Furthermore, we observed an over-representation of reciprocal translocations from chromosomal double-crossovers through phased SVs. We demonstrate that Picky analysis is an effective tool for comprehensive detection of SVs in cancer genomes from long-read data.


July 7, 2019  |  

Overview of the germline and expressed repertoires of the TRB genes in Sus scrofa.

The α/β T cell receptor (TR) is a complex heterodimer that recognizes antigenic peptides and binds to major histocompatibility complex (MH) molecules. Both α and β chains are encoded by different genes localized on two distinct chromosomal loci: TRA and TRB. The present study employed the recent release of the swine genome assembly to define the genomic organization of the TRB locus. According to the sequencing data, the pig TRB locus spans approximately 400 kb of genomic DNA and consists of 38 TRBV genes belonging to 24 subgroups located upstream of three in tandem TRBD-J-C clusters, which are followed by a TRBV gene in an inverted transcriptional orientation. Comparative analysis confirms that the general organization of the TRB locus is similar among mammalian species, but the number of germline TRBV genes varies greatly even between species belonging to the same order, determining the diversity and specificity of the immune response. However, sequence analysis of the TRB locus also suggests the presence of blocks of conserved homology in the genomic region across mammals. Furthermore, by analysing a public cDNA collection, we identified the usage pattern of the TRBV, TRBD, and TRBJ genes in the adult pig TRB repertoire, and we noted that the expressed TRBV repertoire seems to be broader and more diverse than the germline repertoire, in line with the presence of a high level of TRBV gene polymorphisms. Because the nucleotide differences seem to be principally concentrated in the CDR2 region, it is reasonable to presume that most T cell β-chain diversity can be related to polymorphisms in pig MH molecules. Domestic pigs represent a valuable animal model as they are even more anatomically, genetically and physiologically similar to humans than are mice.
Therefore, present knowledge on the genomic organization of the pig TRB locus allows the collection of increased information on the basic aspects of the porcine immune system and contributes to filling the gaps left by rodent models.


July 7, 2019  |  

Signatures of selection and environmental adaptation across the goat genome post-domestication.

Since goat was domesticated 10,000 years ago, many factors have contributed to the differentiation of goat breeds and these are classified mainly into two types: (i) adaptation to different breeding systems and/or purposes and (ii) adaptation to different environments. As a result, approximately 600 goat breeds have developed worldwide; they differ considerably from one another in terms of phenotypic characteristics and are adapted to a wide range of climatic conditions. In this work, we analyzed the AdaptMap goat dataset, which is composed of data from more than 3000 animals collected worldwide and genotyped with the CaprineSNP50 BeadChip. These animals were partitioned into groups based on geographical area, production uses, available records on solid coat color and environmental variables including the sampling geographical coordinates, to investigate the role of natural and/or artificial selection in shaping the genome of goat breeds. Several signatures of selection on different chromosomal regions were detected across the different breeds, sub-geographical clusters, phenotypic and climatic groups. These regions contain genes that are involved in important biological processes, such as milk-, meat- or fiber-related production, coat color, glucose pathway, oxidative stress response, size, and circadian clock differences. Our results confirm previous findings in other species on adaptation to extreme environments and human purposes and provide new genes that could explain some of the differences between goat breeds according to their geographical distribution and adaptation to different environments. These analyses of signatures of selection provide a comprehensive first picture of the global domestication process and adaptation of goat breeds and highlight possible genes that may have contributed to the differentiation of this species worldwide.


July 7, 2019  |  

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

De novo assembly is the process of reconstructing genomes from DNA fragments (reads), which may contain redundancy and errors. Longer reads simplify assembly and improve contiguity of the output, but current long-read technologies come with high error rates. A crucial step of de novo genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), which implements a novel approach to compute overlaps using Sparse Generalized Matrix Multiplication (SpGEMM). We present a probabilistic model which demonstrates the soundness of using short, fixed-length k-mers to detect overlaps, avoiding expensive pairwise alignment of all reads against all others. We then introduce a notion of reliable k-mers based on our probabilistic model. The use of reliable k-mers eliminates both the k-mer set explosion that would otherwise happen with highly erroneous reads and the spurious overlaps due to k-mers originating from repetitive regions. Finally, we present a new method to separate true alignments from false positives depending on the alignment score. Using this methodology, which is employed in BELLA's precise mode, the probability of false positives drops exponentially as the length of overlap between sequences increases. On simulated data, BELLA achieves an average of 2.26% higher recall than state-of-the-art tools in its sensitive mode and 18.90% higher precision than state-of-the-art tools in its precise mode, while being performance competitive.
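The idea of finding candidate overlaps through sparse matrix multiplication can be sketched as follows. This is an illustration under assumptions, not BELLA's implementation. If A is a sparse reads-by-k-mers matrix with a 1 wherever a read contains a k-mer, then entry (i, j) of A·Aᵀ counts the k-mers shared by reads i and j. The function names and the `min_freq`/`max_freq` bounds are hypothetical placeholders; in BELLA the reliable k-mer range is derived from its probabilistic model, which is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(reads, k, min_freq=2, max_freq=4):
    """Upper triangle of A * A^T for a sparse reads-by-k-mers matrix A.

    Only k-mers seen in between min_freq and max_freq reads contribute
    (a stand-in for a reliable k-mer filter; the bounds are assumptions).
    Returns {(i, j): shared k-mer count} with i < j.
    """
    # Column view of A: each k-mer maps to the reads that contain it.
    columns = defaultdict(list)
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            columns[kmer].append(idx)

    # Accumulate the product one column at a time: every pair of reads
    # sharing a retained k-mer gains one count.
    overlaps = defaultdict(int)
    for readers in columns.values():
        if min_freq <= len(readers) <= max_freq:
            for i, j in combinations(readers, 2):
                overlaps[(i, j)] += 1
    return dict(overlaps)


reads = ["ACGTACGG", "TACGGTTA", "CCCCCCCC"]
candidates = shared_kmer_counts(reads, k=4)
```

Pairs with a nonzero count are only candidates; as the abstract describes, a later alignment step separates true overlaps from false positives. Dropping k-mers above `max_freq` mimics the exclusion of repeat-derived k-mers, and dropping those below `min_freq` discards k-mers that occur in a single read and so cannot link two reads.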


July 7, 2019  |  

iMGEins: detecting novel mobile genetic elements inserted in individual genomes.

Recent advances in sequencing technology have allowed us to investigate personal genomes to find structural variations, which have been studied extensively to identify their association with the physiology of diseases such as cancer. In particular, mobile genetic elements (MGEs) are one of the major constituents of the human genomes, and cause genome instability by insertion, mutation, and rearrangement. We have developed a new program, iMGEins, to identify such novel MGEs by using sequencing reads of individual genomes, and to explore the breakpoints with the supporting reads and MGEs detected. iMGEins is the first MGE detection program that integrates three algorithmic components: discordant read-pair mapping, split-read mapping, and insertion sequence assembly. Our evaluation results showed its outstanding performance in detecting novel MGEs from simulated genomes, as well as real personal genomes. In detail, the average recall and precision rates of iMGEins are 96.67 and 100%, respectively, which are the highest among the programs compared. In the testing with real human genomes of the NA12878 sample, iMGEins shows the highest accuracy in detecting MGEs within 20 bp proximity of the breakpoints annotated. In order to study the dynamics of MGEs in individual genomes, iMGEins was developed to accurately detect breakpoints and report inserted MGEs. Compared with other programs, iMGEins has valuable features of identifying novel MGEs and assembling the MGEs inserted.


July 7, 2019  |  

Bridging gaps in transposable element research with single-molecule and single-cell technologies

More than half of the genomic landscape in humans and many other organisms is composed of repetitive DNA, which mostly derives from transposable elements (TEs) and viruses. Recent technological advances permit improved assessment of the repetitive content across genomes and newly developed molecular assays have revealed important roles of TEs and viruses in host genome evolution and organization. To update on our current understanding of TE biology and to promote new interdisciplinary strategies for the TE research community, leading experts gathered for the 2nd Uppsala Transposon Symposium on October 4–5, 2018 in Uppsala, Sweden. Using cutting-edge single-molecule and single-cell approaches, research on TEs and other repeats has entered a new era in biological and biomedical research.


July 7, 2019  |  

Hardwood tree genomics: Unlocking woody plant biology.

Woody perennial angiosperms (i.e., hardwood trees) are polyphyletic in origin and occur in most angiosperm orders. Despite their independent origins, hardwoods have shared physiological, anatomical, and life history traits distinct from their herbaceous relatives. New high-throughput DNA sequencing platforms have provided access to numerous woody plant genomes beyond the early reference genomes of Populus and Eucalyptus, references that now include willow and oak, with pecan and chestnut soon to follow. Genomic studies within these diverse and undomesticated species have successfully linked genes to ecological, physiological, and developmental traits directly. Moreover, comparative genomic approaches are providing insights into speciation events while large-scale DNA resequencing of native collections is identifying population-level genetic diversity responsible for variation in key woody plant biology across and within species. Current research is focused on developing genomic prediction models for breeding, defining speciation and local adaptation, detecting and characterizing somatic mutations, revealing the mechanisms of gender determination and flowering, and application of systems biology approaches to model complex regulatory networks underlying quantitative traits. Emerging technologies such as single-molecule, long-read sequencing are being employed as additional woody plant species, and genotypes within species, are sequenced, thus enabling a comparative (“evo-devo”) approach to understanding the unique biology of large woody plants. Resource availability, current genomic and genetic applications, new discoveries and predicted future developments are illustrated and discussed for poplar, eucalyptus, willow, oak, chestnut, and pecan.

