Menu
July 7, 2019  |  

Measuring the mappability spectrum of reference genome assemblies

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject’s genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools we computed the -mappability spectrum” for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with less base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality. NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.


July 7, 2019  |  

Recombination hotspots in an extended human pseudoautosomal domain predicted from double-strand break maps and characterized by sperm-based crossover analysis.

The human X and Y chromosomes are heteromorphic but share a region of homology at the tips of their short arms, pseudoautosomal region 1 (PAR1), that supports obligate crossover in male meiosis. Although the boundary between pseudoautosomal and sex-specific DNA has traditionally been regarded as conserved among primates, it was recently discovered that the boundary position varies among human males, due to a translocation of ~110 kb from the X to the Y chromosome that creates an extended PAR1 (ePAR). This event has occurred at least twice in human evolution. So far, only limited evidence has been presented to suggest this extension is recombinationally active. Here, we sought direct proof by examining thousands of gametes from each of two ePAR-carrying men, for two subregions chosen on the basis of previously published male X-chromosomal meiotic double-strand break (DSB) maps. Crossover activity comparable to that seen at autosomal hotspots was observed between the X and the ePAR borne on the Y chromosome both at a distal and a proximal site within the 110-kb extension. Other hallmarks of classic recombination hotspots included evidence of transmission distortion and GC-biased gene conversion. We observed good correspondence between the male DSB clusters and historical recombination activity of this region in the X chromosomes of females, as ascertained from linkage disequilibrium analysis; this suggests that this region is similarly primed for crossover in both male and female germlines, although sex-specific differences may also exist. Extensive resequencing and inference of ePAR haplotypes, placed in the framework of the Y phylogeny as ascertained by both Y microsatellites and single nucleotide polymorphisms, allowed us to estimate a minimum rate of crossover over the entire ePAR region of 6-fold greater than genome average, comparable with pedigree estimates of PAR1 activity generally. We conclude ePAR very likely contributes to the critical crossover function of PAR1.


July 7, 2019  |  

Genome analysis of Mycobacterium avium subspecies hominissuis strain 109.

Infection with Mycobacterium avium is a significant cause of morbidity and its treatment requires the use of multiple antibiotics for more than 12 months. In the current work, we provide the genome sequence, gene annotations, gene ontology annotations, and protein homology data for M. avium strain 109 (MAC109), which has been used extensively in preclinical studies. The de novo assembled genome consists of a circular chromosome of length 5,188,883?bp and two circular plasmids of sizes 147,100?bp and 16,516?bp. We have named the plasmids pMAC109a and pMAC109b, respectively. Based on its genome, we confirm that MAC109 should be classified as Mycobacterium avium subsp. hominissuis. Using genome annotation software, we identified 4,841 coding sequences and annotated these with Gene Ontology (GO) terms. Additionally, we wrote software to generate a database of homologous proteins among MAC109 and eight other commonly used mycobacterial laboratory strains. The resulting database may be useful for translating genetic data between various strains of mycobacteria, and the software may be applied readily to other organisms.


July 7, 2019  |  

Pilot satellitome analysis of the model plant, Physcomitrellapatens, revealed a transcribed and high-copy IGS related tandem repeat.

Satellite DNA (satDNA) constitutes a substantial part of eukaryotic genomes. In the last decade, it has been shown that satDNA is not an inert part of the genome and its function extends beyond the nuclear membrane. However, the number of model plant species suitable for studying the novel horizons of satDNA functionality is low. Here, we explored the satellitome of the model “basal” plant, Physcomitrellapatens (Hedwig, 1801) Bruch & Schimper, 1849 (moss), which has a number of advantages for deep functional and evolutionary research. Using a newly developed pyTanFinder pipeline (https://github.com/Kirovez/pyTanFinder) coupled with fluorescence in situ hybridization (FISH), we identified five high copy number tandem repeats (TRs) occupying a long DNA array in the moss genome. The nuclear organization study revealed that two TRs had distinct locations in the moss genome, concentrating in the heterochromatin and knob-rDNA like chromatin bodies. Further genomic, epigenetic and transcriptomic analysis showed that one TR, named PpNATR76, was located in the intergenic spacer (IGS) region and transcribed into long non-coding RNAs (lncRNAs). Several specific features of PpNATR76 lncRNAs make them very similar with the recently discovered human lncRNAs, raising a number of questions for future studies. This work provides new resources for functional studies of satellitome in plants using the model organism P.patens, and describes a list of tandem repeats for further analysis.


July 7, 2019  |  

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

De novo assembly is the process of reconstructing genomes from DNA fragments (reads), which may contain redundancy and errors. Longer reads simplify assembly and improve contiguity of the output, but current long-read technologies come with high error rates. A crucial step of de novo genome assembly for long reads consists of finding overlapping reads. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), which implement a novel approach to compute overlaps using Sparse Generalized Matrix Multiplication (SpGEMM). We present a probabilistic model which demonstrates the soundness of using short, fixed length k-mers to detect overlaps, avoiding expensive pairwise alignment of all reads against all others. We then introduce a notion of reliable k-mers based on our probabilistic model. The use of reliable k-mers eliminates both the k-mer set explosion that would otherwise happen with highly erroneous reads and the spurious overlaps due to k-mers originating from repetitive regions. Finally, we present a new method to separate true alignments from false positives depending on the alignment score. Using this methodology, which is employed in BELLAtextquoterights precise mode, the probability of false positives drops exponentially as the length of overlap between sequences increases. On simulated data, BELLA achieves an average of 2.26% higher recall than state-of-the-art tools in its sensitive mode and 18.90% higher precision than state-of-the-art tools in its precise mode, while being performance competitive.


July 7, 2019  |  

Hardwood tree genomics: Unlocking woody plant biology.

Woody perennial angiosperms (i.e., hardwood trees) are polyphyletic in origin and occur in most angiosperm orders. Despite their independent origins, hardwoods have shared physiological, anatomical, and life history traits distinct from their herbaceous relatives. New high-throughput DNA sequencing platforms have provided access to numerous woody plant genomes beyond the early reference genomes of Populus and Eucalyptus, references that now include willow and oak, with pecan and chestnut soon to follow. Genomic studies within these diverse and undomesticated species have successfully linked genes to ecological, physiological, and developmental traits directly. Moreover, comparative genomic approaches are providing insights into speciation events while large-scale DNA resequencing of native collections is identifying population-level genetic diversity responsible for variation in key woody plant biology across and within species. Current research is focused on developing genomic prediction models for breeding, defining speciation and local adaptation, detecting and characterizing somatic mutations, revealing the mechanisms of gender determination and flowering, and application of systems biology approaches to model complex regulatory networks underlying quantitative traits. Emerging technologies such as single-molecule, long-read sequencing is being employed as additional woody plant species, and genotypes within species, are sequenced, thus enabling a comparative (“evo-devo”) approach to understanding the unique biology of large woody plants. Resource availability, current genomic and genetic applications, new discoveries and predicted future developments are illustrated and discussed for poplar, eucalyptus, willow, oak, chestnut, and pecan.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.