July 7, 2019

COSINE: non-seeding method for mapping long noisy sequences.

Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors. COSINE computes the context similarity of two stretches of nucleobases given the similarity over distributions of their short k-mers (k = 3-4) along the sequences. The results on simulated and real data show that COSINE achieves high sensitivity and specificity under a wide range of read accuracies. When the error rate is high, COSINE can offer substantial advantages over existing alignment methods. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
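To make the idea of comparing short k-mer distributions concrete, here is a minimal Python sketch. It is not the COSINE implementation: the function names `kmer_counts` and `cosine_similarity` are hypothetical, and a plain cosine between whole-sequence k-mer count vectors is an assumption chosen only to illustrate the general principle.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq (illustrative helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b, k=3):
    """Cosine of the angle between the k-mer count vectors of a and b.

    Returns 0.0 when either sequence is shorter than k.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Only k-mers present in both sequences contribute to the dot product.
    dot = sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the score depends on k-mer composition rather than on exact seed matches, a read with scattered errors can still score well against its true origin, which is the intuition behind a non-seeding approach.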


July 7, 2019

A recurrence-based approach for validating structural variation using long-read sequencing technology.

Although numerous algorithms have been developed to identify structural variations (SVs) in genomic sequences, there is a dearth of approaches that can be used to evaluate their results. This is significant as the accurate identification of structural variation is still an outstanding but important problem in genomics. The emergence of new sequencing technologies that generate longer sequence reads can, in theory, provide direct evidence for all types of SVs regardless of the length of the region they span. However, current efforts to use these data in this manner require the use of large computational resources to assemble these sequences as well as visual inspection of each region. Here we present VaPoR, a highly efficient algorithm that autonomously validates large SV sets using long-read sequencing data. We assessed the performance of VaPoR on SVs in both simulated and real genomes and report a high-fidelity rate for overall accuracy across different levels of sequence depths. We show that VaPoR can interrogate a much larger range of SVs while still matching existing methods in terms of false positive validations and providing additional features considering breakpoint precision and predicted genotype. We further show that VaPoR can run quickly and efficiently without requiring a large processing or assembly pipeline. VaPoR provides a long read-based validation approach for genomic SVs that requires relatively low read depth and computing resources and thus will provide utility with targeted or low-pass sequencing coverage for accurate SV assessment. The VaPoR software is available at: https://github.com/mills-lab/vapor. © The Authors 2017. Published by Oxford University Press.


July 7, 2019

The state of whole-genome sequencing

Over the last decade, a technological paradigm shift has slashed the cost of DNA sequencing by over five orders of magnitude. Today, the cost of sequencing a human genome is a few thousand dollars, and it continues to fall. Here, we review the most cost-effective platforms for whole-genome sequencing (WGS) as well as emerging technologies that may displace or complement these. We also discuss the practical challenges of generating and analyzing WGS data, and how WGS has unlocked new strategies for discovering genes and variants underlying both rare and common human diseases.


July 7, 2019

Fragmentation of surface adsorbed and aligned DNA molecules using soft lithography for next-generation sequencing

In this study, the enzymatic in situ cutting of linearized DNA molecules at approximately 11 kbp intervals is demonstrated using a soft lithography technique. The ultimate goal is to provide a general ordered cutting method to greatly simplify the assembly process. DNA was stretched onto PMMA (poly(methyl methacrylate)) coated silicon by withdrawing the substrate from a DNA solution (a process termed “combing”). The stretched lambda DNA could be linearly cut with a soft lithography stamp used to selectively apply DNase I. After the DNA was cut on the substrate, the fragments were removed from the surface by incubating the PMMA in the commercial NEBuffer 3.1 at 75°C. The recovered fragments desorbed into the buffer and were sequenced using the PacBio RS II sequencer without an amplification step. The mean coverage was 2870X for the approximately 11 kbp fragmented sample and 100% of the lambda genome was sequenced. Methods to extend the technique to ordered fragmentation are discussed.


July 7, 2019

Two orangutan species have evolved different KIR alleles and haplotypes.

The immune and reproductive functions of human NK cells are regulated by interactions of the C1 and C2 epitopes of HLA-C with C1-specific and C2-specific lineage III killer cell Ig-like receptors (KIR). This rapidly evolving and diverse system of ligands and receptors is restricted to humans and great apes. In this context, the orangutan has particular relevance because it represents an evolutionary intermediate, one having the C1 epitope and corresponding KIR but lacking the C2 epitope. Through a combination of direct sequencing, KIR genotyping, and data mining from the Great Ape Genome Project, we characterized the KIR alleles and haplotypes for panels of 10 Bornean orangutans and 19 Sumatran orangutans. The orangutan KIR haplotypes have between 5 and 10 KIR genes. The seven orangutan lineage III KIR genes all locate to the centromeric region of the KIR locus, whereas their human counterparts also populate the telomeric region. One lineage III KIR gene is Bornean specific, one is Sumatran specific, and five are shared. Of 12 KIR gene-content haplotypes, 5 are Bornean specific, 5 are Sumatran specific, and 2 are shared. The haplotypes have different combinations of genes encoding activating and inhibitory C1 receptors that can be of higher or lower affinity. All haplotypes encode an inhibitory C1 receptor, but only some haplotypes encode an activating C1 receptor. Of 130 KIR alleles, 55 are Bornean specific, 65 are Sumatran specific, and 10 are shared. Copyright © 2017 by The American Association of Immunologists, Inc.


July 7, 2019

Integrating transcriptomic and proteomic data for accurate assembly and annotation of genomes.

Complementing genome sequence with deep transcriptome and proteome data could enable more accurate assembly and annotation of newly sequenced genomes. Here, we provide a proof-of-concept of an integrated approach for analysis of the genome and proteome of Anopheles stephensi, which is one of the most important vectors of the malaria parasite. To achieve broad coverage of genes, we carried out transcriptome sequencing and deep proteome profiling of multiple anatomically distinct sites. Based on transcriptomic data alone, we identified and corrected 535 events of incomplete genome assembly involving 1196 scaffolds and 868 protein-coding gene models. This proteogenomic approach enabled us to add 365 genes that were missed during genome annotation and identify 917 gene correction events through discovery of 151 novel exons, 297 protein extensions, 231 exon extensions, 192 novel protein start sites, 19 novel translational frames, 28 events of joining of exons, and 76 events of joining of adjacent genes as a single gene. Incorporation of proteomic evidence allowed us to change the designation of more than 87 predicted “noncoding RNAs” to conventional mRNAs coded by protein-coding genes. Importantly, extension of the newly corrected genome assemblies and gene models to 15 other newly assembled Anopheline genomes led to the discovery of a large number of apparent discrepancies in assembly and annotation of these genomes. Our data provide a framework for how future genome sequencing efforts should incorporate transcriptomic and proteomic analysis in combination with simultaneous manual curation to achieve near complete assembly and accurate annotation of genomes. © 2017 Prasad et al.; Published by Cold Spring Harbor Laboratory Press.


July 7, 2019

Institutional profile: translational pharmacogenomics at the Icahn School of Medicine at Mount Sinai.

For almost 50 years, the Icahn School of Medicine at Mount Sinai has continually invested in genetics and genomics, facilitating a healthy ecosystem that provides widespread support for the ongoing programs in translational pharmacogenomics. These programs can be broadly cataloged into discovery, education, clinical implementation and testing, which are collaboratively accomplished by multiple departments, institutes, laboratories, companies and colleagues. Focus areas have included drug response association studies and allele discovery, multiethnic pharmacogenomics, personalized genotyping and survey-based education programs, pre-emptive clinical testing implementation and novel assay development. This overview summarizes the current state of translational pharmacogenomics at Mount Sinai, including a future outlook on the forthcoming expansions in overall support, research and clinical programs, genomic technology infrastructure and the participating faculty.


July 7, 2019

HapCol: accurate and memory-efficient haplotype assembly from long reads.

Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of ‘future-generation’ sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performance degrades as read length and sequencing coverage increase or because they are based on restrictive assumptions. By exploiting a feature of future-generation technologies (the uniform distribution of sequencing errors), we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, making it possible to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Contact: bonizzoni@disco.unimib.it. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
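The notion of an error-correction score can be illustrated with a toy Python sketch. This is not HapCol's algorithm: it is a brute-force search over every split of the reads into two haplotype groups, feasible only for a handful of reads, and the names `correction_cost` and `best_bipartition` as well as the read encoding (a dict from SNP position to allele 0 or 1) are assumptions made for the example.

```python
from itertools import product


def correction_cost(reads, assignment):
    """Number of allele calls that must be flipped so that all reads placed
    in the same haplotype group agree at every position they cover."""
    cost = 0
    for group in (0, 1):
        tallies = {}  # position -> [count of allele 0, count of allele 1]
        for read, g in zip(reads, assignment):
            if g != group:
                continue
            for pos, allele in read.items():
                tallies.setdefault(pos, [0, 0])[allele] += 1
        # At each position, the minority allele calls are the ones corrected.
        cost += sum(min(counts) for counts in tallies.values())
    return cost


def best_bipartition(reads):
    """Exhaustively find the read-to-haplotype assignment with minimum cost.

    Exponential in the number of reads; for illustration only.
    """
    best = None
    for assignment in product((0, 1), repeat=len(reads)):
        cost = correction_cost(reads, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best
```

Note that nothing in this scoring forces the two groups to carry different alleles at a position, so homozygous sites are allowed, loosely mirroring the abstract's point about dropping the all-heterozygous assumption.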


July 7, 2019

Genomic resources and their influence on the detection of the signal of positive selection in genome scans.

Genome scans represent powerful approaches to investigate the action of natural selection on the genetic variation of natural populations and to better understand local adaptation. This is very useful, for example, in the field of conservation biology and evolutionary biology. Thanks to Next Generation Sequencing, genomic resources are growing exponentially, improving genome scan analyses in non-model species. Thousands of SNPs called using Reduced Representation Sequencing are increasingly used in genome scans. Besides, genome sequences are also becoming increasingly available, allowing better processing of short-read data, offering physical localization of variants, and improving haplotype reconstruction and data imputation. Ultimately, genome sequences are also becoming the raw material for selection inferences. Here, we discuss how the increasing availability of such genomic resources, notably genome sequences, influences the detection of signals of selection. Mainly, increasing data density and having the information of physical linkage data expand genome scans by (i) improving the overall quality of the data, (ii) helping the reconstruction of demographic history for the population studied to decrease false-positive rates and (iii) improving the statistical power of methods to detect the signal of selection. Of particular importance, the availability of a high-quality reference genome can improve the detection of the signal of selection by (i) allowing matching the potential candidate loci to linked coding regions under selection, (ii) rapidly moving the investigation to the gene and function and (iii) ensuring that the highly variable regions of the genomes that include functional genes are also investigated. For all those reasons, using reference genomes in genome scan analyses is highly recommended. © 2015 John Wiley & Sons Ltd.


July 7, 2019

rHAT: fast alignment of noisy long reads with regional hashing.

Single Molecule Real-Time (SMRT) sequencing has been widely applied in cutting-edge genomic studies. However, aligning the noisy long SMRT reads to a reference genome with state-of-the-art aligners is still expensive, which is becoming a bottleneck in applications with SMRT sequencing. A novel approach is needed to improve the efficiency and effectiveness of SMRT read alignment. We propose Regional Hashing-based Alignment Tool (rHAT), a seed-and-extension-based read alignment approach specifically designed for noisy long reads. rHAT indexes the reference genome by a regional hash table (RHT), a hash table-based index which describes the short tokens within local windows of the reference genome. In the seeding phase, rHAT utilizes the RHT for efficiently calculating the occurrences of short token matches between a partial read and local genomic windows to find highly likely candidate sites. In the extension phase, a sparse dynamic programming-based heuristic approach is used for reducing the cost of aligning the read to the candidate sites. By benchmarking on real and simulated datasets from various prokaryote and eukaryote genomes, we demonstrated that rHAT can effectively align SMRT reads with outstanding throughput. rHAT is implemented in C++; the source code is available at https://github.com/derekguan/rHAT. Contact: ydwang@hit.edu.cn. © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
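The seeding idea described above, counting short token matches between part of a read and local reference windows, can be sketched in Python as follows. This is a simplified illustration and not rHAT's C++ data structure: the function names, the window size of 20, the token length of 4, and the use of a dict of sets are all assumptions for the example.

```python
from collections import Counter, defaultdict


def build_regional_index(reference, window=20, k=4):
    """Map each k-mer token to the set of reference windows containing it.

    Each window is extended by k - 1 bases so tokens that straddle a window
    boundary are still indexed.
    """
    index = defaultdict(set)
    for start in range(0, len(reference), window):
        region = reference[start:start + window + k - 1]
        window_id = start // window
        for i in range(len(region) - k + 1):
            index[region[i:i + k]].add(window_id)
    return index


def rank_windows(read, index, k=4):
    """Count read tokens found in each window; most-hit windows come first."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        for window_id in index.get(read[i:i + k], ()):
            hits[window_id] += 1
    return hits.most_common()
```

In a seed-and-extend pipeline the top-ranked windows would then be passed to a more expensive alignment step; that extension phase is not shown here.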


July 7, 2019

Novel FANCI mutations in Fanconi anemia with VACTERL association.

Fanconi anemia (FA) is an inherited bone marrow failure syndrome caused by mutations in DNA repair genes; some of these patients may have features of the VACTERL association. Autosomal recessive mutations in FANCI are a rare cause of FA. We identified FANCI mutations by next generation sequencing in three patients in our FA cohort among several whose mutated gene was unknown. Four of the six mutations are novel and all mutations are likely deleterious to protein function. There are now 16 reported cases of FA due to FANCI, of whom 7 have at least 3 features of the VACTERL association (44%). This suggests that the VACTERL association in patients with FA may be seen in patients with FANCI mutations more often than previously recognized. © 2015 Wiley Periodicals, Inc.


July 7, 2019

Timing, rates and spectra of human germline mutation.

Germline mutations are a driving force behind genome evolution and genetic disease. We investigated genome-wide mutation rates and spectra in multi-sibling families. The mutation rate increased with paternal age in all families, but the number of additional mutations per year differed by more than twofold between families. Meta-analysis of 6,570 mutations showed that germline methylation influences mutation rates. In contrast to somatic mutations, we found remarkable consistency in germline mutation spectra between the sexes and at different paternal ages. In parental germ line, 3.8% of mutations were mosaic, resulting in 1.3% of mutations being shared by siblings. The number of these shared mutations varied significantly between families. Our data suggest that the mutation rate per cell division is higher during both early embryogenesis and differentiation of primordial germ cells but is reduced substantially during post-pubertal spermatogenesis. These findings have important consequences for the recurrence risks of disorders caused by de novo mutations.


July 7, 2019

Read-based phasing of related individuals.

Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information (reads and pedigree) has the potential to deliver results better than each individually. We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. Availability: https://bitbucket.org/whatshap/whatshap. Contact: t.marschall@mpi-inf.mpg.de. © The Author 2016. Published by Oxford University Press.


July 7, 2019

Resolving complex structural genomic rearrangements using a randomized approach.

Complex chromosomal rearrangements are structural genomic alterations involving multiple instances of deletions, duplications, inversions, or translocations that co-occur either on the same chromosome or represent different overlapping events on homologous chromosomes. We present SVelter, an algorithm that identifies regions of the genome suspected to harbor a complex event and then resolves the structure by iteratively rearranging the local genome structure, in a randomized fashion, with each structure scored against characteristics of the observed sequencing data. SVelter is able to accurately reconstruct complex chromosomal rearrangements when compared to well-characterized genomes that have been deeply sequenced with both short and long reads.


July 7, 2019

Genomic analyses reveal that partial sequence of an earlier pseudorabies virus in China is originated from a Bartha-vaccine-like strain.

Pseudorabies virus (PRV), the causative agent of Aujeszky's disease, has gained increased attention in China in recent years as a result of the outbreak of emergent pseudorabies. Several genomic and partial sequences are available for Chinese emergent and European-American strains of PRV, but limited sequence data exist for the earlier Chinese strains. In this study, we determined the complete genomic sequence of one earlier Chinese strain SC and one emergent strain HLJ8. Compared with other known sequences, we demonstrated that PRV strains from distinct geographical regions displayed divergent evolution. Additionally, we report, for the first time, a recombination event between PRV strains, and show that strain SC is a recombinant of an endemic Chinese strain and a Bartha-vaccine-like strain. These results contribute to our understanding of PRV evolution. Copyright © 2016 Elsevier Inc. All rights reserved.

