April 21, 2020  |  

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Long-read RNA sequencing (RNA-seq) is promising to transcriptomics studies, however, the alignment of the reads is still a fundamental but non-trivial task due to the sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass long RNA-seq read alignment approach, which constructs graph-based alignment skeletons to sensitively infer exons, and use them to generate spliced reference sequence to produce refined alignments. deSALT addresses several difficult issues, such as small exons, serious sequencing errors and consensus spliced alignment. Benchmarks demonstrate that this approach has a better ability to produce high-quality full-length alignments, which has enormous potentials to transcriptomics studies.

April 21, 2020  |  

Whole-Genome Alignment and Comparative Annotation.

Rapidly improving sequencing technology coupled with computational developments in sequence assembly are making reference-quality genome assembly economical. Hundreds of vertebrate genome assemblies are now publicly available, and projects are being proposed to sequence thousands of additional species in the next few years. Such dense sampling of the tree of life should give an unprecedented new understanding of evolution and allow a detailed determination of the events that led to the wealth of biodiversity around us. To gain this knowledge, these new genomes must be compared through genome alignment (at the sequence level) and comparative annotation (at the gene level). However, different alignment and annotation methods have different characteristics; before starting a comparative genomics analysis, it is important to understand the nature of, and biases and limitations inherent in, the chosen methods. This review is intended to act as a technical but high-level overview of the field that should provide this understanding. We briefly survey the state of the genome alignment and comparative annotation fields and potential future directions for these fields in a new, large-scale era of comparative genomics.

April 21, 2020  |  

BGSA: a bit-parallel global sequence alignment toolkit for multi-core and many-core architectures.

Modern bioinformatics tools for analyzing large-scale NGS datasets often need to include fast implementations of core sequence alignment algorithms in order to achieve reasonable execution times. We address this need by presenting the BGSA toolkit for optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors.BGSA outperforms Edlib, SeqAn and BitPAl for pairwise edit distance computations and Parasail, SeqAn and BitPAl when using more general scoring schemes for pairwise alignments of a batch of sequence reads on both standard multi-core CPUs and Xeon Phi many-core CPUs. Furthermore, banded edit distance performance of BGSA on a Xeon Phi-7210 outperforms the highly optimized NVBio implementation on a Titan X GPU for the seed verification stage of a read mapper by a factor of 4.4.BGSA is open-source and available at https://github.com/sdu-hpcl/BGSA.Supplementary data are available at Bioinformatics online. © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

April 21, 2020  |  

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.

Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.The source code of the benchmarking tool is available as Supplementary Materials. © The Author 2017. Published by Oxford University Press.

April 21, 2020  |  

Informatics for PacBio Long Reads.

In this article, we review the development of a wide variety of bioinformatics software implementing state-of-the-art algorithms since the introduction of SMRT sequencing technology into the field. We focus on the three major categories of development: read mapping (aligning to reference genomes), de novo assembly, and detection of structural variants. The long SMRT reads benefit all the applications, but they are achievable only through considering the nature of the long reads technology properly.

April 21, 2020  |  

Featherweight long read alignment using partitioned reference indexes.

The advent of Nanopore sequencing has realised portable genomic research and applications. However, state of the art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimisation and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We incorporated multi-index merging into the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2?GB RAM with negligible impact on accuracy.

April 21, 2020  |  

Accurate high throughput alignment via line sweep-based seed processing.

Accurate and fast aligners are required to handle the steadily increasing volume of sequencing data. Here we present an approach allowing performant alignments of short reads (Illumina) as well as long reads (Pacific Bioscience, Ultralong Oxford Nanopore), while achieving high accuracy, based on a universal three-stage scheme. It is also suitable for the discovery of insertions and deletions that originate from structural variants. We comprehensively compare our approach to other state-of-the-art aligners in order to confirm its performance with respect to accuracy and runtime. As part of our algorithmic scheme, we introduce two line sweep-based techniques called “strip of consideration” and “seed harmonization”. These techniques represent a replacement for chaining and do not rely on any specially tailored data structures. Additionally, we propose a refined form of seeding on the foundation of the FMD-index.

April 21, 2020  |  

Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies.In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage.GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK .

April 21, 2020  |  

IMOS: improved Meta-aligner and Minimap2 On Spark.

Long reads provide valuable information regarding the sequence composition of genomes. Long reads are usually very noisy which renders their alignments on the reference genome a daunting task. It may take days to process datasets enough to sequence a human genome on a single node. Hence, it is of primary importance to have an aligner which can operate on distributed clusters of computers with high performance in accuracy and speed.In this paper, we presented IMOS, an aligner for mapping noisy long reads to the reference genome. It can be used on a single node as well as on distributed nodes. In its single-node mode, IMOS is an Improved version of Meta-aligner (IM) enhancing both its accuracy and speed. IM is up to 6x faster than the original Meta-aligner. It is also implemented to run IM and Minimap2 on Apache Spark for deploying on a cluster of nodes. Moreover, multi-node IMOS is faster than SparkBWA while executing both IM (1.5x) and Minimap2 (25x).In this paper, we purposed an architecture for mapping long reads to a reference. Due to its implementation, IMOS speed can increase almost linearly with respect to the number of nodes in a cluster. Also, it is a multi-platform application able to operate on Linux, Windows, and macOS.

October 23, 2019  |  

AAV-mediated delivery of zinc finger nucleases targeting hepatitis B virus inhibits active replication.

Despite an existing effective vaccine, hepatitis B virus (HBV) remains a major public health concern. There are effective suppressive therapies for HBV, but they remain expensive and inaccessible to many, and not all patients respond well. Furthermore, HBV can persist as genomic covalently closed circular DNA (cccDNA) that remains in hepatocytes even during otherwise effective therapy and facilitates rebound in patients after treatment has stopped. Therefore, the need for an effective treatment that targets active and persistent HBV infections remains. As a novel approach to treat HBV, we have targeted the HBV genome for disruption to prevent viral reactivation and replication. We generated 3 zinc finger nucleases (ZFNs) that target sequences within the HBV polymerase, core and X genes. Upon the formation of ZFN-induced DNA double strand breaks (DSB), imprecise repair by non-homologous end joining leads to mutations that inactivate HBV genes. We delivered HBV-specific ZFNs using self-complementary adeno-associated virus (scAAV) vectors and tested their anti-HBV activity in HepAD38 cells. HBV-ZFNs efficiently disrupted HBV target sites by inducing site-specific mutations. Cytotoxicity was seen with one of the ZFNs. scAAV-mediated delivery of a ZFN targeting HBV polymerase resulted in complete inhibition of HBV DNA replication and production of infectious HBV virions in HepAD38 cells. This effect was sustained for at least 2 weeks following only a single treatment. Furthermore, high specificity was observed for all ZFNs, as negligible off-target cleavage was seen via high-throughput sequencing of 7 closely matched potential off-target sites. These results show that HBV-targeted ZFNs can efficiently inhibit active HBV replication and suppress the cellular template for HBV persistence, making them promising candidates for eradication therapy.

Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.