September 22, 2019

HapIso: An accurate method for the haplotype-specific isoforms reconstruction from long single-molecule reads

Sequencing of RNA provides the possibility to study an individual’s transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads that are longer than most transcripts, allowing complete haplotype isoforms to be sequenced. This makes it possible to partition the reads into two parental haplotypes. While the reads produced by single-molecule protocols are long, their relatively high error rate limits the ability to accurately detect genetic variants and assemble them into haplotype-specific isoforms. In this paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a method able to tolerate the relatively high error rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between read errors and true biological mutations. HapIso uses a k-means clustering algorithm that aims to group the reads into two meaningful clusters, maximizing the similarity of the reads within a cluster and minimizing the similarity of reads from different clusters. Each cluster corresponds to a parental haplotype. We use family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error rate and accurately partition the reads into the parental alleles of the isoform transcripts. Furthermore, our method is the first method able to reconstruct haplotype-specific isoforms from long single-molecule reads. The open source Python implementation of HapIso is freely available for download at https://github.com/smangul1/HapIso/.
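The clustering idea in this abstract can be illustrated with a small sketch. It is not HapIso's code: the read encoding (one 0/1 allele per variant site, `None` where a read does not cover the site), the distance function, the deterministic seeding, and the helper names `distance`, `update_centroid`, and `two_means` are all assumptions made for illustration.

```python
from typing import List, Optional

# Assumed encoding: 0/1 allele per variant site, None = site not covered.
Read = List[Optional[int]]


def distance(read: Read, centroid: List[float]) -> float:
    """Sum of absolute differences over the sites the read covers."""
    return sum(abs(a - c) for a, c in zip(read, centroid) if a is not None)


def update_centroid(members: List[Read], old: List[float]) -> List[float]:
    """Per-site mean allele over covering reads; keep the old value if none cover."""
    new = []
    for site, fallback in enumerate(old):
        observed = [r[site] for r in members if r[site] is not None]
        new.append(sum(observed) / len(observed) if observed else fallback)
    return new


def two_means(reads: List[Read], iterations: int = 20) -> List[int]:
    """Assign each read to cluster 0 or 1 (hypothetical helper, k = 2)."""
    # Deterministic seeding: the first read, and the read farthest from it.
    seed0 = [0.5 if a is None else float(a) for a in reads[0]]
    far = max(reads, key=lambda r: distance(r, seed0))
    seed1 = [0.5 if a is None else float(a) for a in far]
    centroids = [seed0, seed1]
    labels: List[int] = []
    for _ in range(iterations):
        new_labels = [
            0 if distance(r, centroids[0]) <= distance(r, centroids[1]) else 1
            for r in reads
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        for k in (0, 1):
            members = [r for r, lab in zip(reads, labels) if lab == k]
            if members:
                centroids[k] = update_centroid(members, centroids[k])
    return labels


reads = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],     # one sequencing error at the third site
    [1, 1, 1, 1],
    [1, 1, None, 1],  # third site not covered
    [0, None, 0, 0],
    [1, 0, 1, 1],     # one sequencing error at the second site
]
labels = two_means(reads)
```

In this toy input the two reads carrying a single error still land in the cluster of their haplotype of origin, which is the behavior the abstract relies on to separate read errors from true variants.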

September 22, 2019

Progressive approach for SNP calling and haplotype assembly using single molecular sequencing data.

Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. New technologies can provide Single Molecular Sequencing (SMS) data that cover about 90% of positions over chromosomes. However, SMS data have a higher error rate than the 1% error rate of short reads. Thus, SNP calling and haplotype assembly using SMS reads become very difficult. Most existing technologies do not work properly for SMS data. In this paper, we develop a progressive approach for SNP calling and haplotype assembly that works very well for SMS data. Our method can handle more than 200 million non-N bases on Chromosome 1 with millions of reads and more than 100 blocks, each of which contains more than 2 million bases and more than 3K SNP sites on average. Experimental results show that the false discovery rate and false negative rate for our method are 15.7% and 11.0% on NA12878, and 16.5% and 11.0% on NA24385. Moreover, the overall switch errors for our method are 7.26 and 5.21 with an average of 3378 and 5736 SNP sites per block on NA12878 and NA24385, respectively. Here, we demonstrate that SMS reads alone can generate a high-quality solution for both SNP calling and haplotype assembly. Source codes and results are available at
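The evaluation metrics named in this abstract can be sketched as follows. The definitions used here are the common ones (false discovery rate over called sites, false negative rate over true sites, and switch errors as phase flips between consecutive sites in a block); the abstract does not state its exact formulas or units, so the helper names `fdr_fnr` and `switch_errors` and the toy inputs are assumptions.

```python
from typing import Sequence, Set, Tuple


def fdr_fnr(called: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """False discovery rate and false negative rate for sets of SNP positions."""
    tp = len(called & truth)  # called and real
    fp = len(called - truth)  # called but not real
    fn = len(truth - called)  # real but missed
    fdr = fp / (tp + fp) if called else 0.0
    fnr = fn / (tp + fn) if truth else 0.0
    return fdr, fnr


def switch_errors(predicted: Sequence[int], truth: Sequence[int]) -> int:
    """Count phase flips between consecutive sites of one haplotype block.

    Both arguments give the allele (0 or 1) placed on the first haplotype at
    each heterozygous site. A block that is entirely complemented counts as
    zero switches, because the labeling of the two haplotypes is arbitrary.
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    agree = [p == t for p, t in zip(predicted, truth)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


fdr, fnr = fdr_fnr({100, 200, 300, 400}, {100, 200, 300, 500, 600})
n_switches = switch_errors([0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0])
```

A switch error rate is often obtained by dividing the switch count by the number of adjacent site pairs in the block, though that normalization is also an assumption here.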

September 22, 2019

HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads.

Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes owing to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages. Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes. Our method leverages the long-range information of sequencing reads, which allows it to obtain assembled haplotypes fragmented into a lower number of unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result. HapCHAT is available under the GNU Public License (GPL).
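The objective of minimizing the total number of error corrections can be made concrete with a toy example. The sketch below scores a candidate haplotype by the number of read alleles that would have to be corrected for every read to match either the haplotype or its complement, then finds the best haplotype by exhaustive search. This brute force is only feasible for a handful of sites and is not the adaptive dynamic programming the abstract describes; the `Fragment` encoding and the function names are assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed encoding: variant index -> observed allele (0 or 1) for one read.
Fragment = Dict[int, int]
Haplotype = Tuple[int, ...]


def corrections(fragment: Fragment, haplotype: Haplotype) -> int:
    """Number of alleles in the read that disagree with the haplotype."""
    return sum(1 for site, allele in fragment.items() if haplotype[site] != allele)


def mec_cost(fragments: List[Fragment], haplotype: Haplotype) -> int:
    """Total corrections when each read is assigned to its closer haplotype."""
    complement = tuple(1 - a for a in haplotype)
    return sum(
        min(corrections(f, haplotype), corrections(f, complement))
        for f in fragments
    )


def brute_force_mec(fragments: List[Fragment], n_sites: int) -> Tuple[Haplotype, int]:
    """Try all 2**n_sites haplotypes; only usable for very small n_sites."""
    best = min(product((0, 1), repeat=n_sites), key=lambda h: mec_cost(fragments, h))
    return best, mec_cost(fragments, best)


fragments = [
    {0: 0, 1: 0, 2: 0},
    {0: 1, 1: 1, 2: 1},
    {1: 0, 2: 0, 3: 1},
    {1: 1, 2: 1, 3: 0},
    {0: 0, 1: 1, 2: 0},  # carries one error at site 1
]
best, cost = brute_force_mec(fragments, n_sites=4)
```

Higher coverage adds more fragments per site, which is why an exhaustive or coverage-exponential search becomes impractical and a method that bounds the number of corrections per site is attractive.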

September 22, 2019

A graph-based approach to diploid genome assembly.

Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Supplementary data are available at Bioinformatics online.

September 22, 2019

The genomic architecture and molecular evolution of ant odorant receptors.

The massive expansions of odorant receptor (OR) genes in ant genomes are notable examples of rapid genome evolution and adaptive gene duplication. However, the molecular mechanisms leading to gene family expansion remain poorly understood, partly because available ant genomes are fragmentary. Here, we present a highly contiguous, chromosome-level assembly of the clonal raider ant genome, revealing the largest known OR repertoire in an insect. While most ant ORs originate via local tandem duplication, we also observe several cases of dispersed duplication followed by tandem duplication in the most rapidly evolving OR clades. We found that areas of unusually high transposable element density (TE islands) were depauperate in ORs in the clonal raider ant, and found no evidence for retrotransposition of ORs. However, OR loci were enriched for transposons relative to the genome as a whole, potentially facilitating tandem duplication by unequal crossing over. We also found that ant OR genes are highly AT-rich compared to other genes. In contrast, in flies, OR genes are dispersed and largely isolated within the genome, and we find that fly ORs are not AT-rich. The genomic architecture and composition of ant ORs thus show convergence with the unrelated vertebrate ORs rather than the related fly ORs. This might be related to the greater gene numbers and/or potential similarities in gene regulation between ants and vertebrates as compared to flies. © 2018 McKenzie and Kronauer; Published by Cold Spring Harbor Laboratory Press.

September 22, 2019

Integrative haplotype estimation with sub-linear complexity

The number of human genomes being genotyped or sequenced increases exponentially, and efficient haplotype estimation methods able to handle this amount of data are now required. Here, we present a new method, SHAPEIT4, which substantially improves upon other methods for processing large genotype and high-coverage sequencing datasets. It notably exhibits sub-linear scaling with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.
