Menu
July 7, 2019  |  

Challenges, solutions, and quality metrics of personal genome assembly in advancing precision medicine.

Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structure in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived with the DNA from multiple individuals. As a result of this, the reference genome is incomplete and may misrepresent the sequence variants of the general population. The more reliable solution is to compare sequences of diseased tissue with its own genome sequence derived from tissue in a normal state. As the price to sequence the human genome has dropped dramatically to around $1000, it shows a promising future of documenting the personal genome for every individual. However, de novo assembly of individual genomes at an affordable cost is still challenging. Thus, till now, only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging “third generation sequencing” technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructures. We recommend that a combined hybrid assembly with long and short reads would be a promising way to generate good quality human genome assemblies and specify parameters for the quality assessment of assembly outcomes. We provide a perspective view of the benefit of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the usage of the personal genome in aiding vaccine design and development, monitoring host immune-response, tailoring drug therapy and detecting tumors. We believe the precision medicine would largely benefit from bioinformatics solutions, particularly for personal genome assembly.


July 7, 2019  |  

The complete chloroplast genome sequence of the medicinal plant Swertia mussotii using the PacBio RS II platform.

Swertia mussotii is an important medicinal plant that has great economic and medicinal value and is found on the Qinghai Tibetan Plateau. The complete chloroplast (cp) genome of S. mussotii is 153,431 bp in size, with a pair of inverted repeat (IR) regions of 25,761 bp each that separate an large single-copy (LSC) region of 83,567 bp and an a small single-copy (SSC) region of 18,342 bp. The S. mussotii cp genome encodes 84 protein-coding genes, 37 transfer RNA (tRNA) genes, and eight ribosomal RNA (rRNA) genes. The identity, number, and GC content of S. mussotii cp genes were similar to those in the genomes of other Gentianales species. Via analysis of the repeat structure, 11 forward repeats, eight palindromic repeats, and one reverse repeat were detected in the S. mussotii cp genome. There are 45 SSRs in the S. mussotii cp genome, the majority of which are mononucleotides found in all other Gentianales species. An entire cp genome comparison study of S. mussotii and two other species in Gentianaceae was conducted. The complete cp genome sequence provides intragenic information for the cp genetic engineering of this medicinal plant.


July 7, 2019  |  

Privacy-preserving read mapping using locality sensitive hashing and secure kmer voting

The recent explosion in the amount of available genome sequencing data imposes high computational demands on the tools designed to analyze it. Low-cost cloud computing has the potential to alleviate this burden. However, moving personal genome data analysis to the cloud raises serious privacy concerns. Read alignment is a critical and computationally intensive first step of most genomic data analysis pipelines. While significant effort has been dedicated to optimize the sensitivity and runtime efficiency of this step, few approaches have addressed outsourcing this computation securely to an untrusted party. The few secure solutions that have been proposed either do not scale to whole genome sequencing datasets or are not competitive with the state of the art in read mapping. In this paper, we present BALAUR, a privacy-preserving read mapping algorithm based on locality sensitive hashing and secure kmer voting. BALAUR securely outsources a significant portion of the computation to the public cloud by formulating the alignment task as a voting scheme between encrypted read and reference kmers. Our approach can easily handle typical genome-scale datasets and is highly competitive with non-cryptographic state-of-the-art read aligners in both accuracy and runtime performance on simulated and real read data. Moreover, our approach is significantly faster than state-of-the-art read aligners in long read mapping.


July 7, 2019  |  

Hyper-eccentric structural genes in the mitochondrial genome of the algal parasite Hemistasia phaeocysticola.

Diplonemid mitochondria are considered to have very eccentric structural genes. Coding regions of individual diplonemid mitochondrial genes are fragmented into small pieces and found on different circular DNAs. Short RNAs transcribed from each DNA molecule mature through a unique RNA maturation process involving assembly and three types of RNA editing (i.e., U insertion and A-to-I & C-to-U substitutions), although the molecular mechanism(s) of RNA maturation and the evolutionary history of these eccentric structural genes still remain to be understood. Since the gene fragmentation pattern is generally conserved among the diplonemid species studied to date, it was considered that their structural complexity has plateaued and further gene fragmentation could not occur. Here, we show the mitochondrial gene structure of Hemistasia phaeocysticola, which was recently identified as a member of a novel lineage in diplonemids, by comparison of the mitochondrial DNA sequences with cDNA sequences synthesized from mature mRNA. The genes of H. phaeocysticola are fragmented much more finely than those of other diplonemids studied to date. Furthermore, in addition to all known types of RNA editing, it is suggested that a novel processing step (i.e., secondary RNA insertion) is involved in the RNA maturation in the mitochondria of H. phaeocysticola Our findings demonstrate the tremendous plasticity of mitochondrial gene structures.© The Author(s) 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.


July 7, 2019  |  

Comparative genomics reveals Lysinibacillus sphaericus group comprises a novel species.

Early in the 1990s, it was recognized that Lysinibacillus sphaericus, one of the most popular and effective entomopathogenic bacteria, was a highly heterogeneous group. Many authors have even proposed it comprises more than one species, but the lack of phenotypic traits that guarantee an accurate differentiation has not allowed this issue to be clarified. Now that genomic technologies are rapidly advancing, it is possible to address the problem from a whole genome perspective, getting insights into the phylogeny, evolutive history and biology itself.The genome of the Colombian strain L. sphaericus OT4b.49 was sequenced, assembled and annotated, obtaining 3 chromosomal contigs and no evidence of plasmids. Using these sequences and the 13 other L. sphaericus genomes available on the NCBI database, we carried out comparative genomic analyses that included whole genome alignments, searching for mobile elements, phylogenomic metrics (TETRA, ANI and in-silico DDH) and pan-genome assessments. The results support the hypothesis about this species as a very heterogeneous group. The entomopathogenic lineage is actually a single and independent species with 3728 core genes and 2153 accessory genes, whereas each non-toxic strain seems to be a separate species, though without a clear circumscription. Toxin-encoding genes, binA, B and mtx1, 2, 3 could be acquired via horizontal gene transfer in a single evolutionary event. The non-toxic strain OT4b.31 is the most related with the type strain KCTC 3346.The current L. sphaericus is actually a sensu lato due to a sub-estimation of diversity accrued using traditional non-genomics based classification strategies. The toxic lineage is the most studied with regards to its larvicidal activity, which is a greatly conserved trait among these strains and thus, their differentiating feature. Further studies are needed in order to establish a univocal classification of the non-toxic strains that, according to our results, seem to be a paraphyletic group.


July 7, 2019  |  

Conservation genetics of an endangered grassland butterfly (Oarisma poweshiek) reveals historically high gene flow despite recent and rapid range loss

1. In poorly dispersing species gene flow can be facilitated when suitable habitat is widespread, allowing for increased dispersal between neighbouring locations. The Poweshiek skipperling [Oarisma poweshiek (Parker)], a federally endangered butterfly, has undergone a rapid, recent demographic decline following the loss of tallgrass prairie and fen habitats range wide. The loss of habitat, now restricted geographic range, and poor dispersal ability have left O. poweshiek at increased risk of extinction. 2. We studied the population genetics of six remaining populations of O. poweshiek in order to test the hypothesis that gene flow was historically high despite limited long-distance dispersal capability. Utilising nine microsatellite loci developed by PacBio sequencing, we tested for patterns of isolation by distance, low population genetic structure and alternative gene flow models. 3. Populations from southern Manitoba, Canada to the Lower Peninsula of Michigan, USA are only weakly genetically differentiated despite having low diversity. We found no support for isolation by distance, and Bayesian estimates of historical gene flow support our hypothesis that high levels of gene flow previously connected populations from Michigan to Wisconsin. 4. Prairie grasslands have been reduced tremendously over the past century, but the low mobility of O. poweshiek suggests that rapid loss of populations over the past decade cannot be simply explained by fragmentation of habitat. 5. As a species at high risk of extinction, understanding historical processes of gene flow will allow for informed management decisions with respect to head-starting individuals for population reintroductions and for conserving networks of habitat that will allow for high levels of gene flow.


July 7, 2019  |  

Probabilistic viral quasispecies assembly

Viruses are pathogens that cause infectious diseases. The swarm of virions is subject to the host’s immune pressure and possibly antiviral therapy. It may escape this selective pressure and gain selective advantage by acquiring one or more of the genomic alterations: single-nucleotide variants (SNVs), loss or gain of one or more amino acids, large deletions, for example, due to alternative splicing, or recombination of different strains. Genotypic antiretroviral drug resistance testing is performed via sequencing. Next-generation sequencing (NGS) technologies revolutionized assessing viral genetic diversity experimentally. In viral quasispecies analysis, there are two main goals: the identification of low-frequency variants and haplotype assembly on a whole-genome scale. PacBio performs single-molecule sequencing. This chapter elaborates human haplotyping and its relationship to probabilistic viral haplotype reconstruction methods. Viral quasispecies assembly has the potential to replace the current de facto diversity estimation by SNV calling. With advances in library preparation, increasing sensitivity of sequencing platforms, and more sophisticated models, it might be possible to detect all or most viral strains in a single individual.


July 7, 2019  |  

Capturing pairwise and multi-way chromosomal conformations using chromosomal walks.

Chromosomes are folded into highly compacted structures to accommodate physical constraints within nuclei and to regulate access to genomic information. Recently, global mapping of pairwise contacts showed that loops anchoring topological domains (TADs) are highly conserved between cell types and species. Whether pairwise loops synergize to form higher-order structures is still unclear. Here we develop a conformation capture assay to study higher-order organization using chromosomal walks (C-walks) that link multiple genomic loci together into proximity chains in human and mouse cells. This approach captures chromosomal structure at varying scales. Inter-chromosomal contacts constitute only 7-10% of the pairs and are restricted by interfacing TADs. About half of the C-walks stay within one chromosome, and almost half of those are restricted to intra-TAD spaces. C-walks that couple 2-4 TADs indicate stochastic associations between transcriptionally active, early replicating loci. Targeted analysis of thousands of 3-walks anchored at highly expressed genes support pairwise, rather than hub-like, chromosomal topology at active loci. Polycomb-repressed Hox domains are shown by the same approach to enrich for synergistic hubs. Together, the data indicate that chromosomal territories, TADs, and intra-TAD loops are primarily driven by nested, possibly dynamic, pairwise contacts.


July 7, 2019  |  

MICADo – Looking for mutations in targeted PacBio cancer data: an alignment-free method.

Targeted sequencing is commonly used in clinical application of NGS technology since it enables generation of sufficient sequencing depth in the targeted genes of interest and thus ensures the best possible downstream analysis. This notwithstanding, the accurate discovery and annotation of disease causing mutations remains a challenging problem even in such favorable context. The difficulty is particularly salient in the case of third generation sequencing technology, such as PacBio. We present MICADo, a de Bruijn graph based method, implemented in python, that makes possible to distinguish between patient specific mutations and other alterations for targeted sequencing of a cohort of patients. MICADo analyses NGS reads for each sample within the context of the data of the whole cohort in order to capture the differences between specificities of the sample with respect to the cohort. MICADo is particularly suitable for sequencing data from highly heterogeneous samples, especially when it involves high rates of non-uniform sequencing errors. It was validated on PacBio sequencing datasets from several cohorts of patients. The comparison with two widely used available tools, namely VarScan and GATK, shows that MICADo is more accurate, especially when true mutations have frequencies close to backgound noise. The source code is available at http://github.com/cbib/MICADo.


July 7, 2019  |  

Improve homology search sensitivity of PacBio data by correcting frameshifts.

Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than secondary generation sequencing technologies such as Illumina. The long read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and identify gene isoforms with higher accuracy in transcriptomic sequencing. However, PacBio data has high sequencing error rate and most of the errors are insertion or deletion errors. During alignment-based homology search, insertion or deletion errors in genes will cause frameshifts and may only lead to marginal alignment scores and short alignments. As a result, it is hard to distinguish true alignments from random alignments and the ambiguity will incur errors in structural and functional annotation. Existing frameshift correction tools are designed for data with much lower error rate and are not optimized for PacBio data. As an increasing number of groups are using SMRT, there is an urgent need for dedicated homology search tools for PacBio data.In this work, we introduce Frame-Pro, a profile homology search tool for PacBio reads. Our tool corrects sequencing errors and also outputs the profile alignments of the corrected sequences against characterized protein families. We applied our tool to both simulated and real PacBio data. The results showed that our method enables more sensitive homology search, especially for PacBio data sets of low sequencing coverage. In addition, we can correct more errors when comparing with a popular error correction tool that does not rely on hybrid sequencing.The source code is freely available at https://sourceforge.net/projects/frame-pro/yannisun@msu.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019  |  

TeloPCR-seq: a high-throughput sequencing approach for telomeres.

We have developed a high-throughput sequencing approach that enables us to determine terminal telomere sequences from tens of thousands of individual Schizosaccharomyces pombe telomeres. This method provides unprecedented coverage of telomeric sequence complexity in fission yeast. S. pombe telomeres are composed of modular degenerate repeats that can be explained by variation in usage of the TER1 RNA template during reverse transcription. Taking advantage of this deep sequencing approach, we find that ‘like’ repeat modules are highly correlated within individual telomeres. Moreover, repeat module preference varies with telomere length, suggesting that existing repeats promote the incorporation of like repeats and/or that specific conformations of the telomerase holoenzyme efficiently and/or processively add repeats of like nature. After the loss of telomerase activity, this sequencing and analysis pipeline defines a population of telomeres with altered sequence content. This approach will be adaptable to study telomeric repeats in other organisms and also to interrogate repetitive sequences throughout the genome that are inaccessible to other sequencing methods.© 2016 Federation of European Biochemical Societies.


July 7, 2019  |  

CoLoRMap: Correcting Long Reads by Mapping short reads.

Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormapehaghshe@sfu.ca or cedric.chauve@sfu.caSupplementary data are available at Bioinformatics online.© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019  |  

Efficient, cost-effective, high-throughput, Multilocus Sequencing Typing (MLST) method, NGMLST, and the analytical software program MLSTEZ.

Multilocus sequence typing (MLST) has become the preferred method for genotyping many biological species. It can be used to identify major phylogenetic clades, molecular groups, or subpopulations of a species, as well as individual strains or clones. However, conventional MLST is costly and time consuming, which limits its power for genotyping large numbers of samples. Here, we describe a new MLST method that uses next-generation sequencing, a multiplexing protocol, and appropriate analytical software to provide accurate, rapid, and economical MLST genotyping of 96 or more isolates in a single assay.


July 7, 2019  |  

Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated.We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1?kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.


July 7, 2019  |  

Microbial sequence typing in the genomic era.

Next-generation sequencing (NGS), also known as high-throughput sequencing, is changing the field of microbial genomics research. NGS allows for a more comprehensive analysis of the diversity, structure and composition of microbial genes and genomes compared to the traditional automated Sanger capillary sequencing at a lower cost. NGS strategies have expanded the versatility of standard and widely used typing approaches based on nucleotide variation in several hundred DNA sequences and a few gene fragments (MLST, MLVA, rMLST and cgMLST). NGS can now accommodate variation in thousands or millions of sequences from selected amplicons to full genomes (WGS, NGMLST and HiMLST). To extract signals from high-dimensional NGS data and make valid statistical inferences, novel analytic and statistical techniques are needed. In this review, we describe standard and new approaches for microbial sequence typing at gene and genome levels and guidelines for subsequent analysis, including methods and computational frameworks. We also present several applications of these approaches to some disciplines, namely genotyping, phylogenetics and molecular epidemiology. Copyright © 2017 Elsevier B.V. All rights reserved.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.