July 7, 2019  |  

Resolving multicopy duplications de novo using polyploid phasing

While the rise of single-molecule sequencing systems has greatly improved the ability to assemble complex regions of the genome, long segmental duplications still remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. On both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct fewer than one copy on average.
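The reconstruction accuracy mentioned in the abstract above (the fraction of reads clustered correctly) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `reconstruction_accuracy`, the label encoding, and the choice to score the best one-to-one relabeling of predicted clusters are all assumptions made for illustration.

```python
from itertools import permutations


def reconstruction_accuracy(true_labels, predicted_labels):
    """Fraction of reads assigned to the correct paralog.

    Cluster names are arbitrary, so the score is maximised over all
    one-to-one mappings from predicted clusters to true paralogs.
    Labels are assumed to be hashable, sortable, and never None.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    # Pad with None so surplus predicted clusters can map to "no paralog".
    padding = max(0, len(pred_ids) - len(true_ids))
    targets = true_ids + [None] * padding

    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(
            1 for t, p in zip(true_labels, predicted_labels) if mapping[p] == t
        )
        best = max(best, correct)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [1, 1, 0, 0, 0, 2]  # same clusters up to renaming, one read misplaced
    print(reconstruction_accuracy(truth, predicted))
```

The exhaustive search over relabelings is only practical for a handful of copies; for 10-copy datasets an assignment solver (for example a Hungarian-style algorithm) would be the more sensible choice.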


July 7, 2019  |  

Morphological and genetic analyses of the invasive forest pathogen Phytophthora austrocedri reveal two clonal lineages colonised Britain and Argentina from a common ancestral population.

Phytophthora austrocedri is causing widespread mortality of Austrocedrus chilensis in Argentina and Juniperus communis in Britain. The pathogen has also been isolated from J. horizontalis in Germany. Isolates from Britain, Argentina and Germany are homothallic with no clear differences in the dimensions of sporangia, oogonia or oospores. Argentinian and German isolates grew faster than British isolates across a range of media and had a higher temperature tolerance although most isolates regardless of origin grew best at 15°C and all isolates were killed at 25°C. Argentinian and British isolates caused lesions on both hosts when inoculated onto A. chilensis and J. communis; however the Argentinian isolate caused longer lesions on A. chilensis than on J. communis and vice versa for the British isolate. Genetic analyses of nuclear and mitochondrial loci showed that all British isolates are identical. Argentinian isolates and the German isolate are also identical but differ from the British isolates. Single nucleotide polymorphisms are shared between the British and Argentinian isolates. It is concluded that British isolates and Argentinian isolates conform to two distinct clonal lineages of P. austrocedri founded from the same as-yet unidentified source population. These lineages should be recognised and treated as separate risks by international plant health legislation.


July 7, 2019  |  

Archetype JC polyomavirus prevails in a rare case of JC polyomavirus nephropathy and in stable renal transplant recipients with JC polyomavirus viruria.

JC polyomavirus (JCPyV) is reactivated in approximately 20% of renal transplant recipients and it may rarely cause JCPyV-associated nephropathy (JCPyVAN). Whereas progressive multifocal leukoencephalopathy of the brain is caused by rearranged neurotropic JCPyV, little is known about viral sequence variation in JCPyVAN due to the rarity of this condition. Using single-molecule real-time sequencing, characterization of full-length JCPyV genomes from urine and plasma of one JCPyVAN patient and twenty stable renal transplant recipients with JCPyV viruria was attempted. Sequence analysis of JCPyV strains was performed with the emphasis on the NCCR region, the major capsid protein gene VP1 and the large T antigen (LTag) gene. Exclusively archetype strains were identified in urine of the JCPyVAN patient. Full-length JCPyV sequences were not retrieved from plasma. Archetype strains were found in urine of nineteen stable renal transplant recipients, with JCPyV quasispecies detected in five samples. In a patient with minor graft dysfunction, a strain with archetype-like NCCR region was discovered. Individual point mutations were detected in both VP1 and LTag genes. Archetype JCPyV was dominant in the JCPyVAN patient and in stable renal transplant recipients. Archetype rather than rearranged JCPyV seems to drive the pathogenesis of JCPyVAN.


July 7, 2019  |  

Dense and accurate whole-chromosome haplotyping of individual genomes.

The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.


July 7, 2019  |  

Estimating fitness of viral quasispecies from next-generation sequencing data.

The quasispecies model is ubiquitous in the study of viruses. While having led to a number of insights that have stood the test of time, the quasispecies model has mostly been discussed in a theoretical fashion with little support of data. With next-generation sequencing (NGS), this situation is changing and a wealth of data can now be produced in a time- and cost-efficient manner. NGS can, after removal of technical errors, yield an exceedingly detailed picture of the viral population structure. The widespread availability of cross-sectional data can be used to study fitness landscapes of viral populations in the quasispecies model. This chapter highlights methods that estimate the strength of selection in selective sweeps, assess marginal fitness effects of quasispecies, and finally infer the fitness landscape of a viral quasispecies, all on the basis of NGS data.


July 7, 2019  |  

HapCol: accurate and memory-efficient haplotype assembly from long reads.

Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly benefits greatly from the advent of ‘future-generation’ sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performance degrades as read length and sequencing coverage increase or because they are based on restrictive assumptions. By exploiting a feature of future-generation technologies (the uniform distribution of sequencing errors), we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, making it possible to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Contact: bonizzoni@disco.unimib.it. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019  |  

OxyR-dependent formation of DNA methylation patterns in OpvABOFF and OpvABON cell lineages of Salmonella enterica.

Phase variation of the Salmonella enterica opvAB operon generates a bacterial lineage with standard lipopolysaccharide structure (OpvAB(OFF)) and a lineage with shorter O-antigen chains (OpvAB(ON)). Regulation of OpvAB lineage formation is transcriptional, and is controlled by the LysR-type factor OxyR and by DNA adenine methylation. The opvAB regulatory region contains four sites for OxyR binding (OBSA-D), and four methylatable GATC motifs (GATC1-4). OpvAB(OFF) and OpvAB(ON) cell lineages display opposite DNA methylation patterns in the opvAB regulatory region: (i) in the OpvAB(OFF) state, GATC1 and GATC3 are non-methylated, whereas GATC2 and GATC4 are methylated; (ii) in the OpvAB(ON) state, GATC2 and GATC4 are non-methylated, whereas GATC1 and GATC3 are methylated. We provide evidence that such DNA methylation patterns are generated by OxyR binding. The higher stability of the OpvAB(OFF) lineage may be caused by binding of OxyR to sites that are identical to the consensus (OBSA and OBSC), while the sites bound by OxyR in OpvAB(ON) cells (OBSB and OBSD) are not. In support of this view, amelioration of either OBSB or OBSD locks the system in the ON state. We also show that the GATC-binding protein SeqA and the nucleoid protein HU are ancillary factors in opvAB control. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.


July 7, 2019  |  

Read-based phasing of related individuals.

Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information (reads and pedigree) has the potential to deliver results better than each individually. We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. Availability: https://bitbucket.org/whatshap/whatshap. Contact: t.marschall@mpi-inf.mpg.de. © The Author 2016. Published by Oxford University Press.


July 7, 2019  |  

Third-generation sequencing and the future of genomics

Third-generation long-range DNA sequencing and mapping technologies are creating a renaissance in high-quality genome sequencing. Unlike second-generation sequencing, which produces short reads a few hundred base-pairs long, third-generation single-molecule technologies generate over 10,000 bp reads or map over 100,000 bp molecules. We analyze how increased read lengths can be used to address long-standing problems in de novo genome assembly, structural variation analysis and haplotype phasing.


July 7, 2019  |  

Complete genome sequence of Bradyrhizobium sp. strain CCGE-LA001, isolated from field nodules of the enigmatic wild bean Phaseolus microcarpus.

We present the complete genome sequence of Bradyrhizobium sp. strain CCGE-LA001, a nitrogen-fixing bacterium isolated from nodules of Phaseolus microcarpus. Strain CCGE-LA001 represents the first sequenced bradyrhizobial strain obtained from a wild Phaseolus sp. Its genome revealed a large and novel symbiotic island. Copyright © 2016 Servín-Garcidueñas et al.


July 7, 2019  |  

Selecting reads for haplotype assembly

Haplotype assembly or read-based phasing is the problem of reconstructing both haplotypes of a diploid genome from next-generation sequencing data. This problem is formalized as the Minimum Error Correction (MEC) problem and can be solved using algorithms such as WhatsHap. The runtime of WhatsHap is exponential in the maximum coverage, which is hence controlled in a pre-processing step that selects reads to be used for phasing. Here, we report on a heuristic algorithm designed to choose beneficial reads for phasing, in particular to increase the connectivity of the phased blocks and the number of correctly phased variants compared to the random selection previously employed by WhatsHap. The algorithm we describe has been integrated into the WhatsHap software, which is available under MIT licence from https://bitbucket.org/whatshap/whatshap.
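To make the Minimum Error Correction (MEC) objective named in the abstract above concrete, here is a small Python sketch. It is not WhatsHap's algorithm or its read-selection heuristic: it scores a candidate haplotype under the all-heterozygous assumption and finds the optimum by brute force over all haplotypes, which is exponential in the number of variants and only feasible for toy inputs. The read encoding (`'0'`/`'1'` alleles, `'-'` for uncovered positions) and all function names are assumptions made for illustration.

```python
from itertools import product


def mismatches(read, hap):
    """Number of covered positions where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != "-" and r != h)


def mec_score(reads, hap):
    """MEC cost of a haplotype and its complement (all-heterozygous assumption).

    Each read is assigned to whichever of the two haplotypes it matches
    better; the cost is the total number of alleles that would need correcting.
    """
    comp = "".join("1" if a == "0" else "0" for a in hap)
    return sum(min(mismatches(r, hap), mismatches(r, comp)) for r in reads)


def brute_force_mec(reads):
    """Return (best_score, haplotype) by trying every haplotype (toy sizes only)."""
    n = len(reads[0])
    best = None
    for bits in product("01", repeat=n):
        if bits[0] == "1":
            continue  # a haplotype and its complement give the same score
        hap = "".join(bits)
        score = mec_score(reads, hap)
        if best is None or score < best[0]:
            best = (score, hap)
    return best


if __name__ == "__main__":
    # Five hypothetical reads over four heterozygous variants.
    reads = ["010-", "0101", "101-", "-010", "0111"]
    print(brute_force_mec(reads))
```

In this toy input the reads `0101` and `0111` cannot both be error-free, so at least one allele has to be corrected regardless of the chosen haplotype pair.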


July 7, 2019  |  

Bacterial genetics: SMRT-seq reveals an epigenetic switch.

Streptococcus pneumoniae uses genetic diversification as a strategy to achieve phenotypic plasticity. For example, DNA inversion of the hsdS genes of type I restriction-modification (R-M) systems determines whether S. pneumoniae forms opaque or transparent colonies, which have different colonization and virulence characteristics. Zhang and colleagues now use single-molecule, real-time sequencing (SMRT-seq) to show that the allelic variation of hsdS that results from site-specific recombination forms part of an epigenetic switch.


July 7, 2019  |  

WhatsHap: fast and accurate read-based phasing

Read-based phasing makes it possible to reconstruct the haplotype structure of a sample purely from sequencing reads. While phasing is a required step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, there has been a lack of accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, which further improves accuracy if multiple related samples are provided.


July 7, 2019  |  

Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.

