Menu
July 7, 2019

Completing the human genome: the progress and challenge of satellite DNA assembly.

Genomic studies rely on accurate chromosome assemblies to explore sequence-based models of cell biology, evolution and biomedical disease. However, even the extensively studied human genome has not yet reached a complete, ‘telomere-to-telomere’, chromosome assembly. The largest assembly gaps remain in centromeric regions and acrocentric short arms, sites known to contain megabase-sized arrays of tandem repeats, or satellite DNAs. This review aims to briefly address the progress and challenges of generating correct assemblies of satellite DNA arrays. Although the focus is placed on the human genome, many concepts presented here are applicable to other genomes.


July 7, 2019

Role of restriction-modification systems in prokaryotic evolution and ecology

Restriction–modification (R-M) systems are able to methylate or cleave DNA depending on methylation status of their recognition site. It allows them to protect bacterial cells from invasion by foreign DNA. Comparative analysis of a large number of available bacterial genomes and methylomes clearly demonstrates that the role of R-M systems in bacteria is wider than only defense. R-M systems maintain heterogeneity of a bacterial population and are involved in adaptation of bacteria to change in their environmental conditions. R-M systems can be essential for host colonization by pathogenic bacteria. Phase variation and intragenomic recombinations are sources of the fast evolution of the specificity of R-M systems. This review focuses on the influence of R-M systems on evolution and ecology of prokaryotes.


July 7, 2019

DNA N(6)-methyladenine: a new epigenetic mark in eukaryotes?

DNA N(6)-adenine methylation (N(6)-methyladenine; 6mA) in prokaryotes functions primarily in the host defence system. The prevalence and significance of this modification in eukaryotes had been unclear until recently. Here, we discuss recent publications documenting the presence of 6mA in Chlamydomonas reinhardtii, Drosophila melanogaster and Caenorhabditis elegans; consider possible roles for this DNA modification in regulating transcription, the activity of transposable elements and transgenerational epigenetic inheritance; and propose 6mA as a new epigenetic mark in eukaryotes.


July 7, 2019

De novo assembly of Dekkera bruxellensis: a multi technology approach using short and long-read sequencing and optical mapping.

It remains a challenge to perform de novo assembly using next-generation sequencing (NGS). Despite the availability of multiple sequencing technologies and tools (e.g., assemblers) it is still difficult to assemble new genomes at chromosome resolution (i.e., one sequence per chromosome). Obtaining high quality draft assemblies is extremely important in the case of yeast genomes to better characterise major events in their evolutionary history. The aim of this work is two-fold: on the one hand we want to show how combining different and somewhat complementary technologies is key to improving assembly quality and correctness, and on the other hand we present a de novo assembly pipeline we believe to be beneficial to core facility bioinformaticians. To demonstrate both the effectiveness of combining technologies and the simplicity of the pipeline, here we present the results obtained using the Dekkera bruxellensis genome.In this work we used short-read Illumina data and long-read PacBio data combined with the extreme long-range information from OpGen optical maps in the task of de novo genome assembly and finishing. Moreover, we developed NouGAT, a semi-automated pipeline for read-preprocessing, de novo assembly and assembly evaluation, which was instrumental for this work.We obtained a high quality draft assembly of a yeast genome, resolved on a chromosomal level. Furthermore, this assembly was corrected for mis-assembly errors as demonstrated by resolving a large collapsed repeat and by receiving higher scores by assembly evaluation tools. With the inclusion of PacBio data we were able to fill about 5 % of the optical mapped genome not covered by the Illumina data.


July 7, 2019

Single molecule sequencing of THCA synthase reveals copy number variation in modern drug-type Cannabis sativa L.

Cannabinoid expression is an important genetically determined feature of cannabis that presents clinical and legal implications for patients seeking cannabinoid specific therapies like Cannabidiol (CBD). Cannabinoid, terpenoid, and flavonoid marker assisted selection can accelerate breeding efforts by offering genetic tools to select for desired traits at an early stage in growth. To this end, multiple models for chemotype inheritance have been described suggesting a complex picture for chemical phenotype determination. Here we explore the potential role of copy number variation of THCA Synthase using phased single molecule sequencing and demonstrate that copy number and sequence variation of this gene is common and suggests a more nuanced view of chemotype prediction.


July 7, 2019

A fault-tolerant method for HLA typing with PacBio data.

Human leukocyte antigen (HLA) genes are critical genes involved in important biomedical aspects, including organ transplantation, autoimmune diseases and infectious diseases. The gene family contains the most polymorphic genes in humans and the difference between two alleles is only a single base pair substitution in many cases. The next generation sequencing (NGS) technologies could be used for high throughput HLA typing but in silico methods are still needed to correctly assign the alleles of a sample. Computer scientists have developed such methods for various NGS platforms, such as Illumina, Roche 454 and Ion Torrent, based on the characteristics of the reads they generate. However, the method for PacBio reads was less addressed, probably owing to its high error rates. The PacBio system has the longest read length among available NGS platforms, and therefore is the only platform capable of having exon 2 and exon 3 of HLA genes on the same read to unequivocally solve the ambiguity problem caused by the “phasing” issue.We proposed a new method BayesTyping1 to assign HLA alleles for PacBio circular consensus sequencing reads using Bayes’ theorem. The method was applied to simulated data of the three loci HLA-A, HLA-B and HLA-DRB1. The experimental results showed its capability to tolerate the disturbance of sequencing errors and external noise reads.The BayesTyping1 method could overcome the problems of HLA typing using PacBio reads, which mostly arise from sequencing errors of PacBio reads and the divergence of HLA genes, to some extent.


July 7, 2019

Dubowitz syndrome is a complex comprised of multiple, genetically distinct and phenotypically overlapping disorders.

Dubowitz syndrome is a rare disorder characterized by multiple congenital anomalies, cognitive delay, growth failure, an immune defect, and an increased risk of blood dyscrasia and malignancy. There is considerable phenotypic variability, suggesting genetic heterogeneity. We clinically characterized and performed exome sequencing and high-density array SNP genotyping on three individuals with Dubowitz syndrome, including a pair of previously-described siblings (Patients 1 and 2, brother and sister) and an unpublished patient (Patient 3). Given the siblings’ history of bone marrow abnormalities, we also evaluated telomere length and performed radiosensitivity assays. In the siblings, exome sequencing identified compound heterozygosity for a known rare nonsense substitution in the nuclear ligase gene LIG4 (rs104894419, NM_002312.3:c.2440C>T) that predicts p.Arg814X (MAF:0.0002) and an NM_002312.3:c.613delT variant that predicts a p.Ser205Leufs*29 frameshift. The frameshift mutation has not been reported in 1000 Genomes, ESP, or ClinSeq. These LIG4 mutations were previously reported in the sibling sister; her brother had not been previously tested. Western blotting showed an absence of a ligase IV band in both siblings. In the third patient, array SNP genotyping revealed a de novo ~ 3.89 Mb interstitial deletion at chromosome 17q24.2 (chr 17:62,068,463-65,963,102, hg18), which spanned the known Carney complex gene PRKAR1A. In all three patients, a median lymphocyte telomere length of = 1st centile was observed and radiosensitivity assays showed increased sensitivity to ionizing radiation. Our work suggests that, in addition to dyskeratosis congenita, LIG4 and 17q24.2 syndromes also feature shortened telomeres; to confirm this, telomere length testing should be considered in both disorders. Taken together, our work and other reports on Dubowitz syndrome, as currently recognized, suggest that it is not a unitary entity but instead a collection of phenotypically similar disorders. As a clinical entity, Dubowitz syndrome will need continual re-evaluation and re-definition as its constituent phenotypes are determined.


July 7, 2019

Pseudoautosomal region 1 length polymorphism in the human population.

The human sex chromosomes differ in sequence, except for the pseudoautosomal regions (PAR) at the terminus of the short and the long arms, denoted as PAR1 and PAR2. The boundary between PAR1 and the unique X and Y sequences was established during the divergence of the great apes. During a copy number variation screen, we noted a paternally inherited chromosome X duplication in 15 independent families. Subsequent genomic analysis demonstrated that an insertional translocation of X chromosomal sequence into theMa Y chromosome generates an extended PAR. The insertion is generated by non-allelic homologous recombination between a 548 bp LTR6B repeat within the Y chromosome PAR1 and a second LTR6B repeat located 105 kb from the PAR boundary on the X chromosome. The identification of the reciprocal deletion on the X chromosome in one family and the occurrence of the variant in different chromosome Y haplogroups demonstrate this is a recurrent genomic rearrangement in the human population. This finding represents a novel mechanism shaping sex chromosomal evolution.


July 7, 2019

The challenges and importance of structural variation detection in livestock.

Recent studies in humans and other model organisms have demonstrated that structural variants (SVs) comprise a substantial proportion of variation among individuals of each species. Many of these variants have been linked to debilitating diseases in humans, thereby cementing the importance of refining methods for their detection. Despite progress in the field, reliable detection of SVs still remains a problem even for human subjects. Many of the underlying problems that make SVs difficult to detect in humans are amplified in livestock species, whose lower quality genome assemblies and incomplete gene annotation can often give rise to false positive SV discoveries. Regardless of the challenges, SV detection is just as important for livestock researchers as it is for human researchers, given that several productive traits and diseases have been linked to copy number variations (CNVs) in cattle, sheep, and pig. Already, there is evidence that many beneficial SVs have been artificially selected in livestock such as a duplication of the agouti signaling protein gene that causes white coat color in sheep. In this review, we will list current SV and CNV discoveries in livestock and discuss the problems that hinder routine discovery and tracking of these polymorphisms. We will also discuss the impacts of selective breeding on CNV and SV frequencies and mention how SV genotyping could be used in the future to improve genetic selection.


July 7, 2019

The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line.

The HeLa cell line was established in 1951 from cervical cancer cells taken from a patient, Henrietta Lacks. This was the first successful attempt to immortalize human-derived cells in vitro. The robust growth and unrestricted distribution of HeLa cells resulted in its broad adoption–both intentionally and through widespread cross-contamination–and for the past 60?years it has served a role analogous to that of a model organism. The cumulative impact of the HeLa cell line on research is demonstrated by its occurrence in more than 74,000 PubMed abstracts (approximately 0.3%). The genomic architecture of HeLa remains largely unexplored beyond its karyotype, partly because like many cancers, its extensive aneuploidy renders such analyses challenging. We carried out haplotype-resolved whole-genome sequencing of the HeLa CCL-2 strain, examined point- and indel-mutation variations, mapped copy-number variations and loss of heterozygosity regions, and phased variants across full chromosome arms. We also investigated variation and copy-number profiles for HeLa S3 and eight additional strains. We find that HeLa is relatively stable in terms of point variation, with few new mutations accumulating after early passaging. Haplotype resolution facilitated reconstruction of an amplified, highly rearranged region of chromosome 8q24.21 at which integration of the human papilloma virus type 18 (HPV-18) genome occurred and that is likely to be the event that initiated tumorigenesis. We combined these maps with RNA-seq and ENCODE Project data sets to phase the HeLa epigenome. This revealed strong, haplotype-specific activation of the proto-oncogene MYC by the integrated HPV-18 genome approximately 500?kilobases upstream, and enabled global analyses of the relationship between gene dosage and expression. These data provide an extensively phased, high-quality reference genome for past and future experiments relying on HeLa, and demonstrate the value of haplotype resolution for characterizing cancer genomes and epigenomes.


July 7, 2019

Haplotype assembly in polyploid genomes and identical by descent shared tracts.

Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction.In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data.HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/.Supplementary data are available at Bioinformatics online.


July 7, 2019

Strobe sequence design for haplotype assembly.

Humans are diploid, carrying two copies of each chromosome, one from each parent. Separating the paternal and maternal chromosomes is an important component of genetic analyses such as determining genetic association, inferring evolutionary scenarios, computing recombination rates, and detecting cis-regulatory events. As the pair of chromosomes are mostly identical to each other, linking together of alleles at heterozygous sites is sufficient to phase, or separate the two chromosomes. In Haplotype Assembly, the linking is done by sequenced fragments that overlap two heterozygous sites. While there has been a lot of research on correcting errors to achieve accurate haplotypes via assembly, relatively little work has been done on designing sequencing experiments to get long haplotypes. Here, we describe the different design parameters that can be adjusted with next generation and upcoming sequencing technologies, and study the impact of design choice on the length of the haplotype.We show that a number of parameters influence haplotype length, with the most significant one being the advance length (distance between two fragments of a clone). Given technologies like strobe sequencing that allow for large variations in advance lengths, we design and implement a simulated annealing algorithm to sample a large space of distributions over advance-lengths. Extensive simulations on individual genomic sequences suggest that a non-trivial distribution over advance lengths results a 1-2 order of magnitude improvement in median haplotype length.Our results suggest that haplotyping of large, biologically important genomic regions is feasible with current technologies.


July 7, 2019

Next-generation polyploid phylogenetics: rapid resolution of hybrid polyploid complexes using PacBio single-molecule sequencing.

Difficulties in generating nuclear data for polyploids have impeded phylogenetic study of these groups. We describe a high-throughput protocol and an associated bioinformatics pipeline (Pipeline for Untangling Reticulate Complexes (Purc)) that is able to generate these data quickly and conveniently, and demonstrate its efficacy on accessions from the fern family Cystopteridaceae. We conclude with a demonstration of the downstream utility of these data by inferring a multi-labeled species tree for a subset of our accessions. We amplified four c. 1-kb-long nuclear loci and sequenced them in a parallel-tagged amplicon sequencing approach using the PacBio platform. Purc infers the final sequences from the raw reads via an iterative approach that corrects PCR and sequencing errors and removes PCR-mediated recombinant sequences (chimeras). We generated data for all gene copies (homeologs, paralogs, and segregating alleles) present in each of three sets of 50 mostly polyploid accessions, for four loci, in three PacBio runs (one run per set). From the raw sequencing reads, Purc was able to accurately infer the underlying sequences. This approach makes it easy and economical to study the phylogenetics of polyploids, and, in conjunction with recent analytical advances, facilitates investigation of broad patterns of polyploid evolution.© 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.


July 7, 2019

GenBank.

GenBank(®) (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for 370 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or the NCBI Submission Portal. GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Nucleotide database, which links to related information such as taxonomy, genomes, protein sequences and structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. Recent updates include changes to policies regarding sequence identifiers, an improved 16S submission wizard, targeted loci studies, the ability to submit methylation and BioNano mapping files, and a database of anti-microbial resistance genes. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.


July 7, 2019

A vast genomic deletion in the C56BL/6 genome affects different genes within the Ifi200 cluster on chromosome 1 and mediates obesity and insulin resistance.

Obesity, the excessive accumulation of body fat, is a highly heritable and genetically heterogeneous disorder. The complex, polygenic basis for the disease consisting of a network of different gene variants is still not completely known.In the current study we generated a BAC library of the obese-prone NZO strain to clarify the genomic alteration within the gene cluster Ifi200 on chr.1 including Ifi202b, an obesity gene that is in contrast to NZO not expressed in the lean B6 mouse. With the PacBio sequencing data of NZO BAC clones we identified a deletion spanning approximately 261.8 kb in the B6 reference genome. The deletion affects different members of the Ifi200 gene family which also includes the original first exon and 5′-regulatory parts of the Ifi202b gene and suggests to be the relevant cause of its expression deficiency in B6. In addition, the generation and characterization of congenic mice carrying the critical fragment on the B6 background demonstrate its crucial role for obesity and insulin resistance.Our data reveal the reconstruction of a complex genomic region on mouse chr.1 resulting from deletions and duplications of Ifi200 genes and suggest to be relevant for the development of obesity. The results further demonstrate the complexity of the disease and highlight the importance for studying rare genetic variants as they can be causal for large effects.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.