A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly-similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.
Long non-coding RNA identification: comparing machine learning based tools for long non-coding transcripts discrimination
Long noncoding RNA (lncRNA) is a kind of noncoding RNA with length more than 200 nucleotides, which aroused interest of people in recent years. Lots of studies have confirmed that human genome contains many thousands of lncRNAs which exert great influence over some critical regulators of cellular process. With the advent of high-throughput sequencing technologies, a great quantity of sequences is waiting for exploitation. Thus, many programs are developed to distinguish differences between coding and long noncoding transcripts. Different programs are generally designed to be utilised under different circumstances and it is sensible and practical to select an appropriate method according to a certain situation. In this review, several popular methods and their advantages, disadvantages, and application scopes are summarised to assist people in employing a suitable method and obtaining a more reliable result.
Emergence, retention and selection: A trilogy of origination for functional de novo proteins from ancestral lncRNAs in primates.
While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific de novo genes and report a mechanism for the origination of functional de novo proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly-originated de novo proteins is also not beyond anticipation under neutral expectation, as they generally have longer theoretical lifespan than their current age, due to their GC-rich sequence property enabling stable ORFs with lower chance of non-sense mutations. Interestingly, although the emergence and retention of these de novo genes are likely driven by neutral forces, population genetics study in 67 human individuals and 82 macaque animals revealed signatures of purifying selection on these genes specifically in human population, indicating a proportion of these newly-originated proteins are already functional in human. We thus propose a mechanism for creation of functional de novo proteins from ancestral lncRNAs during the primate evolution, which may contribute to human-specific genetic novelties by taking advantage of existed genomic contexts.
Little is known about the function of most long non-coding RNAs. But a suite of new tools might change that.
Long noncoding RNAs (lncRNAs) are a group of transcripts that are longer than 200 nucleotides (nt) without coding potential. Over the past decade, tens of thousands of novel lncRNAs have been annotated in animal and plant genomes because of advanced high-throughput RNA sequencing technologies and with the aid of coding transcript classifiers. Further, a considerable number of reports have revealed the existence of stable, functional small peptides (also known as micropeptides), translated from lncRNAs. In this review, we discuss the methods of lncRNA classification, the investigations regarding their coding potential and the functional significance of the peptides they encode.
The genomics era has brought along the completed sequencing of a large number of bird genomes that cover a broad range of the avian phylogenetic tree (>30 orders), leading to major novel insights into avian biology and evolution. Among recent findings, the discovery that birds lack a large number of protein coding genes that are organized in highly conserved syntenic clusters in other vertebrates is very intriguing, given the physiological importance of many of these genes. A considerable number of them play prominent endocrine roles, suggesting that birds evolved compensatory genetic or physiological mechanisms that allowed them to survive and thrive in spite of these losses. While further studies are needed to establish the exact extent of avian gene losses, these findings point to birds as potentially highly relevant model organisms for exploring the genetic basis and possible therapeutic approaches for a wide range of endocrine functions and disorders. Copyright © 2017 Elsevier Inc. All rights reserved.
The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMer4 one the most versatile genome alignment packages available.
Redkmer: An Assembly-Free Pipeline for the Identification of Abundant and Specific X-Chromosome Target Sequences for X-Shredding by CRISPR Endonucleases.
CRISPR-based synthetic sex ratio distorters, which operate by shredding the X-chromosome during male meiosis, are promising tools for the area-wide control of harmful insect pest or disease vector species. X-shredders have been proposed as tools to suppress insect populations by biasing the sex ratio of the wild population toward males, thus reducing its natural reproductive potential. However, to build synthetic X-shredders based on CRISPR, the selection of gRNA targets, in the form of high-copy sequence repeats on the X chromosome of a given species, is difficult, since such repeats are not accurately resolved in genome assemblies and cannot be assigned to chromosomes with confidence. We have therefore developed the redkmer computational pipeline, designed to identify short and highly abundant sequence elements occurring uniquely on the X chromosome. Redkmer was designed to use as input minimally processed whole genome sequence data from males and females. We tested redkmer with short- and long-read whole genome sequence data of Anopheles gambiae, the major vector of human malaria, in which the X-shredding paradigm was originally developed. Redkmer established long reads as chromosomal proxies with excellent correlation to the genome assembly and used them to rank X-candidate kmers for their level of X-specificity and abundance. Among these, a high-confidence set of 25-mers was identified, many belonging to previously known X-chromosome repeats of Anopheles gambiae, including the ribosomal gene array and the selfish elements harbored within it. Data from a control strain, in which these repeats are shared with the Y chromosome, confirmed the elimination of these kmers during filtering. Finally, we show that redkmer output can be linked directly to gRNA selection and off-target prediction. In addition, the output of redkmer, including the prediction of chromosomal origin of single-molecule long reads and chromosome specific kmers, could also be used for the characterization of other biologically relevant sex chromosome sequences, a task that is frequently hampered by the repetitiveness of sex chromosome sequence content.
Repetitive DNA plays a fundamental role in the organization, size and evolution of eukaryotic genomes. The sequencing of the turbot revealed a small and compact genome, as in all flatfish studied to date. The assembly of repetitive regions is still incomplete because it is difficult to correctly identify their position, number and array. The combination of classical cytogenetic techniques along with high quality sequencing is essential to increase the knowledge of the structure and composition of these sequences and, thus, of the structure and function of the whole genome. In this work, the in silico analysis of H1 histone, 5S rDNA, telomeric and Rex repetitive sequences, was compared to their chromosomal mapping by fluorescent in situ hybridization (FISH), providing a more comprehensive picture of these elements in the turbot genome. FISH assays confirmed the location of H1 in LG8; 5S rDNA in LG4 and LG6; telomeric sequences at the end of all chromosomes whereas Rex elements were dispersed along most chromosomes. The discrepancies found between both approaches could be related to the sequencing methodology applied in this species and also to the resolution limitations of the FISH technique. Turbot cytogenomic analyses have proven to add new chromosomal landmarks in the karyotype of this species, representing a powerful tool to investigate targeted genomic sequences or regions in the genetic and physical maps of this species. Copyright © 2017 Elsevier B.V. All rights reserved.
The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution.
The sea lamprey (Petromyzon marinus) serves as a comparative model for reconstructing vertebrate evolution. To enable more informed analyses, we developed a new assembly of the lamprey germline genome that integrates several complementary data sets. Analysis of this highly contiguous (chromosome-scale) assembly shows that both chromosomal and whole-genome duplications have played significant roles in the evolution of ancestral vertebrate and lamprey genomes, including chromosomes that carry the six lamprey HOX clusters. The assembly also contains several hundred genes that are reproducibly eliminated from somatic cells during early development in lamprey. Comparative analyses show that gnathostome (mouse) homologs of these genes are frequently marked by polycomb repressive complexes (PRCs) in embryonic stem cells, suggesting overlaps in the regulatory logic of somatic DNA elimination and bivalent states that are regulated by early embryonic PRCs. This new assembly will enhance diverse studies that are informed by lampreys’ unique biology and evolutionary/comparative perspective.
Conventional and single-molecule targeted sequencing method for specific variant detection in IKBKG while bypassing the IKBKGP1 pseudogene.
In addition to Sanger sequencing, next-generation sequencing of gene panels and exomes has emerged as a standard diagnostic tool in many laboratories. However, these captures can miss regions, have poor efficiency, or capture pseudogenes, which hamper proper diagnoses. One such example is the primary immunodeficiency-associated gene IKBKG. Its pseudogene IKBKGP1 makes traditional capture methods aspecific. We therefore developed a long-range PCR method to efficiently target IKBKG, as well as two associated genes (IRAK4 and MYD88), while bypassing the IKBKGP1 pseudogene. Sequencing accuracy was evaluated using both conventional short-read technology and a newer long-read, single-molecule sequencer. Different mapping and variant calling options were evaluated in their capability to bypass the pseudogene using both sequencing platforms. Based on these evaluations, we determined a robust diagnostic application for unambiguous sequencing and variant calling in IKBKG, IRAK4, and MYD88. This method allows rapid identification of selected primary immunodeficiency diseases in patients suffering from life-threatening invasive pyogenic bacterial infections. Copyright © 2018 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Humans are a diploid species that inherit one set of chromosomes paternally and one homologous set of chromosomes maternally. Unfortunately, most human sequencing initiatives ignore this fact in that they do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes individuals possess (i.e., they do not ‘phase’ the genome) often because of the costs and complexities of doing so. We compared 11 different widely-used approaches to phasing human genomes using the publicly available ‘Genome-In-A-Bottle’ (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing as well as purely computational approaches that seek to reconstruct phase information from general sequencing reads and constructs or population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid or combined approach that leverages: 1. population-based phasing using the SHAPEIT software suite, 2. either genome-wide sequencing read data or parental genotypes, and 3. a large reference panel of variant and haplotype frequencies, provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. We found that for population-based approaches, phasing performance is enhanced with the addition of genome-wide read data; e.g., whole genome shotgun and/or RNA sequencing reads. Further, we found that the inclusion of parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors. We also considered a majority voting scheme for the construction of a consensus haplotype combining multiple predictions for enhanced performance and site coverage. Finally, we also identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or SNV density.
Eukaryotic genomes are replete with repeated sequences in the form of transposable elements (TEs) dispersed across the genome or as satellite arrays, large stretches of tandemly repeated sequences. Many satellites clearly originated as TEs, but it is unclear how mobile genetic parasites can transform into megabase-sized tandem arrays. Comprehensive population genomic sampling is needed to determine the frequency and generative mechanisms of tandem TEs, at all stages from their initial formation to their subsequent expansion and maintenance as satellites. The best available population resources, short-read DNA sequences, are often considered to be of limited utility for analyzing repetitive DNA due to the challenge of mapping individual repeats to unique genomic locations. Here we develop a new pipeline called ConTExt that demonstrates that paired-end Illumina data can be successfully leveraged to identify a wide range of structural variation within repetitive sequence, including tandem elements. By analyzing 85 genomes from five populations of Drosophila melanogaster, we discover that TEs commonly form tandem dimers. Our results further suggest that insertion site preference is the major mechanism by which dimers arise and that, consequently, dimers form rapidly during periods of active transposition. This abundance of TE dimers has the potential to provide source material for future expansion into satellite arrays, and we discover one such copy number expansion of the DNA transposon hobo to approximately 16 tandem copies in a single line. The very process that defines TEs-transposition-thus regularly generates sequences from which new satellites can arise.© 2018 McGurk and Barbash; Published by Cold Spring Harbor Laboratory Press.
Convergent adaptation provides unique insights into the predictability of evolution and ultimately into processes of biological diversification. Supergenes (beneficial gene linkage) are striking examples of adaptation, but little is known about their prevalence or evolution. A recent study on anther-smut fungi documented supergene formation by rearrangements linking two key mating-type loci, controlling pre- and post-mating compatibility. Here further high-quality genome assemblies reveal four additional independent cases of chromosomal rearrangements leading to regions of suppressed recombination linking these mating-type loci in closely related species. Such convergent transitions in genomic architecture of mating-type determination indicate strong selection favoring linkage of mating-type loci into cosegregating supergenes. We find independent evolutionary strata (stepwise recombination suppression) in several species, with extensive rearrangements, gene losses, and transposable element accumulation. We thus show remarkable convergence in mating-type chromosome evolution, recurrent supergene formation, and repeated evolution of similar phenotypes through different genomic changes.
Size and content of the sex-determining region of the Y chromosome in dioecious Mercurialis annua, a plant with homomorphic sex chromosomes.
Dioecious plants vary in whether their sex chromosomes are heteromorphic or homomorphic, but even homomorphic sex chromosomes may show divergence between homologues in the non-recombining, sex-determining region (SDR). Very little is known about the SDR of these species, which might represent particularly early stages of sex-chromosome evolution. Here, we assess the size and content of the SDR of the diploid dioecious herb Mercurialis annua, a species with homomorphic sex chromosomes and mild Y-chromosome degeneration. We used RNA sequencing (RNAseq) to identify new Y-linked markers for M. annua. Twelve of 24 transcripts showing male-specific expression in a previous experiment could be amplified by polymerase chain reaction (PCR) only from males, and are thus likely to be Y-linked. Analysis of genome-capture data from multiple populations of M. annua pointed to an additional six male-limited (and thus Y-linked) sequences. We used these markers to identify and sequence 17 sex-linked bacterial artificial chromosomes (BACs), which form 11 groups of non-overlapping sequences, covering a total sequence length of about 1.5 Mb. Content analysis of this region suggests that it is enriched for repeats, has low gene density, and contains few candidate sex-determining genes. The BACs map to a subset of the sex-linked region of the genetic map, which we estimate to be at least 14.5 Mb. This is substantially larger than estimates for other dioecious plants with homomorphic sex chromosomes, both in absolute terms and relative to their genome sizes. Our data provide a rare, high-resolution view of the homomorphic Y chromosome of a dioecious plant.