Fast and effective variant calling algorithms have been crucial to the successful application of DNA sequencing in human genetics. In particular, joint calling – in which reads from multiple individuals are pooled to increase power for shared variants – is an important tool for population surveys of variation. Joint calling was applied by the 1000 Genomes Project to identify variants across many individuals each sequenced to low coverage (about 5-fold). This approach successfully found common small variants, but broadly missed structural variants and large indels for which short-read sequencing has limited sensitivity. To support use of large variants in rare disease and common trait association studies, it is necessary to perform population-scale surveys with a technology effective at detecting indels and structural variants, such as PacBio SMRT Sequencing. For these studies, it is important to have a joint calling workflow that works with PacBio reads. We have developed pbsv, an indel and structural variant caller for PacBio reads, that provides a two-step joint calling workflow similar to that used to build the ExAC database. The first stage, discovery, is performed separately for each sample and consolidates whole genome alignments into a sparse representation of potentially variant loci. The second stage, calling, is performed on all samples together and considers only the signatures identified in the discovery stage. We applied the pbsv joint calling workflow to PacBio reads from twenty human genomes, with coverage ranging from 5-fold to 80-fold per sample for a total of 460-fold. The analysis required only 102 CPU hours, and identified over 800,000 indels and structural variants, including hundreds of inversions and translocations, many times more than discovered with short-read sequencing. The workflow is scalable to thousands of samples. The ongoing application of this workflow to thousands of samples will provide insight into the evolution and functional importance of large variants in human evolution and disease.
Satellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50?bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or FIuorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59?kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease support that these regions present a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.
Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes
As they migrated out of Africa and into Europe and Asia, anatomically modern humans interbred with archaic hominins, such as Neanderthals and Denisovans. The result of this genetic introgression on the recipient populations has been of considerable interest, especially in cases of selection for specific archaic genetic variants. Hsieh et al. characterized adaptive structural variants and copy number variants that are likely targets of positive selection in Melanesians. Focusing on population-specific regions of the genome that carry duplicated genes and show an excess of amino acid replacements provides evidence for one of the mechanisms by which genetic novelty can arise and result in differentiation between human genomes.Science, this issue p. eaax2083INTRODUCTIONCharacterizing genetic variants underlying local adaptations in human populations is one of the central goals of evolutionary research. Most studies have focused on adaptive single-nucleotide variants that either arose as new beneficial mutations or were introduced after interbreeding with our now-extinct relatives, including Neanderthals and Denisovans. The adaptive role of copy number variants (CNVs), another well-known form of genomic variation generated through deletions or duplications that affect more base pairs in the genome, is less well understood, despite evidence that such mutations are subject to stronger selective pressures.RATIONALEThis study focuses on the discovery of introgressed and adaptive CNVs that have become enriched in specific human populations. We combine whole-genome CNV calling and population genetic inference methods to discover CNVs and then assess signals of selection after controlling for demographic history. We examine 266 publicly available modern human genomes from the Simons Genome Diversity Project and genomes of three ancient homininstextemdasha Denisovan, a Neanderthal from the Altai Mountains in Siberia, and a Neanderthal from Croatia. We apply long-read sequencing methods to sequence-resolve complex CNVs of interest specifically in the Melanesianstextemdashan Oceanian population distributed from Papua New Guinea to as far east as the islands of Fiji and known to harbor some of the greatest amounts of Neanderthal and Denisovan ancestry.RESULTSConsistent with the hypothesis of archaic introgression outside Africa, we find a significant excess of CNV sharing between modern non-African populations and archaic hominins (P = 0.039). Among Melanesians, we observe an enrichment of CNVs with potential signals of positive selection (n = 37 CNVs), of which 19 CNVs likely introgressed from archaic hominins. We show that Melanesian-stratified CNVs are significantly associated with signals of positive selection (P = 0.0323). Many map near or within genes associated with metabolism (e.g., ACOT1 and ACOT2), development and cell cycle or signaling (e.g., TNFRSF10D and CDK11A and CDK11B), or immune response (e.g., IFNLR1). We characterize two of the largest and most complex CNVs on chromosomes 16p11.2 and 8p21.3 that introgressed from Denisovans and Neanderthals, respectively, and are absent from most other human populations. At chromosome 16p11.2, we sequence-resolve a large duplication of >383 thousand base pairs (kbp) that originated from Denisovans and introgressed into the ancestral Melanesian population 60,000 to 170,000 years ago. This large duplication occurs at high frequency (>79%) in diverse Melanesian groups, shows signatures of positive selection, and maps adjacent to Homo sapienstextendashspecific duplications that predispose to rearrangements associated with autism. On chromosome 8p21.3, we identify a Melanesian haplotype that carries two CNVs, a ~6-kbp deletion, and a ~38-kbp duplication, with a Neanderthal origin and that introgressed into non-Africans 40,000 to 120,000 years ago. This CNV haplotype occurs at high frequency (44%) and shows signals consistent with a partial selective sweep in Melanesians. Using long-read sequencing genomic and transcriptomic data, we reconstruct the structure and complex evolutionary history for these two CNVs and discover previously undescribed duplicated genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that show an excess of amino acid replacements consistent with the action of positive selection.CONCLUSIONOur results suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation that is absent from current reference genomes.Large adaptive-introgressed CNVs at chromosomes 8p21.3 and 16p11.2 in Melanesians.The magnifying glasses highlight structural differences between the archaic (top) and reference (bottom) genomes. Neanderthal (red) and Denisovan (blue) haplotypes encompassing large CNVs occur at high frequencies in Melanesians (44 and 79%, respectively) but are absent (black) in all non-Melanesians. These CNVs create positively selected genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that are absent from the reference genome.Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
Recent studies suggest that closely related species can accumulate substantial genetic and phenotypic differences despite ongoing gene flow, thus challenging traditional ideas regarding the genetics of speciation. Baboons (genus Papio) are Old World monkeys consisting of six readily distinguishable species. Baboon species hybridize in the wild, and prior data imply a complex history of differentiation and introgression. We produced a reference genome assembly for the olive baboon (Papio anubis) and whole-genome sequence data for all six extant species. We document multiple episodes of admixture and introgression during the radiation of Papio baboons, thus demonstrating their value as a model of complex evolutionary divergence, hybridization, and reticulation. These results help inform our understanding of similar cases, including modern humans, Neanderthals, Denisovans, and other ancient hominins.
Crop domestication changed the course of human evolution, and domestication of maize (Zea mays L. subspecies mays), today the world’s most important crop, enabled civilizations to flourish and has played a major role in shaping the world we know today. Archaeological and ethnobotanical research help us understand the development of the cultures and the movements of the peoples who carried maize to new areas where it continued to adapt. Ancient remains of maize cobs and kernels have been found in the place of domestication, the Balsas River Valley (~9,000 years before present era), and the cultivation center, the Tehuacan Valley (~5,000 years before present era), and have been used to study the process of domestication. Paleogenomic data showed that some of the genes controlling the stem and inflorescence architecture were comparable to modern maize, while other genes controlling ear shattering and starch biosynthesis retain high levels of variability, similar to those found in the wild relative teosinte. These results indicate that the domestication process was both gradual and complex, where different genetic loci were selected at different points in time, and that the transformation of teosinte to maize was completed in the last 5,000 years. Mesoamerican native cultures domesticated teosinte and developed maize from a 6 cm long, popping-kernel ear to what we now recognize as modern maize with its wide variety in ear size, kernel texture, color, size, and adequacy for diverse uses and also invented nixtamalization, a process key to maximizing its nutrition. Used directly for human and animal consumption, processed food products, bioenergy, and many cultural applications, it is now grown on six of the world’s seven continents. The study of its evolution and domestication from the wild grass teosinte helps us understand the nature of genetic diversity of maize and its wild relatives and gene expression. Genetic barriers to direct use of teosinte or Tripsacum in maize breeding have challenged our ability to identify valuable genes and traits, let alone incorporate them into elite, modern varieties. Genomic information and newer genetic technologies will facilitate the use of wild relatives in crop improvement; hence it is more important than ever to ensure their conservation and availability, fundamental to future food security. In situ conservation efforts dedicated to preserving remnant populations of wild relatives in Mexico are key to safeguarding the genetic diversity of maize and its genepool, as well as enabling these species to continue to adapt to dynamic climate and environmental changes. Genebank ex situ efforts are crucial to securely maintain collected wild relative resources and to provide them for gene discovery and other research efforts.
Structural variation of centromeric endogenous retroviruses in human populations and their impact on cutaneous T-cell lymphoma, Sézary syndrome, and HIV infection.
Human Endogenous Retroviruses type K HML-2 (HK2) are integrated into 117 or more areas of human chromosomal arms while two newly discovered HK2 proviruses, K111 and K222, spread extensively in pericentromeric regions, are the first retroviruses discovered in these areas of our genome.We use PCR and sequencing analysis to characterize pericentromeric K111 proviruses in DNA from individuals of diverse ethnicities and patients with different diseases.We found that the 5′ LTR-gag region of K111 proviruses is missing in certain individuals, creating pericentromeric instability. K111 deletion (-/- K111) is seen in about 15% of Caucasian, Asian, and Middle Eastern populations; it is missing in 2.36% of African individuals, suggesting that the -/- K111 genotype originated out of Africa. As we identified the -/-K111 genotype in Cutaneous T-cell lymphoma (CTCL) cell lines, we studied whether the -/-K111 genotype is associated with CTCL. We found a significant increase in the frequency of detection of the -/-K111 genotype in Caucasian patients with severe CTCL and/or Sézary syndrome (n?=?35, 37.14%), compared to healthy controls (n?=?160, 15.6%) [p?=?0.011]. The -/-K111 genotype was also found to vary in HIV-1 infection. Although Caucasian healthy individuals have a similar frequency of detection of the -/- K111 genotype, Caucasian HIV Long-Term Non-Progressors (LTNPs) and/or elite controllers, have significantly higher detection of the -/-K111 genotype (30.55%; n?=?36) than patients who rapidly progress to AIDS (8.5%; n?=?47) [p?=?0.0097].Our data indicate that pericentromeric instability is associated with more severe CTCL and/or Sézary syndrome in Caucasians, and appears to allow T-cells to survive lysis by HIV infection. These findings also provide new understanding of human evolution, as the -/-K111 genotype appears to have arisen out of Africa and is distributed unevenly throughout the world, possibly affecting the severity of HIV in different geographic areas.
PacBio RS II is the first commercialized third-generation DNA sequencer able to sequence a single molecule DNA in real-time without amplification. PacBio RS II’s sequencing technology is novel and unique, enabling the direct observation of DNA synthesis by DNA polymerase. PacBio RS II confers four major advantages compared to other sequencing technologies: long read lengths, high consensus accuracy, a low degree of bias, and simultaneous capability of epigenetic characterization. These advantages surmount the obstacle of sequencing genomic regions such as high/low G+C, tandem repeat, and interspersed repeat regions. Moreover, PacBio RS II is ideal for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetics characterization. With PacBio RS II, we have sequenced and analyzed the genomes of many species, from viruses to humans. Herein, we summarize and review some of our key genome sequencing projects, including full-length viral sequencing, complete bacterial genome and almost-complete plant genome assemblies, and long amplicon sequencing of a disease-associated gene region. We believe that PacBio RS II is not only an effective tool for use in the basic biological sciences but also in the medical/clinical setting.
Genetic studies of human evolution require high-quality contiguous ape genome assemblies that are not guided by the human reference. We coupled long-read sequence assembly and full-length complementary DNA sequencing with a multiplatform scaffolding approach to produce ab initio chimpanzee and orangutan genome assemblies. By comparing these with two long-read de novo human genome assemblies and a gorilla genome assembly, we characterized lineage-specific and shared great ape genetic variation ranging from single- to mega-base pair-sized variants. We identified ~17,000 fixed human-specific structural variants identifying genic and putative regulatory changes that have emerged in humans since divergence from nonhuman apes. Interestingly, these variants are enriched near genes that are down-regulated in human compared to chimpanzee cerebral organoids, particularly in cells analogous to radial glial neural progenitors. Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.
Recent RNA-seq technology revealed thousands of splicing events that are under rapid evolution in primates, whereas the reliability of these events, as well as their combination on the isoform level, have not been adequately addressed due to its limited sequencing length. Here, we performed comparative transcriptome analyses in human and rhesus macaque cerebellum using single molecule long-read sequencing (Iso-seq) and matched RNA-seq. Besides 359 million RNA-seq reads, 4,165,527 Iso-seq reads were generated with a mean length of 14,875?bp, covering 11,466 human genes, and 10,159 macaque genes. With Iso-seq data, we substantially expanded the repertoire of alternative RNA processing events in primates, and found that intron retention and alternative polyadenylation are surprisingly more prevalent in primates than previously estimated. We then investigated the combinatorial mode of these alternative events at the whole-transcript level, and found that the combination of these events is largely independent along the transcript, leading to thousands of novel isoforms missed by current annotations. Notably, these novel isoforms are selectively constrained in general, and 1,119 isoforms have even higher expression than the previously annotated major isoforms in human, indicating that the complexity of the human transcriptome is still significantly underestimated. Comparative transcriptome analysis further revealed 502 genes encoding selectively constrained, lineage-specific isoforms in human but not in rhesus macaque, linking them to some lineage-specific functions. Overall, we propose that the independent combination of alternative RNA processing events has contributed to complex isoform evolution in primates, which provides a new foundation for the study of phenotypic difference among primates.© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
In comparison to humans and chimpanzees, gorillas show low diversity at MHC class I genes (Gogo), as reflected by an overall reduced level of allelic variation as well as the absence of a functionally important sequence motif that interacts with killer cell immunoglobulin-like receptors (KIR). Here, we use recently generated large-scale genomic sequence data for a reassessment of allelic diversity at Gogo-C, the gorilla orthologue of HLA-C. Through the combination of long-range amplifications and long-read sequencing technology, we obtained, among the 35 gorillas reanalyzed, three novel full-length genomic sequences including a coding region sequence that has not been previously described. The newly identified Gogo-C*03:01 allele has a divergent recombinant structure that sets it apart from other Gogo-C alleles. Domain-by-domain phylogenetic analysis shows that Gogo-C*03:01 has segments in common with Gogo-B*07, the additional B-like gene that is present on some gorilla MHC haplotypes. Identified in ~ 50% of the gorillas analyzed, the Gogo-C*03:01 allele exclusively encodes the C1 epitope among Gogo-C allotypes, indicating its important function in controlling natural killer cell (NK cell) responses via KIR. We further explored the hypothesis whether gorillas experienced a selective sweep which may have resulted in a general reduction of the gorilla MHC class I repertoire. Our results provide little support for a selective sweep but rather suggest that the overall low Gogo class I diversity can be best explained by drastic demographic changes gorillas experienced in the ancient and recent past.
Bats harbor many viruses asymptomatically, including several notorious for causing extreme virulence in humans. To identify differences between antiviral mechanisms in humans and bats, we sequenced, assembled, and analyzed the genome of Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus. We found an expanded and diversified KLRC/KLRD family of natural killer cell receptors, MHC class I genes, and type I interferons, which dramatically differ from their functional counterparts in other mammals. Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense. An evaluation of the theoretical function of these genes suggests that an inhibitory immune state may exist in bats. Based on our findings, we hypothesize that tolerance of viral infection, rather than enhanced potency of antiviral defenses, may be a key mechanism by which bats asymptomatically host viruses that are pathogenic in humans. Copyright © 2018 Elsevier Inc. All rights reserved.
Plasmodium falciparum, the most virulent agent of human malaria, shares a recent common ancestor with the gorilla parasite Plasmodium praefalciparum. Little is known about the other gorilla- and chimpanzee-infecting species in the same (Laverania) subgenus as P. falciparum, but none of them are capable of establishing repeated infection and transmission in humans. To elucidate underlying mechanisms and the evolutionary history of this subgenus, we have generated multiple genomes from all known Laverania species. The completeness of our dataset allows us to conclude that interspecific gene transfers, as well as convergent evolution, were important in the evolution of these species. Striking copy number and structural variations were observed within gene families and one, stevor, shows a host-specific sequence pattern. The complete genome sequence of the closest ancestor of P. falciparum enables us to estimate the timing of the beginning of speciation to be 40,000-60,000 years ago followed by a population bottleneck around 4,000-6,000 years ago. Our data allow us also to search in detail for the features of P. falciparum that made it the only member of the Laverania able to infect and spread in humans.
Genomic insights into host adaptation between the wheat stripe rust pathogen (Puccinia striiformis f. sp. tritici) and the barley stripe rust pathogen (Puccinia striiformis f. sp. hordei).
Plant fungal pathogens can rapidly evolve and adapt to new environmental conditions in response to sudden changes of host populations in agro-ecosystems. However, the genomic basis of their host adaptation, especially at the forma specialis level, remains unclear.We sequenced two isolates each representing Puccinia striiformis f. sp. tritici (Pst) and P. striiformis f. sp. hordei (Psh), different formae speciales of the stripe rust fungus P. striiformis highly adapted to wheat and barley, respectively. The divergence of Pst and Psh, estimated to start 8.12 million years ago, has been driven by high nucleotide mutation rates. The high genomic variation within dikaryotic urediniospores of P. striiformis has provided raw genetic materials for genome evolution. No specific gene families have enriched in either isolate, but extensive gene loss events have occurred in both Pst and Psh after the divergence from their most recent common ancestor. A large number of isolate-specific genes were identified, with unique genomic features compared to the conserved genes, including 1) significantly shorter in length; 2) significantly less expressed; 3) significantly closer to transposable elements; and 4) redundant in pathways. The presence of specific genes in one isolate (or forma specialis) was resulted from the loss of the homologues in the other isolate (or forma specialis) by the replacements of transposable elements or losses of genomic fragments. In addition, different patterns and numbers of telomeric repeats were observed between the isolates.Host adaptation of P. striiformis at the forma specialis level is a complex pathogenic trait, involving not only virulence-related genes but also other genes. Gene loss, which might be adaptive and driven by transposable element activities, provides genomic basis for host adaptation of different formae speciales of P. striiformis.