During a PAG 2020 workshop, Zev Kronenberg, a senior bioinformatics engineer at PacBio, describes how he used the PacBio Iso-Seq transcriptome sequencing and analysis method to annotate great ape genomes,…
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.
Satellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50?bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or FIuorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59?kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Evolution and global transmission of a multidrug-resistant, community-associated MRSA lineage from the Indian subcontinent
The evolution and global transmission of antimicrobial resistance has been well documented in Gram-negative bacteria and healthcare-associated epidemic pathogens, often emerging from regions with heavy antimicrobial use. However, the degree to which similar processes occur with Gram-positive bacteria in the community setting is less well understood. Here, we trace the recent origins and global spread of a multidrug resistant, community-associated Staphylococcus aureus lineage from the Indian subcontinent, the Bengal Bay clone (ST772). We generated whole genome sequence data of 340 isolates from 14 countries, including the first isolates from Bangladesh and India, to reconstruct the evolutionary history and genomic epidemiology of the lineage. Our data shows that the clone emerged on the Indian subcontinent in the early 1970s and disseminated rapidly in the 1990s. Short-term outbreaks in community and healthcare settings occurred following intercontinental transmission, typically associated with travel and family contacts on the subcontinent, but ongoing endemic transmission was uncommon. Acquisition of a multidrug resistance integrated plasmid was instrumental in the divergence of a single dominant and globally disseminated clade in the early 1990s. Phenotypic data on biofilm, growth and toxicity point to antimicrobial resistance as the driving force in the evolution of ST772. The Bengal Bay clone therefore combines the multidrug resistance of traditional healthcare-associated clones with the epidemiological transmission of community-associated MRSA. Our study demonstrates the importance of whole genome sequencing for tracking the evolution of emerging and resistant pathogens. It provides a critical framework for ongoing surveillance of the clone on the Indian subcontinent and elsewhere.Importance The Bengal Bay clone (ST772) is a community-acquired and multidrug-resistant Staphylococcus aureus lineage first isolated from Bangladesh and India in 2004. In this study, we show that the Bengal Bay clone emerged from a virulent progenitor circulating on the Indian subcontinent. Its subsequent global transmission was associated with travel or family contact in the region. ST772 progressively acquired specific resistance elements at limited cost to its fitness and continues to be exported globally resulting in small-scale community and healthcare outbreaks. The Bengal Bay clone therefore combines the virulence potential and epidemiology of community-associated clones with the multidrug-resistance of healthcare-associated S. aureus lineages. This study demonstrates the importance of whole genome sequencing for the surveillance of highly antibiotic resistant pathogens, which may emerge in the community setting of regions with poor antibiotic stewardship and rapidly spread into hospitals and communities across the world.
A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system
Background A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female spotted lanternfly (Lycorma delicatula) using a single Pacific Biosciences SMRT Cell. The spotted lanternfly is an invasive species recently discovered in the northeastern United States that threatens to damage economically important crop plants in the region. Results The DNA from 1 individual was used to make 1 standard, size-selected library with an average DNA fragment size of ~20 kb. The library was run on 1 Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing ~36× coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Furthermore, it was possible to segregate more than half of the diploid genome into the 2 separate haplotypes. The assembly also recovered 2 microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. Conclusions We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes
As they migrated out of Africa and into Europe and Asia, anatomically modern humans interbred with archaic hominins, such as Neanderthals and Denisovans. The result of this genetic introgression on the recipient populations has been of considerable interest, especially in cases of selection for specific archaic genetic variants. Hsieh et al. characterized adaptive structural variants and copy number variants that are likely targets of positive selection in Melanesians. Focusing on population-specific regions of the genome that carry duplicated genes and show an excess of amino acid replacements provides evidence for one of the mechanisms by which genetic novelty can arise and result in differentiation between human genomes.Science, this issue p. eaax2083INTRODUCTIONCharacterizing genetic variants underlying local adaptations in human populations is one of the central goals of evolutionary research. Most studies have focused on adaptive single-nucleotide variants that either arose as new beneficial mutations or were introduced after interbreeding with our now-extinct relatives, including Neanderthals and Denisovans. The adaptive role of copy number variants (CNVs), another well-known form of genomic variation generated through deletions or duplications that affect more base pairs in the genome, is less well understood, despite evidence that such mutations are subject to stronger selective pressures.RATIONALEThis study focuses on the discovery of introgressed and adaptive CNVs that have become enriched in specific human populations. We combine whole-genome CNV calling and population genetic inference methods to discover CNVs and then assess signals of selection after controlling for demographic history. We examine 266 publicly available modern human genomes from the Simons Genome Diversity Project and genomes of three ancient homininstextemdasha Denisovan, a Neanderthal from the Altai Mountains in Siberia, and a Neanderthal from Croatia. We apply long-read sequencing methods to sequence-resolve complex CNVs of interest specifically in the Melanesianstextemdashan Oceanian population distributed from Papua New Guinea to as far east as the islands of Fiji and known to harbor some of the greatest amounts of Neanderthal and Denisovan ancestry.RESULTSConsistent with the hypothesis of archaic introgression outside Africa, we find a significant excess of CNV sharing between modern non-African populations and archaic hominins (P = 0.039). Among Melanesians, we observe an enrichment of CNVs with potential signals of positive selection (n = 37 CNVs), of which 19 CNVs likely introgressed from archaic hominins. We show that Melanesian-stratified CNVs are significantly associated with signals of positive selection (P = 0.0323). Many map near or within genes associated with metabolism (e.g., ACOT1 and ACOT2), development and cell cycle or signaling (e.g., TNFRSF10D and CDK11A and CDK11B), or immune response (e.g., IFNLR1). We characterize two of the largest and most complex CNVs on chromosomes 16p11.2 and 8p21.3 that introgressed from Denisovans and Neanderthals, respectively, and are absent from most other human populations. At chromosome 16p11.2, we sequence-resolve a large duplication of >383 thousand base pairs (kbp) that originated from Denisovans and introgressed into the ancestral Melanesian population 60,000 to 170,000 years ago. This large duplication occurs at high frequency (>79%) in diverse Melanesian groups, shows signatures of positive selection, and maps adjacent to Homo sapienstextendashspecific duplications that predispose to rearrangements associated with autism. On chromosome 8p21.3, we identify a Melanesian haplotype that carries two CNVs, a ~6-kbp deletion, and a ~38-kbp duplication, with a Neanderthal origin and that introgressed into non-Africans 40,000 to 120,000 years ago. This CNV haplotype occurs at high frequency (44%) and shows signals consistent with a partial selective sweep in Melanesians. Using long-read sequencing genomic and transcriptomic data, we reconstruct the structure and complex evolutionary history for these two CNVs and discover previously undescribed duplicated genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that show an excess of amino acid replacements consistent with the action of positive selection.CONCLUSIONOur results suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation that is absent from current reference genomes.Large adaptive-introgressed CNVs at chromosomes 8p21.3 and 16p11.2 in Melanesians.The magnifying glasses highlight structural differences between the archaic (top) and reference (bottom) genomes. Neanderthal (red) and Denisovan (blue) haplotypes encompassing large CNVs occur at high frequencies in Melanesians (44 and 79%, respectively) but are absent (black) in all non-Melanesians. These CNVs create positively selected genes (TNFRSF10D1, TNFRSF10D2, and NPIPB16) that are absent from the reference genome.Copy number variants (CNVs) are subject to stronger selective pressure than single-nucleotide variants, but their roles in archaic introgression and adaptation have not been systematically investigated. We show that stratified CNVs are significantly associated with signatures of positive selection in Melanesians and provide evidence for adaptive introgression of large CNVs at chromosomes 16p11.2 and 8p21.3 from Denisovans and Neanderthals, respectively. Using long-read sequence data, we reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations. Our results collectively suggest that large CNVs originating in archaic hominins and introgressed into modern humans have played an important role in local population adaptation and represent an insufficiently studied source of large-scale genetic variation.
Recent studies suggest that closely related species can accumulate substantial genetic and phenotypic differences despite ongoing gene flow, thus challenging traditional ideas regarding the genetics of speciation. Baboons (genus Papio) are Old World monkeys consisting of six readily distinguishable species. Baboon species hybridize in the wild, and prior data imply a complex history of differentiation and introgression. We produced a reference genome assembly for the olive baboon (Papio anubis) and whole-genome sequence data for all six extant species. We document multiple episodes of admixture and introgression during the radiation of Papio baboons, thus demonstrating their value as a model of complex evolutionary divergence, hybridization, and reticulation. These results help inform our understanding of similar cases, including modern humans, Neanderthals, Denisovans, and other ancient hominins.
Newly emerged wheat blast disease is a serious threat to global wheat production. Wheat blast is caused by a distinct, exceptionally diverse lineage of the fungus causing rice blast disease. Through sequencing a recent field isolate, we report a reference genome that includes seven core chromosomes and mini-chromosome sequences that harbor effector genes normally found on ends of core chromosomes in other strains. No mini-chromosomes were observed in an early field strain, and at least two from another isolate each contain different effector genes and core chromosome end sequences. The mini-chromosome is enriched in transposons occurring most frequently at core chromosome ends. Additionally, transposons in mini-chromosomes lack the characteristic signature for inactivation by repeat-induced point (RIP) mutation genome defenses. Our results, collectively, indicate that dispensable mini-chromosomes and core chromosomes undergo divergent evolutionary trajectories, and mini-chromosomes and core chromosome ends are coupled as a mobile, fast-evolving effector compartment in the wheat pathogen genome.
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA ) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. Copyright © 2018 Elsevier Inc. All rights reserved.
The discovery of mutations associated with human genetic dis- ease is an exercise in comparative genomics (see Glossary). Although there are many different strategies and approaches, the central premise is that affected persons harbor a significant excess of pathogenic DNA variants as com- pared with a group of unaffected persons (controls) that is either clinically defined1 or established by surveying large swaths of the general population.2 The more exclu- sive the variant is to the disease, the greater its penetrance, the larger its effect size, and the more relevant it becomes to both disease diagnosis and future therapeutic investigation. The most popular approach used by researchers in human genetics is the case–control design, but there are others that can be used to track variants and disease in a family context or that consider the probability of different classes of mutations based on evolutionary patterns of divergence or de novo mutational change.3,4 Although the approaches may be straightforward, the discovery of patho- genic variation and its mechanism of action often is less trivial, and decades of research can be required in order to identify the variants underlying both mendelian and complex genetic traits.
We present reference-quality genome assembly and annotation for the stout camphor tree (Cinnamomum kanehirae (Laurales, Lauraceae)), the first sequenced member of the Magnoliidae comprising four orders (Laurales, Magnoliales, Canellales and Piperales) and over 9,000 species. Phylogenomic analysis of 13 representative seed plant genomes indicates that magnoliid and eudicot lineages share more recent common ancestry than monocots. Two whole-genome duplication events were inferred within the magnoliid lineage: one before divergence of Laurales and Magnoliales and the other within the Lauraceae. Small-scale segmental duplications and tandem duplications also contributed to innovation in the evolutionary history of Cinnamomum. For example, expansion of the terpenoid synthase gene subfamilies within the Laurales spawned the diversity of Cinnamomum monoterpenes and sesquiterpenes.
Newly designed 16S rRNA metabarcoding primers amplify diverse and novel archaeal taxa from the environment.
High-throughput studies of microbial communities suggest that Archaea are a widespread component of microbial diversity in various ecosystems. However, proper quantification of archaeal diversity and community ecology remains limited, as sequence coverage of Archaea is usually low owing to the inability of available prokaryotic primers to efficiently amplify archaeal compared to bacterial rRNA genes. To improve identification and quantification of Archaea, we designed and validated the utility of several primer pairs to efficiently amplify archaeal 16S rRNA genes based on up-to-date reference genes. We demonstrate that several of these primer pairs amplify phylogenetically diverse Archaea with high sequencing coverage, outperforming commonly used primers. Based on comparing the resulting long 16S rRNA gene fragments with public databases from all habitats, we found several novel family- to phylum-level archaeal taxa from topsoil and surface water. Our results suggest that archaeal diversity has been largely overlooked due to the limitations of available primers, and that improved primer pairs enable to estimate archaeal diversity more accurately. © 2018 The Authors. Environmental Microbiology Reports published by Society for Applied Microbiology and John Wiley & Sons Ltd.
Morella rubra, red bayberry, is an economically important fruit tree in south China. Here, we assembled the first high-quality genome for both a female and a male individual of red bayberry. The genome size was 313-Mb, and 90% sequences were assembled into eight pseudo chromosome molecules, with 32 493 predicted genes. By whole-genome comparison between the female and male and association analysis with sequences of bulked and individual DNA samples from female and male, a 59-Kb region determining female was identified and located on distal end of pseudochromosome 8, which contains abundant transposable element and seven putative genes, four of them are related to sex floral development. This 59-Kb female-specific region was likely to be derived from duplication and rearrangement of paralogous genes and retained non-recombinant in the female-specific region. Sex-specific molecular markers developed from candidate genes co-segregated with sex in a genetically diverse female and male germplasm. We propose sex determination follow the ZW model of female heterogamety. The genome sequence of red bayberry provides a valuable resource for plant sex chromosome evolution and also provides important insights for molecular biology, genetics and modern breeding in Myricaceae family. © 2018 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.