Birds are a group with immense availability of genomic resources, and hundreds of forthcoming genomes at the doorstep. We review recent developments in whole genome sequencing, phylogenomics, and comparative genomics of birds. Short read based genome assemblies are common, largely due to efforts of the Bird 10K genome project (B10K). Chromosome-level assemblies are expected to increase due to improved long-read sequencing. The available genomic data has enabled the reconstruction of the bird tree of life with increasing confidence and resolution, but challenges remain in the early splits of Neoaves due to their explosive diversification after the Cretaceous-Paleogene (K-Pg) event. Continued genomic sampling of the bird tree of life will not just better reflect their evolutionary history but also shine new light onto the organization of phylogenetic signal and conflict across the genome. The comparatively simple architecture of avian genomes makes them a powerful system to study the molecular foundation of bird specific traits. Birds are on the verge of becoming an extremely resourceful system to study biodiversity from the nucleotide up.
Spatholobus suberectus Dunn (S. suberectus), which belongs to the Leguminosae, is an important medicinal plant in China. Owing to its long growth cycle and increased use in human medicine, wild resources of S. suberectus have decreased rapidly and may be on the verge of extinction. De novo assembly of the whole S. suberectus genome provides us a critical potential resource towards biosynthesis of the main bioactive components and seed development regulation mechanism of this plant. Utilizing several sequencing technologies such as Illumina HiSeq X Ten, single-molecule real-time sequencing, 10x Genomics, as well as new assembly techniques such as FALCON and chromatin interaction mapping (Hi-C), we assembled a chromosome-scale genome about 798 Mb in size. In total, 748 Mb (93.73%) of the contig sequences were anchored onto nine chromosomes with the longest scaffold being 103.57 Mb. Further annotation analyses predicted 31,634 protein-coding genes, of which 93.9% have been functionally annotated. All data generated in this study is available in public databases.
Development of a metabolic pathway transfer and genomic integration system for the syngas-fermenting bacterium Clostridium ljungdahlii.
Clostridium spp. can synthesize valuable chemicals and fuels by utilizing diverse waste-stream substrates, including starchy biomass, lignocellulose, and industrial waste gases. However, metabolic engineering in Clostridium spp. is challenging due to the low efficiency of gene transfer and genomic integration of entire biosynthetic pathways. We have developed a reliable gene transfer and genomic integration system for the syngas-fermenting bacterium Clostridium ljungdahlii based on the conjugal transfer of donor plasmids containing large transgene cassettes (> 5 kb) followed by the inducible activation of Himar1 transposase to promote integration. We established a conjugation protocol for the efficient generation of transconjugants using the Gram-positive origins of replication repL and repH. We also investigated the impact of DNA methylation on conjugation efficiency by testing donor constructs with all possible combinations of Dam and Dcm methylation patterns, and used bisulfite conversion and PacBio sequencing to determine the DNA methylation profile of the C. ljungdahlii genome, resulting in the detection of four sequence motifs with N6-methyladenosine. As proof of concept, we demonstrated the transfer and genomic integration of a heterologous acetone biosynthesis pathway using a Himar1 transposase system regulated by a xylose-inducible promoter. The functionality of the integrated pathway was confirmed by detecting enzyme proteotypic peptides and the formation of acetone and isopropanol by C. ljungdahlii cultures utilizing syngas as a carbon and energy source. The developed multi-gene delivery system offers a versatile tool to integrate and stably express large biosynthetic pathways in the industrially promising syngas-fermenting microorganism C. ljungdahlii. The simple transfer and stable integration of large gene clusters (such as entire biosynthetic pathways) expands the range of possible fermentation products of recombinant strains expressing heterologous genes. We also believe that the developed gene delivery system can be adapted to other clostridial strains.
Adaptive Strategies in a Poly-Extreme Environment: Differentiation of Vegetative Cells in Serratia ureilytica and Resistance to Extreme Conditions.
Poly-extreme terrestrial habitats are often used as analogs to extra-terrestrial environments. Understanding the adaptive strategies allowing bacteria to thrive and survive under these conditions could help in our quest for extra-terrestrial planets suitable for life and in understanding how life evolved in the harsh early Earth conditions. A prime example of such a survival strategy is the modification of vegetative cells into resistant resting structures. These differentiated cells are often observed in response to harsh environmental conditions. The environmental strain (strain Lr5/4) belonging to Serratia ureilytica was isolated from a geothermal spring in Lirima, Atacama Desert, Chile. The Atacama Desert is the driest habitat on Earth and furthermore, due to its high altitude, it is exposed to an increased amount of UV radiation. The geothermal spring from which the strain was isolated is oligotrophic and the temperature of 54°C exceeds mesophilic conditions (15 to 45°C). Although the vegetative cells were tolerant to various environmental insults (desiccation, extreme pH, glycerol), a modified cell type was formed in response to nutrient deprivation, UV radiation and thermal shock. Scanning (SEM) and Transmission Electron Microscopy (TEM) analyses of vegetative cells and the modified cell structures were performed. In SEM, a change toward a circular shape with reduced size was observed. Under TEM, these circular cells possessed what appear to be extra coating layers. The resistance of the modified cells was also investigated; they were resistant to wet heat, UV radiation and desiccation, while vegetative cells did not withstand any of those conditions. A phylogenomic analysis was undertaken to investigate the presence of known genes involved in dormancy in other bacterial clades. Genes related to spore-formation in Myxococcus and Firmicutes were found in the S. ureilytica Lr5/4 genome; however, these genes were not enough for a full sporulation pathway that resembles either group. Although the molecular pathway of cell differentiation in S. ureilytica Lr5/4 is not fully defined, the identified genes may contribute to the modified phenotype in the Serratia genus. Here, we show that a modified cell structure can occur as a response to extremity in a species that was previously not known to deploy this strategy. This strategy may be widespread in bacteria, but only expressed under poly-extreme environmental conditions.
Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight.
The human genome contains “dark” gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions. Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer’s Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in disease cases but not in controls. While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer’s disease due to insufficient sample-size, we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
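As a rough illustration of the "dark by depth" idea described above, the following minimal sketch scans a per-base depth profile and reports the stretches with few reads, together with the fraction of the region they cover. The function names, the toy depth values, and the `max_depth=5` cutoff are all assumptions made for this example; the abstract does not state the threshold or the implementation the authors used.

```python
def low_depth_intervals(depths, max_depth=5):
    """Return half-open (start, end) intervals where depth <= max_depth.

    `max_depth` is a hypothetical cutoff chosen for illustration only.
    """
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth <= max_depth:
            if start is None:
                start = pos  # a low-depth run begins here
        elif start is not None:
            intervals.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        intervals.append((start, len(depths)))  # run extends to the end
    return intervals


def dark_fraction(depths, max_depth=5):
    """Fraction of positions that fall inside low-depth intervals."""
    if not depths:
        return 0.0
    dark = sum(end - start for start, end in low_depth_intervals(depths, max_depth))
    return dark / len(depths)


# Toy per-base read depths for a ten-base region (invented values).
toy_depths = [30, 28, 4, 0, 2, 31, 29, 5, 40, 3]
print(low_depth_intervals(toy_depths))
print(dark_fraction(toy_depths))
```

A statement such as "35.2% of gene bodies are at least 5% dark" would then correspond to comparing a per-gene fraction of this kind against 0.05. Camouflaged regions, which depend on alignment ambiguity rather than depth, are not covered by this sketch.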
Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The dataset was generated using the Pacific Biosciences Sequel System. The son and each parent were sequenced to an average coverage of 60 and 30, respectively, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both the GRCh37 and GRCh38 are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/). The GRCh38 aligned read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.
Characterization of an NDM-5 carbapenemase-producing Escherichia coli ST156 isolate from a poultry farm in Zhejiang, China.
The emergence of carbapenem-resistant Enterobacteriaceae strains has posed a severe threat to public health in recent years. The mobile elements carrying the New Delhi metallo-β-lactamase (NDM) gene have been regarded as the major mechanism leading to the rapid increase of carbapenem-resistant Enterobacteriaceae strains isolated from clinics and animals. We describe an NDM-5-producing Escherichia coli strain, ECCRA-119 (sequence type 156 [ST156]), isolated from a poultry farm in Zhejiang, China. ECCRA-119 is a multidrug-resistant (MDR) isolate that exhibited resistance to 27 antimicrobial compounds, including imipenem and meropenem, as detected by antimicrobial susceptibility testing (AST). The complete genome sequence of the ECCRA-119 isolate was also obtained using the PacBio RS II platform. Eleven acquired resistance genes were identified in the chromosome; four were detected in plasmid pTB201, while six were detected in plasmid pTB202. Importantly, the carbapenem resistance gene blaNDM-5 was detected in the IncX3 plasmid pTB203. In addition, seven virulence genes and one metal-resistance gene were also detected. The results of conjugation experiments and the identification of transfer regions indicated that the blaNDM-5-harboring plasmid pTB203 could be transferred between E. coli strains. The results reflect the severe bacterial resistance in a poultry farm in Zhejiang province and increase our understanding of the presence and transmission of the blaNDM-5 gene.
Geminiviruses cause damaging diseases in several important crop species. However, limited progress has been made in developing crop varieties resistant to these highly diverse DNA viruses. Recently, the bacterial CRISPR/Cas9 system has been transferred to plants to target and confer immunity to geminiviruses. In this study, we use CRISPR-Cas9 interference in the staple food crop cassava with the aim of engineering resistance to African cassava mosaic virus, a member of a widespread and important family (Geminiviridae) of plant-pathogenic DNA viruses. Our results show that the CRISPR system fails to confer effective resistance to the virus during glasshouse inoculations. Further, we find that between 33 and 48% of edited virus genomes evolve a conserved single-nucleotide mutation that confers resistance to CRISPR-Cas9 cleavage. We also find that in the model plant Nicotiana benthamiana the replication of the novel, mutant virus is dependent on the presence of the wild-type virus. Our study highlights the risks associated with CRISPR-Cas9 virus immunity in eukaryotes given that the mutagenic nature of the system generates viral escapes in a short time period. Our in-depth analysis of virus populations also represents a template for future studies analyzing virus escape from anti-viral CRISPR transgenics. This is especially important for informing regulation of such actively mutagenic applications of CRISPR-Cas9 technology in agriculture.
A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
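For readers unfamiliar with the contiguity metric quoted above, the following minimal sketch computes N50 under its usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name and the toy contig lengths are hypothetical and have no connection to the mosquito assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where half of the total is reached.
        if 2 * running >= total:
            return length


# Invented contig lengths in arbitrary units, totalling 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract reports it alongside a gene-completeness figure.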
Here we describe the ways in which the sequence and annotation of the Plasmodium falciparum reference genome have changed since its publication in 2002. As P. falciparum is the malaria species responsible for the most deaths worldwide, the richness of annotation and accuracy of the sequence are important resources for the P. falciparum research community as well as the basis for interpreting the genomes of subsequently sequenced species. At the time of publication in 2002 over 60% of predicted genes had unknown functions. As of March 2019, this number has decreased to 33%. The reduction is due to the inclusion of genes that were subsequently characterised experimentally and genes with significant similarity to others with known functions. In addition, the structural annotation of genes has been significantly refined; 27% of gene structures have been changed since 2002, comprising changes in exon-intron boundaries, addition or deletion of exons and the addition or deletion of genes. The sequence has also undergone significant improvements. In addition to the correction of a large number of single-base and insertion or deletion errors, a major misassembly between the subtelomeres of chromosomes 7 and 8 has been corrected. As the number of sequenced isolates continues to grow rapidly, a single reference genome will not be an adequate basis for interpreting intra-species sequence diversity. We therefore describe in this publication a population reference genome of P. falciparum, called Pfref1. This reference will enable the community to map to regions that are not present in the current assembly. P. falciparum 3D7 will continue to be maintained, with ongoing curation ensuring continual improvements in annotation quality.
Structural variation of centromeric endogenous retroviruses in human populations and their impact on cutaneous T-cell lymphoma, Sézary syndrome, and HIV infection.
Human Endogenous Retroviruses type K HML-2 (HK2) are integrated into 117 or more areas of human chromosomal arms, while two newly discovered HK2 proviruses, K111 and K222, which have spread extensively in pericentromeric regions, are the first retroviruses discovered in these areas of our genome. We use PCR and sequencing analysis to characterize pericentromeric K111 proviruses in DNA from individuals of diverse ethnicities and patients with different diseases. We found that the 5′ LTR-gag region of K111 proviruses is missing in certain individuals, creating pericentromeric instability. K111 deletion (-/-K111) is seen in about 15% of Caucasian, Asian, and Middle Eastern populations; it is missing in 2.36% of African individuals, suggesting that the -/-K111 genotype originated out of Africa. As we identified the -/-K111 genotype in Cutaneous T-cell lymphoma (CTCL) cell lines, we studied whether the -/-K111 genotype is associated with CTCL. We found a significant increase in the frequency of detection of the -/-K111 genotype in Caucasian patients with severe CTCL and/or Sézary syndrome (n = 35, 37.14%), compared to healthy controls (n = 160, 15.6%) [p = 0.011]. The -/-K111 genotype was also found to vary in HIV-1 infection. Although Caucasian healthy individuals have a similar frequency of detection of the -/-K111 genotype, Caucasian HIV Long-Term Non-Progressors (LTNPs) and/or elite controllers have significantly higher detection of the -/-K111 genotype (30.55%; n = 36) than patients who rapidly progress to AIDS (8.5%; n = 47) [p = 0.0097]. Our data indicate that pericentromeric instability is associated with more severe CTCL and/or Sézary syndrome in Caucasians, and appears to allow T-cells to survive lysis by HIV infection. These findings also provide new understanding of human evolution, as the -/-K111 genotype appears to have arisen out of Africa and is distributed unevenly throughout the world, possibly affecting the severity of HIV in different geographic areas.
Tandemly repeated DNA is highly mutable and causes at least 31 diseases, but it is hard to detect pathogenic repeat expansions genome-wide. Here, we report robust detection of human repeat expansions from careful alignments of long but error-prone (PacBio and nanopore) reads to a reference genome. Our method is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we prioritize pathogenic expansions within the top 10 out of 700,000 tandem repeats in whole genome sequencing data. This may help to elucidate the many genetic diseases whose causes remain unknown.
Complete genome sequence of the halophilic PHA-producing bacterium Halomonas sp. SF2003: insights into its biotechnological potential.
A halophilic Gram-negative eubacterium was isolated from the Iroise Sea and identified as an efficient producer of polyhydroxyalkanoates (PHA). The strain, designated SF2003, was found to belong to the Halomonas genus on the basis of 16S rRNA gene sequence similarity. Previous biochemical tests indicated that Halomonas sp. strain SF2003 is capable of tolerating various culture conditions that can be constraining for marine strains. This versatility could be of great interest for biotechnological applications. Therefore, complete bacterial genome sequencing and de novo assembly were performed using a PacBio RSII sequencer and Hierarchical Genome Assembly Process software in order to predict the metabolism of Halomonas sp. SF2003 and to identify genes involved in PHA production and stress tolerance. This study presents the complete genome sequence of Halomonas sp. SF2003, which contains a circular 4.36 Mbp chromosome, and places the strain in a phylogenetic tree. Genes related to PHA metabolism, carbohydrate metabolism, fatty acid metabolism and stress tolerance were identified, and a comparison was made with the metabolism of related species. Gene annotation highlighted the presence of typical genes involved in PHA biosynthesis, such as phaA, phaB and phaC, and enabled a preliminary analysis of their organization and characteristics. Several genes of carbohydrate and fatty acid metabolism were also identified, which provided helpful insights into the intricacies of PHA biosynthetic pathways and into production purposes. Results show the strong versatility of Halomonas sp. SF2003 in adapting to various temperatures and salinities, which can subsequently be exploited for industrial applications such as PHA production.
Genome plasticity favours double chromosomal Tn4401b-blaKPC-2 transposon insertion in the Pseudomonas aeruginosa ST235 clone.
Pseudomonas aeruginosa Sequence Type 235 is a clone that possesses an extraordinary ability to acquire mobile genetic elements and has been associated with the spread of resistance genes, including genes that encode for carbapenemases. Here, we aim to characterize the genetic platforms involved in resistance dissemination in blaKPC-2-positive P. aeruginosa ST235 in Colombia. In a prospective surveillance study of infections in adult patients attended in five ICUs in five distant cities in Colombia, 58 isolates of P. aeruginosa were recovered, of which 27 (46.6%) were resistant to carbapenems. The molecular analysis showed that 6 (22.2%) and 4 (14.8%) isolates harboured the blaVIM and blaKPC-2 genes, respectively. The four blaKPC-2-positive isolates showed a similar PFGE pulsotype and belonged to ST235. Complete genome sequencing of a representative ST235 isolate shows a unique chromosomal contig of 7097.241 bp with eight different resistance genes identified and five transposons: a Tn6162-like with ant(2'')-Ia, two Tn402-like with ant(3'')-Ia and blaOXA-2 and two Tn4401b with blaKPC-2. All transposons were inserted into the genomic islands. Interestingly, the two Tn4401b copies harbouring blaKPC-2 were adjacently inserted into a new genomic island (PAGI-17) with traces of a replicative transposition process. This double insertion was probably driven by several structural changes within the chromosomal region containing PAGI-17 in the ST235 background. This is the first report of a double Tn4401b chromosomal insertion in P. aeruginosa, just within a new genomic island (PAGI-17). This finding indicates once again the great genomic plasticity of this microorganism.
Comparative genomic and phylogenetic analyses of Populus section Leuce using complete chloroplast genome sequences
Species of Populus section Leuce are distributed throughout most parts of the Northern Hemisphere and have important economic and ecological significance. However, due to frequent hybridization within Leuce, the phylogenetic relationship between species has not been clarified. The chloroplast (cp) genome is characterized by maternal inheritance and relatively conservative mutation rates; thus, it is a powerful tool for building phylogenetic trees. In this study, we used the PacBio SEQUEL software to determine that the cp genome of Populus tomentosa has a length of 156,558 bp including a long single-copy region (84,717 bp), a small single-copy region (16,555 bp), and a pair of inverted repeat regions (27,643 bp). The cp genome contains 131 unique genes, including 37 transfer RNAs, 8 ribosomal RNAs, and 86 protein-coding genes. We compared the cp genomes of seven species of section Leuce and identified five cp DNA markers with > 1% variable sites. Phylogenetic analyses revealed two evolutionary branches for section Leuce. The species with the closest relationship with P. tomentosa was P. adenopoda, followed by P. alba. These cp genome data will help to determine the cp evolution of section Leuce and further elucidate the origin of P. tomentosa.
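The region lengths and gene counts quoted above can be cross-checked with simple arithmetic. The short sketch below assumes that the 27,643 bp figure is the length of each of the two inverted repeat copies, an interpretation that the totals support; the variable and function names are illustrative only.

```python
# Region lengths in bp, as reported in the abstract.
LONG_SINGLE_COPY = 84_717
SMALL_SINGLE_COPY = 16_555
INVERTED_REPEAT = 27_643  # assumed to be the length of EACH repeat copy
REPORTED_TOTAL = 156_558

# Gene counts, as reported in the abstract.
TRNA_GENES, RRNA_GENES, PROTEIN_CODING_GENES = 37, 8, 86
REPORTED_GENES = 131


def quadripartite_length(lsc, ssc, ir):
    """Total length of a genome with two single-copy regions and two IR copies."""
    return lsc + ssc + 2 * ir


total_length = quadripartite_length(LONG_SINGLE_COPY, SMALL_SINGLE_COPY, INVERTED_REPEAT)
total_genes = TRNA_GENES + RRNA_GENES + PROTEIN_CODING_GENES

print(total_length, total_length == REPORTED_TOTAL)
print(total_genes, total_genes == REPORTED_GENES)
```

Both sums match the reported totals, so the figures in the abstract are internally consistent.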