Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole-genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. To make the finishing of draft genomes easier, we present here an automated approach that uses long reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” coordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the sooty mangabey. With 24× mapped coverage of PacBio long reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long reads, reads addressed 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long reads, we addressed 97% of gaps, closing 66% of addressed gaps and improving 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
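As a rough illustration of the bookkeeping behind gap statistics like those above, the Python sketch below locates gaps in a scaffold sequence and tallies gap outcomes. It assumes gaps are encoded as runs of N, which is the usual convention in draft assemblies but is not stated in the abstract. It reports fractions under both denominators used above (all gaps versus addressed gaps). The function names and outcome labels are hypothetical and are not part of PBJelly.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def find_gaps(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) coordinates of each run of N/n."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


def summarize_outcomes(outcomes: List[str]) -> Dict[str, float]:
    """Tally per-gap outcomes.

    Each outcome is one of the hypothetical labels "closed", "improved",
    "addressed" (reads reached the gap but did not change it), or
    "unaddressed" (no reads reached the gap).
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    addressed = total - counts["unaddressed"]
    if total == 0 or addressed == 0:
        raise ValueError("need at least one addressed gap")
    return {
        "addressed_of_all": addressed / total,
        "closed_of_all": counts["closed"] / total,
        "closed_of_addressed": counts["closed"] / addressed,
        "improved_of_addressed": counts["improved"] / addressed,
    }


if __name__ == "__main__":
    print(find_gaps("ACGTNNNNACGTnnAC"))
    print(summarize_outcomes(["closed", "closed", "improved", "unaddressed"]))
```

Reporting both denominators makes it explicit whether a percentage such as "closed" refers to all gaps or only to gaps that reads addressed.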
Next-generation sequencing has become the most widely used sequencing technology in genomics research, but it has inherent drawbacks when dealing with high-GC-content genomes. Recently, single-molecule real-time (SMRT) sequencing technology was introduced as a third-generation sequencing strategy to compensate for this drawback. Here, we report that the unbiased and longer reads of SMRT sequencing markedly improved the assembly of genomes with high GC content through gap filling and repeat resolution.
Third-generation single-molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 base pairs, and can achieve reads as long as 50,000 base pairs. Here we evaluate the limits of single-molecule sequencing by assessing the impact of long-read sequencing on the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
Resistance determinants and mobile genetic elements of an NDM-1-encoding Klebsiella pneumoniae strain.
Multidrug-resistant Enterobacteriaceae are emerging as a serious infectious disease challenge. These strains can accumulate many antibiotic resistance genes through horizontal transfer of genetic elements, those for β-lactamases being of particular concern. Some β-lactamases are active on a broad spectrum of β-lactams including the last-resort carbapenems. The gene for the broad-spectrum and carbapenem-active metallo-β-lactamase NDM-1 is rapidly spreading. We present the complete genome of Klebsiella pneumoniae ATCC BAA-2146, the first U.S. isolate found to encode NDM-1, and describe its repertoire of antibiotic-resistance genes and mutations, including genes for eight β-lactamases and 15 additional antibiotic-resistance enzymes. To elucidate the evolution of this rich repertoire, the mobile elements of the genome were characterized, including four plasmids with varying degrees of conservation and mosaicism and eleven chromosomal genomic islands. One island was identified by a novel phylogenomic approach, which further indicated the cps-lps polysaccharide synthesis locus, where operon translocation and fusion were noted. Unique plasmid segments and mosaic junctions were identified. Plasmid-borne blaCTX-M-15 was transposed recently to the chromosome by ISEcp1. None of the eleven full copies of IS26, the most frequent IS element in the genome, had the expected 8-bp direct repeat of the integration target sequence, suggesting that each copy underwent homologous recombination subsequent to its last transposition event. Comparative analysis likewise indicates IS26 as a frequent recombinational junction between plasmid ancestors, and also indicates a resolvase site. In one novel use of high-throughput sequencing, homologously recombinant subpopulations of the bacterial culture were detected. In a second novel use, circular transposition intermediates were detected for the novel insertion sequence ISKpn21 of the ISNCY family, suggesting that it uses the two-step transposition mechanism of IS3. Robust genome-based phylogeny showed that a unified Klebsiella cluster contains Enterobacter aerogenes and Raoultella, suggesting the latter genus should be abandoned.
Fungi grow within their food, externally digesting it and absorbing nutrients across a semirigid chitinous cell wall. Members of the new phylum Cryptomycota were proposed to represent intermediate fungal forms, lacking a chitinous cell wall during feeding and known almost exclusively from ubiquitous environmental ribosomal RNA sequences that cluster at the base of the fungal tree [1, 2]. Here, we sequence the first Cryptomycotan genome (the water mold endoparasite Rozella allomycis) and unite the Cryptomycota with another group of endoparasites, the microsporidia, based on phylogenomics and shared genomic traits. We propose that Cryptomycota and microsporidia share a common endoparasitic ancestor, with the clade unified by a chitinous cell wall used to develop turgor pressure in the infection process [3, 4]. Shared genomic elements include a nucleotide transporter that is used by microsporidia for stealing energy in the form of ATP from their hosts. Rozella harbors a mitochondrion that contains a very rapidly evolving genome and lacks complex I of the respiratory chain. These degenerate features are offset by the presence of nuclear genes for alternative respiratory pathways. The Rozella proteome has not undergone major contraction like microsporidia; instead, several classes have undergone expansion, such as host-effector, signal-transduction, and folding proteins. Copyright © 2013 Elsevier Ltd. All rights reserved.
The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development.
Programmed DNA rearrangements in the single-celled eukaryote Oxytricha trifallax completely rewire its germline into a somatic nucleus during development. This elaborate, RNA-mediated pathway eliminates noncoding DNA sequences that interrupt gene loci and reorganizes the remaining fragments by inversions and permutations to produce functional genes. Here, we report the Oxytricha germline genome and compare it to the somatic genome to present a global view of its massive scale of genome rearrangements. The remarkably encrypted genome architecture contains >3,500 scrambled genes, as well as >800 predicted germline-limited genes expressed, and some posttranslationally modified, during genome rearrangements. Gene segments for different somatic loci often interweave with each other. Single gene segments can contribute to multiple, distinct somatic loci. Terminal precursor segments from neighboring somatic loci map extremely close to each other, often overlapping. This genome assembly provides a draft of a scrambled genome and a powerful model for studies of genome rearrangement. Copyright © 2014 Elsevier Inc. All rights reserved.
Single molecule sequencing and genome assembly of a clinical specimen of Loa loa, the causative agent of loiasis.
More than 20% of the world’s population is at risk for infection by filarial nematodes and >180 million people worldwide are already infected. Along with infection comes significant morbidity that has a socioeconomic impact. The eight filarial nematodes that infect humans are Wuchereria bancrofti, Brugia malayi, Brugia timori, Onchocerca volvulus, Loa loa, Mansonella perstans, Mansonella streptocerca, and Mansonella ozzardi, of which three have published draft genome sequences. Since all have humans as the definitive host, standard avenues of research that rely on culturing and genetics have often not been possible. Therefore, genome sequencing provides an important window into understanding the biology of these parasites. The need for large amounts of high-quality genomic DNA from homozygous, inbred lines; the availability of only short sequence reads from next-generation sequencing platforms at a reasonable expense; and the lack of random large-insert libraries have limited our ability to generate high-quality genome sequences for these parasites. However, the Pacific Biosciences single-molecule, real-time sequencing platform holds great promise in reducing input amounts and generating sufficiently long sequences that bypass the need for large-insert paired libraries. Here, we report on efforts to generate a more complete genome assembly for L. loa using genetically heterogeneous DNA isolated from a single clinical sample and sequenced on the Pacific Biosciences platform. To obtain the best assembly, numerous assemblers and sequencing datasets were analyzed, combined, and compared. Quiver-informed trimming of an assembly of only Pacific Biosciences reads by HGAP2 was selected as the final assembly of 96.4 Mbp in 2,250 contigs. This results in ~9% more of the genome in ~85% fewer contigs from ~80% less starting material at a fraction of the cost of previous Roche 454-based sequencing efforts. The result is the most complete filarial nematode assembly produced thus far and demonstrates the utility of single-molecule sequencing on the Pacific Biosciences platform for genetically heterogeneous metazoan genomes.
Genome structure variation, including copy number variation and presence/absence variation, comprises a large extent of maize genetic diversity; however, its effect on phenotypes remains largely unexplored. Here, we describe how copy number variation underlies a rare allele that contributes to maize aluminum (Al) tolerance. Al toxicity is the primary limitation for crop production on acid soils, which make up 50% of the world’s potentially arable lands. In a recombinant inbred line mapping population, copy number variation of the Al tolerance gene multidrug and toxic compound extrusion 1 (MATE1) is the basis for the quantitative trait locus of largest effect on phenotypic variation. This expansion in MATE1 copy number is associated with higher MATE1 expression, which in turn results in superior Al tolerance. The three MATE1 copies are identical and are part of a tandem triplication. Only three maize inbred lines carrying the three-copy allele were identified from maize and teosinte diversity panels, indicating that copy number variation for MATE1 is a rare, and quite likely recent, event. These maize lines with higher MATE1 copy number are also Al-tolerant, have high MATE1 expression, and originate from regions of highly acidic soils. Our findings show a role for copy number variation in the adaptation of maize to acidic soils in the tropics and suggest that genome structural changes may be a rapid evolutionary response to new environments.
Whole genome complete resequencing of Bacillus subtilis natto by combining long reads with high-quality short reads.
De novo microbial genome sequencing reached a turning point with third-generation sequencing (TGS) platforms, and several microbial genomes have been improved by TGS long reads. Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168 and is used in the production of the traditional Japanese fermented food “natto.” The B. subtilis natto BEST195 genome was previously sequenced with short reads, but it included some incomplete regions. We resequenced the BEST195 genome using a PacBio RS sequencer and obtained a complete genome sequence as a single scaffold without any gaps; we also applied Illumina MiSeq short reads to enhance quality. Comparing the new sequence with the previous BEST195 draft genome and the Marburg 168 genome, we found that incomplete regions in the previous genome sequence were attributable to GC bias and repetitive sequences, and we also identified some novel genes that are found only in the new genome.
Complete genome sequence of Sporisorium scitamineum and biotrophic interaction transcriptome with sugarcane.
Sporisorium scitamineum is a biotrophic fungus responsible for sugarcane smut, a disease spread worldwide. This study provides the complete sequence of individual chromosomes of S. scitamineum from telomere to telomere, achieved by a combination of PacBio long-read and Illumina short-read sequence data, as well as a draft sequence of a second fungal strain. Comparative analysis with previously available sequences of another strain detected few polymorphisms among the three genomes. The novel complete sequence described herein allowed us to identify and annotate extended subtelomeric regions, repetitive elements and the mitochondrial DNA sequence. The genome comprises 19,979,571 bases, 6,677 protein-coding genes, 111 tRNAs and 3 assembled copies of rDNA, out of an estimated 130 copies. Chromosomal reorganizations were detected in comparison with sequences of S. reilianum, the closest smut relative, potentially influenced by repeats of transposable elements. Repetitive elements may also have directed the linkage of the two mating-type loci. Fungal transcriptome profiling in vitro and during interaction with sugarcane at two time points (early infection and whip emergence) revealed that 13.5% of the genes were differentially expressed in planta and particular to each developmental stage. Among them are plant cell wall degrading enzymes, proteases, lipases, chitin modification and lignin degradation enzymes, sugar transporters and transcription factors. The fungus also modulates transcription of genes related to surviving reactive oxygen species and other toxic metabolites produced by the plant. Previously described effectors in smut/plant interactions were detected, and some new candidates are proposed. Ten genomic islands harboring some of the candidate genes unique to S. scitamineum were expressed only in planta. RNA-seq data were also used to support gene predictions.
Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short-read data for de novo genome assembly. We set out to improve the draft genome assembly of the African cichlid fish, Metriaclima zebra, using a set of Pacific Biosciences SMRT sequencing reads corresponding to 16.5× coverage of the genome. Here we characterize the improvements that these long reads allowed us to make to the state-of-the-art draft genome previously assembled from short-read data. Our new assembly closed 68% of the existing gaps and added 90.6 Mbp of new non-gap sequence to the existing draft assembly of M. zebra. Comparison of the new assembly to the sequence of several bacterial artificial chromosome clones confirmed the accuracy of the new assembly. The closure of sequence gaps revealed thousands of new exons, allowing significant improvement in gene models. We corrected one known misassembly, and identified and fixed other likely misassemblies. 63.5 Mbp (70%) of the new sequence was classified as repetitive and the new sequence allowed for the assembly of many more transposable elements. Our improvements to the M. zebra draft genome suggest that a reasonable investment in long reads could greatly improve many comparable vertebrate draft genome assemblies.
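The repeat figures quoted above can be checked with a line of arithmetic. The Python sketch below uses only the two numbers from the abstract (90.6 Mbp of new sequence, 63.5 Mbp classified as repetitive). The function name is hypothetical, chosen for illustration.

```python
def fraction_of_total(part_mbp: float, total_mbp: float) -> float:
    """Return part/total as a fraction; both arguments in the same units."""
    if total_mbp <= 0:
        raise ValueError("total must be positive")
    return part_mbp / total_mbp


if __name__ == "__main__":
    # Values as reported in the abstract.
    repeat_fraction = fraction_of_total(63.5, 90.6)
    print(f"{repeat_fraction:.1%}")  # prints 70.1%
```

The result rounds to the 70% reported in the abstract.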
Chemosynthetic symbiosis is one of the successful systems for adapting to a wide range of habitats including extreme environments, and the metabolic capabilities of symbionts enable host organisms to expand their habitat ranges. However, our understanding of the adaptive strategies that enable symbiotic organisms to expand their habitats is still fragmentary. Here, we report that a single-ribotype endosymbiont population in an individual of the host vent mussel, Bathymodiolus septemdierum, has heterogeneous genomes with regard to the composition of key metabolic gene clusters for hydrogen oxidation and nitrate reduction. The host individual harbours heterogeneous symbiont subpopulations that either possess or lack the gene clusters encoding hydrogenase or nitrate reductase. The proportions of the different symbiont subpopulations in a host appeared to vary with the environment or with the host’s development. Furthermore, the symbiont subpopulations were distributed in patches to form a mosaic pattern in the gill. Genomic heterogeneity in an endosymbiont population may enable differential utilization of diverse substrates and confer metabolic flexibility. Our findings open a new chapter in our understanding of how symbiotic organisms alter their metabolic capabilities and expand their range of habitats.
Pineapple (Ananas comosus (L.) Merr.) is the most economically valuable crop possessing crassulacean acid metabolism (CAM), a photosynthetic carbon assimilation pathway with high water-use efficiency, and the second most important tropical fruit. We sequenced the genomes of pineapple varieties F153 and MD2 and a wild pineapple relative, Ananas bracteatus accession CB5. The pineapple genome has one fewer ancient whole-genome duplication event than sequenced grass genomes and a conserved karyotype with seven chromosomes from before the ρ duplication event. The pineapple lineage has transitioned from C3 photosynthesis to CAM, with CAM-related genes exhibiting a diel expression pattern in photosynthetic tissues. CAM pathway genes were enriched with cis-regulatory elements associated with the regulation of circadian clock genes, providing the first cis-regulatory link between CAM and circadian clock regulation. Pineapple CAM photosynthesis evolved by the reconfiguration of pathways in C3 plants, through the regulatory neofunctionalization of preexisting genes and not through the acquisition of neofunctionalized genes via whole-genome or tandem gene duplication.
Three strikingly different alternative male mating morphs (aggressive ‘independents’, semicooperative ‘satellites’ and female-mimic ‘faeders’) coexist as a balanced polymorphism in the ruff, Philomachus pugnax, a lek-breeding wading bird. Major differences in body size, ornamentation, and aggressive and mating behaviors are inherited as an autosomal polymorphism. We show that development into satellites and faeders is determined by a supergene consisting of divergent alternative, dominant and non-recombining haplotypes of an inversion on chromosome 11, which contains 125 predicted genes. Independents are homozygous for the ancestral sequence. One breakpoint of the inversion disrupts the essential CENP-N gene (encoding centromere protein N), and pedigree analysis confirms the lethality of homozygosity for the inversion. We describe new differences in behavior, testis size and steroid metabolism among morphs and identify polymorphic genes within the inversion that are likely to contribute to the differences among morphs in reproductive traits.
The genome sequence assembly of the highly heterozygous Ananas comosus and its varieties is an impressive technical achievement. The sequence opens the door to a greater understanding of pineapple morphology and evolution.