As the throughput of PacBio systems continues to increase, so does the desire to fully utilize SMRT Cell sequencing capacity to multiplex microbes for whole genome sequencing. Multiplexing is readily achieved by incorporating a unique barcode for each microbe into the SMRTbell adapters and using a streamlined library preparation process. Incorporating barcodes without PCR amplification prevents the loss of epigenetic information and the generation of chimeric sequences, while eliminating the need to generate separate SMRTbell libraries. We multiplexed the genomes of up to 8 unique strains of H. pylori. Each genome was sheared and processed through adapter ligation in a single, addition-only reaction. The barcoded samples were pooled in equimolar quantities and a single SMRTbell library was prepared. We demonstrate successful de novo microbial assembly from all multiplexes tested (2- through 8-plex) using data generated from a single SMRTbell library, run on a single SMRT Cell with the PacBio RS II, and analyzed with standard SMRT Analysis assembly methods. This strategy was successful using both small (1.6 Mb, H. pylori) and medium (5 Mb, E. coli) genomes. This protocol facilitates the sequencing of multiple microbial genomes in a single run, greatly increasing throughput and reducing costs per genome.
Profiling complex population genomes with highly accurate single molecule reads: cow rumen microbiomes
Determining compositions and functional capabilities of complex populations is often challenging, especially for sequencing technologies with short reads that do not uniquely identify organisms or genes. Long-read sequencing improves the resolution of these mixed communities, but adoption for this application has been limited due to concerns about throughput, cost and accuracy. The recently introduced PacBio Sequel System generates hundreds of thousands of long and highly accurate single-molecule reads per SMRT Cell. We investigated how the Sequel System might increase understanding of metagenomic communities. In the past, focus was largely on taxonomic classification with 16S rRNA sequencing. Recent expansion to WGS sequencing enables functional profiling as well, with the ultimate goal of complete genome assemblies. Here we compare the complex microbiomes in 5 cow rumen samples, for which Illumina WGS sequence data was also available. To maximize the PacBio single-molecule sequence accuracy, libraries of 2 to 3 kb were generated, allowing many polymerase passes per molecule. The resulting reads were filtered at predicted single-molecule accuracy levels up to 99.99%. Community compositions of the 5 samples were compared with Illumina WGS assemblies from the same set of samples, indicating rare organisms were often missed with Illumina. Assembly from PacBio CCS reads yielded a contig >100 kb in length with 6-fold coverage. Mapping of Illumina reads to the 101 kb contig verified the PacBio assembly and contig sequence. Scaffolding with reads from a PacBio unsheared library produced a complete genome of 2.4 Mb. These results illustrate ways in which long accurate reads benefit analysis of complex communities.
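The accuracy filtering described above can be illustrated with a minimal sketch. It assumes the standard Phred relation Q = -10 log10(1 - accuracy), under which a predicted accuracy of 99.99% corresponds to roughly Q40. The function names, the read names, and the example accuracies are hypothetical and are not taken from the study or from any PacBio software.

```python
import math


def phred_from_accuracy(accuracy):
    """Convert a predicted read accuracy in [0, 1) to a Phred-scaled quality."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def filter_reads(reads, min_accuracy):
    """Keep (name, predicted_accuracy) pairs at or above min_accuracy."""
    return [(name, acc) for name, acc in reads if acc >= min_accuracy]


# Hypothetical reads with made-up predicted accuracies (illustration only).
example_reads = [("read_a", 0.9999), ("read_b", 0.995), ("read_c", 0.99)]
kept = filter_reads(example_reads, min_accuracy=0.999)
```

Raising `min_accuracy` trades read count for per-read accuracy, which is why filtering at several thresholds can be useful when comparing community profiles.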
For microbial sequencing on the PacBio Sequel System, the current yield per SMRT Cell exceeds project requirements. Multiplexing offers a viable solution, greatly increasing throughput and efficiency and reducing costs per genome. This approach is achieved by incorporating a unique barcode for each microbial sample into the SMRTbell adapters and using a streamlined library preparation process. To demonstrate performance, 12 unique barcodes were assigned to B. subtilis and sequenced on a single SMRT Cell. To further demonstrate the applicability of this method, we multiplexed the genomes of 16 strains of H. pylori. Each DNA sample was sheared to 10 kb, end-repaired and ligated with a barcoded adapter in a single-tube reaction. The barcoded samples were pooled in equimolar quantities and a single SMRTbell library was prepared. Successful de novo microbial assemblies were achieved from all multiplexes tested (12- and 16-plex) using data generated from a single SMRTbell library, run on a single SMRT Cell 1M with the PacBio Sequel System, and analyzed with standard SMRT Analysis assembly methods. Here, we describe a protocol that facilitated multiplexing of up to 12-plex of microbial genomes in one SMRT Cell 1M on the Sequel System and produced near-complete microbial de novo assemblies of <10 contigs for genomes <5 Mb in size.
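The equimolar pooling step can be sketched as a small calculation. This is an illustration under stated assumptions, not the published protocol: it assumes an average mass of about 660 g/mol per double-stranded base pair (a common approximation), and the sample names, concentrations, and target amount are hypothetical.

```python
# Assumed average molar mass of one double-stranded base pair (approximation).
AVG_BP_MASS_G_PER_MOL = 660.0


def mass_ng_for_fmol(target_fmol, fragment_bp):
    """Mass in ng of double-stranded DNA that contains target_fmol molecules."""
    # fmol (1e-15 mol) * g/mol gives 1e-15 g; converting to ng multiplies by 1e9.
    return target_fmol * fragment_bp * AVG_BP_MASS_G_PER_MOL / 1e6


def pooling_volumes_ul(concentrations_ng_per_ul, target_fmol, fragment_bp):
    """Volume in microliters of each sample needed to contribute target_fmol."""
    mass_ng = mass_ng_for_fmol(target_fmol, fragment_bp)
    volumes = {}
    for sample, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"concentration for {sample} must be positive")
        volumes[sample] = mass_ng / conc
    return volumes


# Hypothetical barcoded samples, all sheared to about 10 kb.
samples = {"strain_1": 50.0, "strain_2": 25.0}
volumes = pooling_volumes_ul(samples, target_fmol=10.0, fragment_bp=10_000)
```

When all samples share the same fragment size, equal moles reduce to equal mass, so only the measured concentrations change the pooled volumes.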
Understanding interactions among plants and the complex communities of organisms living on, in and around them requires more than one experimental approach. A new method for de novo metagenome assembly,…
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Genome data of Fusarium oxysporum f. sp. cubense race 1 and tropical race 4 isolates using long-read sequencing.
Fusarium wilt of banana is caused by the soil-borne fungal pathogen Fusarium oxysporum f. sp. cubense (Foc). We generated two chromosome-level assemblies of Foc race 1 and tropical race 4 strains using single-molecule real-time sequencing. The Foc1 and FocTR4 assemblies had 35 and 29 contigs with contig N50 lengths of 2.08 Mb and 4.28 Mb, respectively. These two new reference genomes represent a greater than 100-fold improvement over the contig N50 statistics of the previous short-read-based Foc assemblies. The two high-quality assemblies reported here will be a valuable resource for the comparative analysis of Foc races at the pathogenic levels.
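For readers unfamiliar with the contig N50 statistic quoted above, a minimal sketch follows. It uses the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the contig lengths are hypothetical and are not taken from the Foc assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kb (illustration only).
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about assembly correctness.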
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics. © 2019 The Royal Entomological Society.
The Complete Genome of the Atypical Enteropathogenic Escherichia coli Archetype Isolate E110019 Highlights a Role for Plasmids in Dissemination of the Type III Secreted Effector EspT.
Enteropathogenic Escherichia coli (EPEC) is a leading cause of moderate to severe diarrhea among young children in developing countries, and EPEC isolates can be subdivided into two groups. Typical EPEC (tEPEC) bacteria are characterized by the presence of both the locus of enterocyte effacement (LEE) and the plasmid-encoded bundle-forming pilus (BFP), which are involved in adherence and translocation of type III effectors into the host cells. Atypical EPEC (aEPEC) bacteria also contain the LEE but lack the BFP. In the current report, we describe the complete genome of outbreak-associated aEPEC isolate E110019, which carries four plasmids. Comparative genomic analysis demonstrated that the type III secreted effector EspT gene, an autotransporter gene, a hemolysin gene, and putative fimbrial genes are all carried on plasmids. Further investigation of 65 espT-containing E. coli genomes demonstrated that different espT alleles are associated with multiple plasmids that differ in their overall gene content from the E110019 espT-containing plasmid. EspT has been previously described with respect to its role in the ability of E110019 to invade host cells. While other type III secreted effectors of E. coli have been identified on insertion elements and prophages of the chromosome, we demonstrated in the current study that the espT gene is located on multiple unique plasmids. These findings highlight a role of plasmids in dissemination of a unique E. coli type III secreted effector that is involved in host invasion and severe diarrheal illness. Copyright © 2019 American Society for Microbiology.
Completing a genome is an important goal of genome assembly. However, many assemblies, including reference assemblies, are unfinished and have a number of gaps. Long reads obtained from third-generation sequencing (TGS) platforms can help close these gaps and improve assembly contiguity. However, current gap-closure approaches using long reads require extensive runtime and high memory usage. Thus, a fast and memory-efficient approach using long reads is needed to obtain complete genomes. We developed LR_Gapcloser to rapidly and efficiently close the gaps in genome assembly. This tool utilizes long reads generated from TGS sequencing platforms. Tested on de novo assembled gaps, repeat-derived gaps, and real gaps, LR_Gapcloser closed a higher number of gaps faster and with a lower error rate and a much lower memory usage than two existing, state-of-the-art tools. This tool utilized raw reads to fill more gaps than when using error-corrected reads. It is applicable to gaps in the assemblies by different approaches and from large and complex genomes. After performing gap-closure using this tool, the contig N50 size of the human CHM1 genome was improved from 143 kb to 19 Mb, a 132-fold increase. We also closed the gaps in the Triticum urartu genome, a large genome rich in repeats; the contig N50 size was increased by 40%. Further, we evaluated the contiguity and correctness of six hybrid assembly strategies by combining the optimal TGS-based and next-generation sequencing-based assemblers with LR_Gapcloser. A proposed and optimal hybrid strategy generated a new human CHM1 genome assembly with marked contiguity. The contig N50 value was greater than 28 Mb, which is larger than previous non-reference assemblies of the diploid human genome. LR_Gapcloser is a fast and efficient tool that can be used to close gaps and improve the contiguity of genome assemblies. A proposed hybrid assembly including this tool promises reference-grade assemblies.
The software is available at http://www.fishbrowser.org/software/LR_Gapcloser/.
Rapid antigen diversification through mitotic recombination in the human malaria parasite Plasmodium falciparum.
Malaria parasites possess the remarkable ability to maintain chronic infections that fail to elicit a protective immune response, characteristics that have stymied vaccine development and cause people living in endemic regions to remain at risk of malaria despite previous exposure to the disease. These traits stem from the tremendous antigenic diversity displayed by parasites circulating in the field. For Plasmodium falciparum, the most virulent of the human malaria parasites, this diversity is exemplified by the variant gene family called var, which encodes the major surface antigen displayed on infected red blood cells (RBCs). This gene family exhibits virtually limitless diversity when var gene repertoires from different parasite isolates are compared. Previous studies indicated that this remarkable genome plasticity results from extensive ectopic recombination between var genes during mitotic replication; however, the molecular mechanisms that direct this process to antigen-encoding loci while the rest of the genome remains relatively stable were not determined. Using targeted DNA double-strand breaks (DSBs) and long-read whole-genome sequencing, we show that a single break within an antigen-encoding region of the genome can result in a cascade of recombination events leading to the generation of multiple chimeric var genes, a process that can greatly accelerate the generation of diversity within this family. We also found that recombinations did not occur randomly, but rather high-probability, specific recombination products were observed repeatedly. These results provide a molecular basis for previously described structured rearrangements that drive diversification of this highly polymorphic gene family.
We characterized 170 complete genome assemblies from clinical Bordetella pertussis isolates representing geographic and temporal diversity in the United States. These data capture genotypic shifts, including increased pertactin deficiency, occurring amid the current pertussis disease resurgence and provide a foundation for needed research to direct future public health control strategies.
We present reference-quality genome assembly and annotation for the stout camphor tree (Cinnamomum kanehirae (Laurales, Lauraceae)), the first sequenced member of the Magnoliidae comprising four orders (Laurales, Magnoliales, Canellales and Piperales) and over 9,000 species. Phylogenomic analysis of 13 representative seed plant genomes indicates that magnoliid and eudicot lineages share more recent common ancestry than monocots. Two whole-genome duplication events were inferred within the magnoliid lineage: one before divergence of Laurales and Magnoliales and the other within the Lauraceae. Small-scale segmental duplications and tandem duplications also contributed to innovation in the evolutionary history of Cinnamomum. For example, expansion of the terpenoid synthase gene subfamilies within the Laurales spawned the diversity of Cinnamomum monoterpenes and sesquiterpenes.
A new study “recompletes” the C. elegans genome sequence, revealing hitherto unseen genes.
Long-read sequencing, CENP-A ChIP, and chromatin fiber imaging reveal the composition and organization of Drosophila melanogaster centromeres, which have long remained elusive despite the high quality of this species’ genome assembly.