For comprehensive metabolic reconstructions and a resulting understanding of the pathways leading to natural products, it is desirable to obtain complete information about the genetic blueprint of the organisms used. Traditional Sanger and next-generation, short-read sequencing technologies have shortcomings with respect to read lengths and DNA-sequence context bias, leading to fragmented and incomplete genome information. The development of long-read, single molecule, real-time (SMRT) DNA sequencing from Pacific Biosciences, with >10,000 bp average read lengths and a lack of sequence context bias, now allows for the generation of complete genomes in a fully automated workflow. In addition to the genome sequence, DNA methylation is characterized in the process of sequencing. PacBio® sequencing has also been applied to microbial transcriptomes. Long reads enable sequencing of full-length cDNAs allowing for identification of complete gene and operon sequences without the need for transcript assembly. We will highlight several examples where these capabilities have been leveraged in the areas of industrial microbiology, including biocommodities, biofuels, bioremediation, new bacteria with potential commercial applications, antibiotic discovery, and livestock/plant microbiome interactions.
From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome
Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the assembly surpasses reference assemblies, de novo, with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better-quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and with minimal manual effort.
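The idea of finding overlapping reads with MinHash can be illustrated with a minimal sketch. This is not the MHAP implementation: the function names (`kmers`, `minhash_sketch`, `estimate_jaccard`), the k-mer size, the number of hash functions, and the use of seeded BLAKE2 hashes are all assumptions made for illustration, and reverse-complement k-mers are ignored. Two reads that share many k-mers are likely to share the same minimum hash value under each hash function, so the fraction of matching minima estimates the Jaccard similarity of their k-mer sets.

```python
import hashlib


def kmers(seq, k=8):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Sketch a read as the minimum hash value under each of num_hashes functions."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate k-mer set similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

In an overlapper built on this idea, only read pairs whose estimated similarity exceeds some threshold would be passed on to a more expensive alignment step; the threshold itself would need tuning for the error rate of the reads.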
Reference quality de novo genome assemblies were once solely the domain of large, well-funded genome projects. While next-generation short read technology removed some of the cost barriers, accurate chromosome-scale assembly remains a real challenge. Here we present efforts to de novo assemble the goat (Capra hircus) genome. Through the combination of single-molecule technologies from Pacific Biosciences (sequencing) and BioNano Genomics (optical mapping) coupled with high-throughput chromosome conformation capture sequencing (Hi-C), an inbred San Clemente goat genome has been sequenced and assembled to a high degree of completeness at a relatively modest cost. Starting with 38 million PacBio reads, we integrated the MinHash Alignment Process (MHAP) with the Celera Assembler (CA) to produce an assembly composed of 3110 contigs with a contig N50 size of 4.7 Mb. This assembly was scaffolded with BioNano genome maps derived from a single IrysChip into 333 scaffolds with an N50 of 23.1 Mb including the complete scaffolding of chromosome 20. Finally, cis-chromosome associations were determined by Hi-C, yielding complete reconstruction of all autosomes into single scaffolds with a final N50 of 91.7 Mb. We hope to demonstrate that our methods are not only cost effective, but improve our ability to annotate challenging genomic regions such as highly repetitive immune gene clusters.
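The contig and scaffold N50 values quoted above follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths in the usage note are illustrative and are not taken from the goat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8, because the three longest sequences (10 + 9 + 8 = 27) cover half of the total length of 54.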
MaSuRCA Mega-Reads Assembly Technique for haplotype resolved genome assembly of hybrid PacBio and Illumina Data
Developments in DNA sequencing technology over the past several years have enabled a large number of scientists to obtain sequences for the genomes of interest to them at fairly low cost. Illumina sequencing has been the dominant whole-genome sequencing technology over the past few years because of its low cost. Illumina reads are short (up to 300 bp), however, so most draft genomes produced from Illumina data are highly fragmented, which limits their usability in practical scenarios. Longer reads are needed for more contiguous genomes. Recently, PacBio sequencing has made significant advances in cost-effective long-read (>10,000 bp) sequencing, and its data, although several times more expensive than Illumina data, can be used to produce high-quality genomes. PacBio data alone can be used for de novo assembly; however, because of the high error rate, high coverage of the genome is required, which raises the cost barrier. A solution for cost-effective genomes is to combine PacBio and Illumina data, leveraging the low error rate of the short Illumina reads and the length of the PacBio reads. We have developed the MaSuRCA mega-reads assembler for efficient assembly of hybrid data sets, and we demonstrate that it performs well compared to other published hybrid techniques. Another important benefit of long reads is their ability to link haplotype differences. The mega-reads approach corrects each PacBio read independently, and thus haplotype differences are preserved. By leveraging the accuracy of the Illumina data and the length of the PacBio reads, MaSuRCA mega-reads can therefore produce haplotype-resolved genome assemblies, in which each contig contains sequence from a single haplotype. We present preliminary results on haplotype-resolved genome assemblies of faux (proof-of-concept) and real data.
Profiling complex population genomes with highly accurate single molecule reads: cow rumen microbiomes
Determining compositions and functional capabilities of complex populations is often challenging, especially for sequencing technologies with short reads that do not uniquely identify organisms or genes. Long-read sequencing improves the resolution of these mixed communities, but adoption for this application has been limited due to concerns about throughput, cost and accuracy. The recently introduced PacBio Sequel System generates hundreds of thousands of long and highly accurate single-molecule reads per SMRT Cell. We investigated how the Sequel System might increase understanding of metagenomic communities. In the past, focus was largely on taxonomic classification with 16S rRNA sequencing. Recent expansion to WGS sequencing enables functional profiling as well, with the ultimate goal of complete genome assemblies. Here we compare the complex microbiomes in 5 cow rumen samples, for which Illumina WGS sequence data was also available. To maximize the PacBio single-molecule sequence accuracy, libraries of 2 to 3 kb were generated, allowing many polymerase passes per molecule. The resulting reads were filtered at predicted single-molecule accuracy levels up to 99.99%. Community compositions of the 5 samples were compared with Illumina WGS assemblies from the same set of samples, indicating rare organisms were often missed with Illumina. Assembly from PacBio CCS reads yielded a contig >100 kb in length with 6-fold coverage. Mapping of Illumina reads to the 101 kb contig verified the PacBio assembly and contig sequence. Scaffolding with reads from a PacBio unsheared library produced a complete genome of 2.4 Mb. These results illustrate ways in which long accurate reads benefit analysis of complex communities.
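The accuracy thresholds mentioned above map onto the Phred quality scale through the standard relation Q = -10 log10(1 - accuracy), so a predicted accuracy of 99.99% corresponds to Q40. The sketch below shows that conversion and a simple filter; the function names and the representation of a read as a `(name, predicted_accuracy)` tuple are assumptions for illustration and do not reflect any PacBio file format or tool.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a predicted read accuracy (0 <= accuracy < 1) to a Phred-scaled quality."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def filter_reads(reads, min_accuracy):
    """Keep (name, predicted_accuracy) tuples at or above the accuracy cutoff."""
    return [read for read in reads if read[1] >= min_accuracy]
```

Raising the cutoff trades read count for per-read accuracy, which is why libraries short enough to allow many polymerase passes per molecule help retain reads at the strictest thresholds.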
PacBio Sequencing is characterized by very long sequence reads (averaging > 10,000 bases), lack of GC-bias, and high consensus accuracy. These features have allowed the method to provide a new…
In this AGBT 2017 talk, PacBio CSO Jonas Korlach provided a technology roadmap for the Sequel System, including plans to continue performance and throughput increases through early 2019. Per SMRT…
User Group Meeting: Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
In this PacBio User Group Meeting presentation, PacBio scientist Meredith Ashby shared several examples of analysis — from full-length 16S sequencing to shotgun sequencing — showing how SMRT Sequencing enables…
Understanding interactions among plants and the complex communities of organisms living on, in and around them requires more than one experimental approach. A new method for de novo metagenome assembly,…
A Gram-stain-negative bacterial strain, designated CA10T, was isolated from bovine raw milk sampled in Anseong, Republic of Korea. Cells were yellow-pigmented, aerobic, non-motile bacilli and grew optimally at 30 °C and pH 7.0 on tryptic soy agar without supplementation of NaCl. Phylogenetic analysis based on the 16S rRNA gene sequences revealed that strain CA10T belonged to the genus Chryseobacterium, family Flavobacteriaceae, and was most closely related to Chryseobacterium indoltheticum ATCC 27950T (98.75% similarity). The average nucleotide identity and digital DNA-DNA hybridization values of strain CA10T were 94.4 and 56.9%, respectively, relative to Chryseobacterium scophthalmum DSM 16779T, being lower than the cut-off values of 95-96 and 70%, respectively. The predominant respiratory quinone was menaquinone-6; major polar lipid, phosphatidylethanolamine; major fatty acids, iso-C15:0, summed feature 9 (iso-C17:1 ω9c and/or C16:0 10-methyl), summed feature 3 (iso-C15:0 2-OH and/or C16:1 ω7c) and iso-C17:0 3-OH. The results of physiological, chemotaxonomic and biochemical analyses suggested that strain CA10T is a novel species of genus Chryseobacterium, for which the name Chryseobacterium mulctrae sp. nov. is proposed. The type strain is CA10T (=KACC 21234T=JCM 33443T).
Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods. Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs. Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates. Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
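The binning step described above can be sketched in a few lines. This is a simplified illustration, not the published trio-binning implementation: the function names, the tiny k-mer size used in the usage note, and the plain majority vote are assumptions, and the sketch ignores reverse complements, sequencing errors, and any normalization by the number of parent-specific k-mers. K-mers seen in only one parent's short reads mark that parent's haplotype, and each offspring long read is assigned to the parent whose specific k-mers it contains more of.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    maternal = set().union(*(kmers(read, k) for read in maternal_reads))
    paternal = set().union(*(kmers(read, k) for read in paternal_reads))
    return maternal - paternal, paternal - maternal


def bin_read(read, maternal_only, paternal_only, k):
    """Assign an offspring long read to a parental bin by counting specific k-mers."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie
```

With toy inputs `maternal_reads = ["ACGTACGTAA"]`, `paternal_reads = ["ACGTTCGTAA"]` and `k = 4`, the read `"CGTACG"` is binned as maternal and `"GTTCGT"` as paternal, while `"ACGT"` carries no parent-specific k-mer and stays unassigned. Higher heterozygosity yields more parent-specific k-mers, which is why an interspecies cross allows nearly all reads to be assigned.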
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.
The ruminants are one of the most successful mammalian lineages, exhibiting morphological and habitat diversity and containing several key livestock species. To better understand their evolution, we generated and analyzed de novo assembled genomes of 44 ruminant species, representing all six Ruminantia families. We used these genomes to create a time-calibrated phylogeny to resolve topological controversies, overcoming the challenges of incomplete lineage sorting. Population dynamic analyses show that population declines commenced between 100,000 and 50,000 years ago, which is concomitant with expansion in human populations. We also reveal genes and regulatory elements that possibly contribute to the evolution of the digestive system, cranial appendages, immune system, metabolism, body size, cursorial locomotion, and dentition of the ruminants. Copyright © 2019 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
The Genome Sequence of the Halobacterium salinarum Type Strain Is Closely Related to That of Laboratory Strains NRC-1 and R1.
High-coverage long-read sequencing of the Halobacterium salinarum type strain (91-R6) revealed a 2.17-Mb chromosome and two large plasmids (148 and 102 kb). Population heterogeneity and long repeats were observed. Strain 91-R6 and laboratory strain R1 showed 99.63% sequence identity in common chromosomal regions and only 38 strain-specific segments. This information resolves the previously uncertain relationship between type and laboratory strains. Copyright © 2019 Pfeiffer et al.