Recently developed third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
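The homopolymer compression idea can be illustrated with a minimal Python sketch. This is not LSC's actual code: the function names and the example reads are hypothetical, and the sketch only assumes that HC collapses each run of identical bases to a single base while remembering the run lengths.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base.

    Returns the compressed sequence and the original run lengths,
    which are needed to restore the uncompressed sequence later.
    """
    bases, run_lengths = [], []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def homopolymer_decompress(compressed, run_lengths):
    """Invert homopolymer_compress using the stored run lengths."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Hypothetical reads: the long read carries extra bases in two homopolymer runs.
long_read = "AAACGGGGTTA"
short_read = "AACGGGTTA"

lr_hc, lr_runs = homopolymer_compress(long_read)
sr_hc, sr_runs = homopolymer_compress(short_read)

# After compression the two reads are identical, so an aligner working in
# compressed space is not penalized for homopolymer length errors.
print(lr_hc, sr_hc, lr_hc == sr_hc)
```

In this toy case the run lengths taken from the short read could then be used to rebuild a corrected long read with `homopolymer_decompress(lr_hc, sr_runs)`. How LSC itself resolves run lengths from multiple aligned short reads is not described in the abstract.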
Improved sequencing accuracy was obtained with 16S amplicons from environmental samples and a known pure culture when upgraded Pacific Biosciences (PacBio) hardware and enzymes were used for the single molecule, real-time (SMRT) sequencing platform. The new PacBio RS II system with P4/C2 chemistry, when used with previously constructed libraries (Mosher et al., 2013), surpassed the accuracy of the Roche/454 pyrosequencing platform. With accurate read lengths of >1400 base pairs, the PacBio system opens up the possibility of identifying microorganisms to the species level in environmental samples. Copyright © 2014 Elsevier B.V. All rights reserved.
A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing.
RNA-sequencing (RNA-seq) enables in-depth exploration of transcriptomes, but typical sequencing depth often limits its comprehensiveness. In this study, we generated nearly 3 billion RNA-Seq reads, totaling 341 Gb of sequence, from a Zea mays seedling sample. At this depth, a near complete snapshot of the transcriptome was observed consisting of over 90% of the annotated transcripts, including lowly expressed transcription factors. A novel hybrid strategy combining de novo and reference-based assemblies yielded a transcriptome consisting of 126,708 transcripts with 88% of expressed known genes assembled to full-length. We improved current annotations by adding 4,842 previously unannotated transcript variants and many new features, including 212 maize transcripts, 201 genes, 10 genes with undocumented potential roles in seedlings as well as maize lineage specific gene fusion events. We demonstrated the power of deep sequencing for large transcriptome studies by generating a high quality transcriptome, which provides a rich resource for the research community.
IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform.
Here, we sequenced and functionally annotated the long-read (1-2 kb) cDNA library of an infratentorial ependymoma tumor tissue on the PacBio RSII by the Iso-Seq protocol using SMRT technology. A total of 577 MB of data was generated from the brain tissue of an ependymoma tumor patient, producing 119,313 high-quality reads assembled into 19,878 contigs using the Celera assembler followed by the Quiver pipeline, which produced 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared GO terms of the second and third levels with alternative splicing data obtained through HTA Array2.0, we identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores, respectively, with the alternative splicing index predicting mainly the major pathways of the hallmarks of cancer. Of these transcript cluster IDs, only those of the genes PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at the gene-level expression (GE) and transcript-level (TE) tracks. Most importantly, the brain-specific genes PNMT, SNN and LAMB1 show their involvement in ependymoma.
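For readers unfamiliar with the RPKM unit mentioned above, the following short Python sketch shows the standard definition (read count scaled per kilobase of exon model and per million mapped reads). The function name and the numbers in the example are illustrative assumptions and are not taken from the study.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads.

    RPKM = read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Illustrative values only: 500 reads on a 2 kb exon model, 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)
```

Because the value is normalized by both transcript length and library size, it allows rough comparison of expression between genes of different lengths and between libraries of different depths.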
PacBio RS II is the first commercialized third-generation DNA sequencer able to sequence a single molecule DNA in real-time without amplification. PacBio RS II’s sequencing technology is novel and unique, enabling the direct observation of DNA synthesis by DNA polymerase. PacBio RS II confers four major advantages compared to other sequencing technologies: long read lengths, high consensus accuracy, a low degree of bias, and simultaneous capability of epigenetic characterization. These advantages surmount the obstacle of sequencing genomic regions such as high/low G+C, tandem repeat, and interspersed repeat regions. Moreover, PacBio RS II is ideal for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetics characterization. With PacBio RS II, we have sequenced and analyzed the genomes of many species, from viruses to humans. Herein, we summarize and review some of our key genome sequencing projects, including full-length viral sequencing, complete bacterial genome and almost-complete plant genome assemblies, and long amplicon sequencing of a disease-associated gene region. We believe that PacBio RS II is not only an effective tool for use in the basic biological sciences but also in the medical/clinical setting.
16S rRNA long-read sequencing of the granulation tissue from nonsmokers and smokers-severe chronic periodontitis patients
Smoking has been associated with increased risk of periodontitis. The aim of the present study was to compare the periodontal disease severity among smokers and nonsmokers, which may help in better understanding the predisposition to this chronic inflammation-mediated disease. We selected deep-seated infected granulation tissue removed during periodontal flap surgery procedures for identification and differential abundance analysis of resident bacterial species among smokers and nonsmokers through long-read sequencing technology targeting the full-length 16S rRNA gene. A total of 8 phyla were identified, among which Firmicutes and Bacteroidetes were the most dominant. Differential abundance analysis of OTUs through PICRUST showed significant (p>0.05) abundance of phylum Fusobacteria (Streptobacillus moniliformis), phylum Firmicutes (Streptococcus equi), and phylum Proteobacteria (Enhydrobacter aerosaccus) in nonsmokers compared to smokers. The differential abundance of oral metagenomes in smokers showed significant enrichment of host genes modulating pathways involving primary immunodeficiency, citrate cycle, streptomycin biosynthesis, vitamin B6 metabolism, butanoate metabolism, and glycine, serine, and threonine metabolism. In contrast, thiamine metabolism, amino acid metabolism, homologous recombination, epithelial cell signaling, aminoacyl-tRNA biosynthesis, phosphonate/phosphinate metabolism, polycyclic aromatic hydrocarbon degradation, synthesis and degradation of ketone bodies, translation factors, ascorbate and aldarate metabolism, and DNA replication pathways were significantly enriched in nonsmokers. Modulation of these pathways in oral cavities due to differential enrichment of metagenomes in smokers may lead to an increased susceptibility to infections and/or higher formation of DNA adducts, which may increase the risk of carcinogenesis.
High-throughput sequencing of 16S rRNA gene amplicons has revolutionized the capacity and depth of microbial community profiling. Several sequencing platforms are available, but most phylogenetic studies are performed on the 454-pyrosequencing platform because its longer reads can give finer phylogenetic resolution. The Pacific Biosciences (PacBio) sequencing platform is significantly less expensive per run, does not rely on amplification for library generation, and generates reads that are, on average, four times longer than those from 454 (C2 chemistry), but the resulting high error rates appear to preclude its use in phylogenetic profiling. Recently, however, the PacBio platform was used to characterize four electrosynthetic microbiomes to the genus-level for less than USD 1,000 through the use of PacBio’s circular consensus sequence technology. Here, we describe in greater detail: 1) the output from successful 16S rRNA gene amplicon profiling with PacBio, 2) how the analysis was contingent upon several alterations to standard bioinformatic quality control workflows, and 3) the advantages and disadvantages of using the PacBio platform for community profiling.
Dynamic regulation of HIV-1 mRNA populations analyzed by single-molecule enrichment and long-read sequencing.
Alternative RNA splicing greatly expands the repertoire of proteins encoded by genomes. Next-generation sequencing (NGS) is attractive for studying alternative splicing because of the efficiency and low cost per base, but short reads typical of NGS only report mRNA fragments containing one or few splice junctions. Here, we used single-molecule amplification and long-read sequencing to study the HIV-1 provirus, which is only 9700 bp in length, but encodes nine major proteins via alternative splicing. Our data showed that the clinical isolate HIV-1(89.6) produces at least 109 different spliced RNAs, including a previously unappreciated ~1 kb class of messages, two of which encode new proteins. HIV-1 message populations differed between cell types, longitudinally during infection, and among T cells from different human donors. These findings open a new window on a little studied aspect of HIV-1 replication, suggest therapeutic opportunities and provide advanced tools for the study of alternative splicing.
Personal transcriptomes in which all of an individual’s genetic variants (e.g., single nucleotide variants) and transcript isoforms (transcription start sites, splice sites, and polyA sites) are defined and quantified for full-length transcripts are expected to be important for understanding individual biology and disease, but have not been described previously. To obtain such transcriptomes, we sequenced the lymphoblastoid transcriptomes of three family members (GM12878 and the parents GM12891 and GM12892) by using a Pacific Biosciences long-read approach complemented with Illumina 101-bp sequencing and made the following observations. First, we found that reads representing all splice sites of a transcript are evident for most sufficiently expressed genes ≤3 kb and often for genes longer than that. Second, we added and quantified previously unidentified splicing isoforms to an existing annotation, thus creating the first personalized annotation to our knowledge. Third, we determined SNVs in a de novo manner and connected them to RNA haplotypes, including HLA haplotypes, thereby assigning single full-length RNA molecules to their transcribed allele, and demonstrated Mendelian inheritance of RNA molecules. Fourth, we show how RNA molecules can be linked to personal variants on a one-by-one basis, which allows us to assess differential allelic expression (DAE) and differential allelic isoforms (DAI) from the phased full-length isoform reads. The DAI method is largely independent of the distance between exon and SNV, in contrast to fragmentation-based methods. Overall, in addition to improving eukaryotic transcriptome annotation, these results describe, to our knowledge, the first large-scale and full-length personal transcriptome.
Background: Alterations of oral microbiota are the main cause of the progression of caries. The goal of this study was to characterize the oral microbiota in childhood caries based on single-molecule real-time sequencing. Methods: A total of 21 preschoolers, aged 3-5 years old with severe early childhood caries, and 20 age-matched, caries-free children as controls were recruited. Saliva samples were collected, followed by DNA extraction, PacBio sequencing and phylogenetic analyses of the oral microbial communities. Results: 876 species derived from 13 known bacterial phyla and 110 genera were detected from 41 children using PacBio sequencing. At the species level, 38 species, including Veillonella spp., Streptococcus spp., Prevotella spp. and Lactobacillus spp., showed higher abundance in the caries group compared to the caries-free group (p<0.05). The core microbiota at the genus and species levels was more stable in the caries-free micro-ecological niche. At follow-up oral examinations 6 months after sample collection, development of new dental caries was observed in 5 children (the transitional group) among the 21 caries-free children. Six species that were more abundant in the caries-free group exhibited a relatively low abundance in both the caries group and the transitional group (p<0.05). We conclude that Abiotrophia spp., Neisseria spp. and Veillonella spp. are essential for maintaining a healthy oral microbial ecosystem. Prevotella spp., Lactobacillus spp., Dialister spp. and Filifactor spp. may be related to the pathogenesis and progression of dental caries.
Exploring the genome and transcriptome of the cave nectar bat Eonycteris spelaea with PacBio long-read sequencing.
In the past two decades, bats have emerged as an important model system to study host-pathogen interactions. More recently, it has been shown that bats may also serve as a new and excellent model to study aging, inflammation, and cancer, among other important biological processes. The cave nectar bat or lesser dawn bat (Eonycteris spelaea) is known to be a reservoir for several viruses and intracellular bacteria. It is widely distributed throughout the tropics and subtropics from India to Southeast Asia and pollinates several plant species, including the culturally and economically important durian in the region. Here, we report the whole-genome and transcriptome sequencing, followed by subsequent de novo assembly, of the E. spelaea genome solely using the Pacific Biosciences (PacBio) long-read sequencing platform. The newly assembled E. spelaea genome is 1.97 Gb in length and consists of 4,470 sequences with a contig N50 of 8.0 Mb. Identified repeat elements covered 34.65% of the genome, and 20,640 unique protein-coding genes with 39,526 transcripts were annotated. We demonstrated that the PacBio long-read sequencing platform alone is sufficient to generate a comprehensive de novo assembled genome and transcriptome of an important bat species. These results will provide useful insights and act as a resource to expand our understanding of bat evolution, ecology, physiology, immunology, viral infection, and transmission dynamics.
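The contig N50 statistic quoted above can be computed as in the following minimal Python sketch. It uses the common definition (the length L such that contigs of length at least L together cover at least half of the total assembly). The function name and the toy contig lengths are assumptions for illustration, not data from the E. spelaea assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length


# Toy assembly (lengths in arbitrary units), not real data.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

A larger N50 relative to genome size indicates a more contiguous assembly, which is why it is commonly reported alongside total assembly length and sequence count.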
Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data.
The use of sequencing technologies to investigate the microbiome of a sample can positively impact patient healthcare by providing therapeutic targets for personalized disease treatment. However, these samples contain genomic sequences from various sources that complicate the identification of pathogens. Here we present Clinical PathoScope, a pipeline to rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. We have accomplished three essential tasks in the development of Clinical PathoScope. First, we developed an optimized framework for pathogen identification using a computational subtraction methodology in concordance with read trimming and ambiguous read reassignment. Second, we have demonstrated the ability of our approach to identify multiple pathogens in a single clinical sample, accurately identify pathogens at the subspecies level, and determine the nearest phylogenetic neighbor of novel or highly mutated pathogens using real clinical sequencing data. Finally, we have shown that Clinical PathoScope outperforms previously published pathogen identification methods with regard to computational speed, sensitivity, and specificity. Clinical PathoScope is the only pathogen identification method currently available that can identify multiple pathogens from mixed samples and distinguish between very closely related species and strains in samples with very few reads per pathogen. Furthermore, Clinical PathoScope does not rely on genome assembly and thus can more rapidly complete the analysis of a clinical sample when compared with current assembly-based methods. Clinical PathoScope is freely available at: http://sourceforge.net/projects/pathoscope/.
Recent advances in sequencing technologies have transformed the field of virus discovery and virome analysis. Individual virus discovery based on traditional Sanger sequencing has now been entirely replaced by high throughput sequencing (HTS) based virus metagenomics, which can be used to characterize the nature and composition of entire viromes. To better harness the potential of HTS for the study of viromes, sample preparation methodologies use different approaches to exclude amplification of non-viral components that can overshadow low-titer viruses. These virus-sequence enrichment approaches mostly focus on the sample preparation methods, like enzymatic digestion of non-viral nucleic acids and size exclusion of non-viral constituents by column filtration, ultrafiltration or density gradient centrifugation. Recently, however, a new approach to virus-sequence enrichment called virome-capture sequencing, focused on the amplification or HTS library preparation stage, was developed to increase the ability of virome characterization. This new approach has the potential to further transform the field of virus discovery and virome analysis, but its technical complexity and sequence-dependence warrant further improvements. In this review we discuss the different methods for selective sequencing-based virome analysis, their applications and evolution, and also propose refinements needed to harness the full potential of HTS for virome analysis. Copyright © 2017 Elsevier B.V. All rights reserved.
The methylome of the gut microbiome: disparate Dam methylation patterns in intestinal Bacteroides dorei
Despite the large interest in the human microbiome in recent years, there are no reports of bacterial DNA methylation in the microbiome. Here metagenomic sequencing using the Pacific Biosciences platform allowed for rapid identification of the GATC methylation status of a bacterial species in human stool samples. For this work, two stool samples were chosen that were dominated by a single species, Bacteroides dorei. Based on 16S rRNA analysis, this species represented over 45% of the bacteria present in these two samples. The B. dorei genome sequence from these samples was determined and the GATC methylation sites mapped. The B. dorei genome from one subject lacked any GATC methylation and lacked the DNA adenine methyltransferase genes. In contrast, B. dorei from another subject contained 20,551 methylated GATC sites. Of the 4970 open reading frames identified in the GATC methylated B. dorei genome, 3184 genes were methylated, and a further 1735 GATC methylations were found in intergenic regions. These results suggest that DNA methylation patterns are important to consider in multi-omic analyses of microbiome samples seeking to discover the diversity of bacterial functions and may differ between disease states.
Genome-wide transcriptome profiling of the medicinal plant Zanthoxylum planispinum using a single-molecule direct RNA sequencing approach.
High-throughput RNA sequencing has revolutionized transcriptome-based studies of candidate genes, key pathways and gene regulation in non-model organisms. We analyzed full-length cDNA sequences in Zanthoxylum planispinum (Z. planispinum), a medicinal herb in major parts of East Asia. The full-length mRNAs derived from leaf, early fruit and maturing fruit tissues were sequenced using the PacBio RSII platform to identify the isoform transcriptome. We obtained 51,402 unigenes, with an average of 1781 bp per gene in 82.473 Mb gene lengths. Among the 51,402 unigenes, 3963 showed a variety of isoforms. By selecting one representative gene from each set of isoforms, we finalized a set of 46,306 unique genes for this herb. We identified 76 cytochrome P450 (CYP450) and related isoforms with wide diversity in molecular function and biological process. These transcriptome data of Z. planispinum will provide a good resource to study metabolic engineering for the production of valuable medicinal drugs and phytochemicals. Copyright © 2018. Published by Elsevier Inc.