Background: Microbial ecology is reshaping our understanding of the natural world by revealing the large phylogenetic and functional diversity of microbial life. However, the vast majority of these microorganisms remain poorly understood, as most cultivated representatives belong to just four phylogenetic groups and more than half of all identified phyla remain uncultivated. Characterization of this microbial ‘dark matter’ will thus greatly benefit from new metagenomic methods for in situ analysis, for example sensitive, high-throughput methods that characterize community composition and structure by sequencing conserved marker genes. Methods: Here we utilize Single Molecule Real-Time (SMRT) sequencing of full-length 16S rRNA amplicons to phylogenetically profile microbial communities to below the genus level. We test this method on a mock community of known composition, as well as a previously studied microbial community from a lake known to predominantly contain poorly characterized phyla. These results are compared to traditional 16S tag sequencing from short-read technologies and to subsets of the full-length data corresponding to the same regions of the 16S gene. Results: We explore the benefits of using full-length amplicons for estimating community structure and diversity. In addition, we investigate the possible effects of context-specific and GC-content biases known to affect short-read sequencing technologies on the predicted community structure. We characterize the potential benefits of profiling metagenomic communities with full-length 16S rRNA genes from SMRT sequencing relative to standard methods.
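As an illustration of why classification depth matters when estimating diversity, the sketch below computes a Shannon index from per-read taxon labels. The abstract does not name a diversity measure, so the choice of the Shannon index, the function name `shannon_index`, and the toy labels are all assumptions made for this example, not the authors' analysis.

```python
import math
from collections import Counter


def shannon_index(assignments):
    """Shannon diversity H = -sum(p_i * ln p_i) over the taxa observed.

    `assignments` holds one taxon label per read (hypothetical input format).
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one classified read")
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Toy data (invented): the same eight reads labeled at two taxonomic depths.
genus_labels = ["Bacillus"] * 4 + ["Escherichia"] * 4
species_labels = (
    ["Bacillus subtilis"] * 2
    + ["Bacillus cereus"] * 2
    + ["Escherichia coli"] * 4
)

print("genus-level H:  ", round(shannon_index(genus_labels), 3))
print("species-level H:", round(shannon_index(species_labels), 3))
```

In this toy case the species-level labels split one genus into two taxa, so the estimated diversity rises; labels that stop at the genus level cannot show that split.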
The assembly of metagenomes is dramatically improved by the long read lengths of SMRT Sequencing. This is demonstrated in an experimental design to sequence a mock community from the Human Microbiome Project, and assemble the data using the hierarchical genome assembly process (HGAP) at Pacific Biosciences. Results of this analysis are promising, and display much improved contiguity in the assembly of the mock community as compared to publicly available short-read data sets and assemblies. Additionally, base modification information can be used to make further associations between contigs, which improves assemblies and helps distinguish between members within a microbial community. The epigenetic approach is a novel validation method unique to SMRT Sequencing. In addition to whole-genome shotgun sequencing, SMRT Sequencing also offers improved classification resolution and reliability of metagenomic and microbiome samples by the full-length sequencing of 16S rRNA (~1500 bases long). Microbial communities can be detected at the species level in some cases, rather than being limited to the genus taxonomic classification as constrained by short-read technologies. SMRT Sequencing of these metagenomic samples achieved >99% predicted concordance to reference sequences in cecum, soil, water, and mock control investigations for bacterial 16S. Community samples are estimated to contain from 2.3 to 15 times as many species with abundance levels as low as 0.05% compared to the identification of phyla groups.
Significant advances in bioinformatics tool development have been made to more efficiently leverage and deliver high-quality genome assemblies with PacBio long-read data. Current data throughput of SMRT Sequencing delivers average read lengths of 10-15 kb, with the longest reads exceeding 40 kb. This has resulted in consistent demonstration of a minimum 10-fold improvement in genome assemblies, with contig N50 in the megabase range, compared to assemblies generated using only short-read technologies. This poster highlights recent advances and resources available for advanced bioinformaticians and developers interested in the current state-of-the-art large genome solutions available as open-source code from PacBio and third-party solutions, including HGAP, MHAP, and ECTools. Resources and tools available on GitHub are reviewed, as well as datasets representing major model research organisms made publicly available for community evaluation and for interested developers.
The free-living flatworm, Macrostomum lignano, much like its better known planarian relative, Schmidtea mediterranea, has an impressive regenerative capacity. Following injury, this species has the ability to regenerate an almost entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neoblasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to irreversible degeneration of the animal. This set of unique properties makes a subset of flatworms attractive organisms for studying the evolution of pathways involved in tissue self-renewal, cell fate specification, and regeneration. The use of these organisms as models, however, is hampered by the lack of well-assembled and annotated genome sequences, which are fundamental to modern genetic and molecular studies. Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of M. lignano is remarkably complex, with ~75% of its sequence composed of simple repeats and transposon sequences. This has made high-quality assembly from Illumina reads alone impossible (N50=222 bp). We therefore generated 130X coverage by long sequencing reads from the PacBio platform to create a substantially improved assembly with an N50 of 64 Kbp. We complemented the reference genome with an assembled and annotated transcriptome, and used both of these datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution and phylogeny but also of regeneration and somatic pluripotency.
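The N50 values quoted above summarize assembly contiguity: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the toy contig lengths are illustrative and are not taken from the M. lignano assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (invented for illustration).
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))
```

Because the statistic depends on the sorted cumulative sum, a handful of long contigs can raise N50 substantially even when many short contigs remain in the assembly.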
Full-length gene capture solutions offer opportunities to screen and characterize structural variations and genetic diversity to understand key traits in plants and animals. Through a combined Roche NimbleGen probe capture and SMRT Sequencing strategy, we demonstrate the capability to resolve complex gene structures often observed in plant defense and developmental genes spanning multiple kilobases. The custom panel includes members of the WRKY plant-defense-signaling family, members of the NB-LRR disease-resistance family, and developmental genes important for flowering. The presence of repetitive structures and low-complexity regions makes short-read sequencing of these genes difficult, yet this approach allows researchers to obtain complete sequences for unambiguous resolution of gene models. This strategy has been applied to genomic DNA samples from soybean coupled with barcoding for multiplexing.
Collection of major HLA allele sequences in the Japanese population toward precise NGS-based HLA DNA typing at the field 4 level
We previously reported on the use of the Ion PGM next generation sequencing (NGS) platform to genotype HLA class I and class II genes by a super-high resolution, single-molecule, sequence-based typing (SS-SBT) method (Shiina et al. 2012). However, HLA alleles could not be assigned at the field 4 level at some HLA loci such as DQA1, DPA1 and DPB1 because the SNP and indel densities were too low to identify and separate both of the phases. In this regard, we have now added the single molecule, real-time (SMRT) DNA sequencer PacBio RS II method to our analysis in order to test whether it might determine the HLA allele sequences in some of the loci with which we previously had difficulties. In this study, we report on sequence-based genotyping of entire HLA gene sequences from the promoter-enhancer region to 3’UTR of the major HLA loci (A, B, C, DRB1, DRB345, DQA1, DQB1, DPA1 and DPB1) using 46 Japanese reference subjects who represented a distribution of more than 99.5% of the HLA alleles at each of the HLA loci and the PacBio RS II and Ion PGM systems.
Early detection of colorectal cancer (CRC) and its precursor lesions (adenomas) is crucial to reduce mortality rates. The fecal immunochemical test (FIT) is a non-invasive CRC screening test that detects the blood-derived protein hemoglobin. However, FIT sensitivity is suboptimal especially in detection of CRC precursor lesions. As adenoma-to-carcinoma progression is accompanied by alternative splicing, tumor-specific proteins derived from alternatively spliced RNA transcripts might serve as candidate biomarkers for CRC detection.
An important need in analyzing complex genomes is the ability to separate and phase haplotypes. While whole genome assembly can deliver this information, it cannot reveal whether there is allele-specific gene or isoform expression. The PacBio Iso-Seq method, which can produce high-quality transcript sequences of 10 kb and longer, has been used to annotate many important plant and animal genomes. We present an algorithm called IsoPhase that post-processes Iso-Seq data for transcript-based haplotyping. We applied IsoPhase to a maize Iso-Seq dataset consisting of two homozygous parents and two F1 cross hybrids. We validated the majority of the SNPs called with IsoPhase against matching short read data and identified cases of allele-specific, gene-level and isoform-level expression.
Streamlined SMRTbell library generation using an addition-only, single-tube strategy for all library types reduces time to results
We have streamlined the SMRTbell library generation protocols with improved workflows to deliver seamless end-to-end solutions from sample to analysis. A key improvement is the development of a single-tube reaction strategy that shortened the hands-on time needed to generate each SMRTbell library, reduced time-consuming AMPure purification steps, and minimized sample-handling-induced gDNA damage to improve the integrity of long-insert SMRTbell templates for sequencing. The improved protocols support all large-insert genomic libraries, multiplexed microbial genomes, and amplicon sequencing. These advances enable completion of library preparation in less than a day (approximately 4 hours) and open opportunities for automated library preparation for large-scale projects. Here we share data summarizing performance of the new SMRTbell Express Template Kit 2.0 representing our solutions for 10 kb and >50 kb large-insert genomic libraries, complete microbial genome assemblies, and high-throughput amplicon sequencing. The improved throughput of the Sequel System, with read lengths up to 30 kb and high consensus accuracy (>99.999%), makes sequencing with high-quality results increasingly accessible to the community.
Background: The sequencing and haplotype phasing of entire gene sequences improves the understanding of the genetic basis of disease and drug response. One example is cystic fibrosis (CF). Cystic fibrosis transmembrane conductance regulator (CFTR) modulator therapies have revolutionized CF treatment, but only in a minority of CF subjects. Observed heterogeneity in CFTR modulator efficacy is related to the range of CFTR mutations; revertant mutations can modify the response to CFTR modulators, and other intronic variations in the ~200 kb CFTR gene have been linked to disease severity. Heterogeneity in the CFTR gene may also be linked to differential responses to CFTR modulators. The Targeted Locus Amplification (TLA) technology from Cergentis can be used to selectively amplify, sequence and phase the entire CFTR gene. With PacBio long-read SMRT Sequencing, TLA amplicons are sequenced intact and long-range phasing information of all fragments in entire amplicons is retrieved. Experimental Design and Methods: The TLA process produces amplicons consisting of 5-10 proximity ligated DNA fragments. TLA was performed on cell line and genomic DNA from Coriell GM12878, which has few heterozygous SNVs in CFTR, and the IB3 cell line, with known haplotypes but heterozygous for the delta508 mutation. All sample types were prepared with high and low density TLA primer sets, targeting coverage of >100 kb of the CFTR gene. Conclusion: We have demonstrated the power and utility of TLA with long-read SMRT Sequencing as a valuable research tool in sequencing and phasing across very long regions of the human genome. This process can be done in an efficient manner, multiplexing multiple genes and samples per SMRT Cell in a process amenable to high-throughput sequencing.
Amplification-free targeted enrichment powered by CRISPR-Cas9 and long-read Single Molecule Real-Time (SMRT) Sequencing can efficiently and accurately sequence challenging repeat expansion disorders
Genomic regions with extreme base composition bias and repetitive sequences have long proven challenging for targeted enrichment methods, as they rely upon some form of amplification. Similarly, most DNA sequencing technologies struggle to faithfully sequence regions of low complexity. This has been especially trying for repeat expansion disorders such as Fragile-X disease, Huntington disease and various ataxias, where the repetitive elements range from several hundreds of bases to tens of kilobases. We have developed a robust, amplification-free targeted enrichment technique, called No-Amp Targeted Sequencing, that employs the CRISPR-Cas9 system. In conjunction with SMRT Sequencing, which delivers long reads spanning the entire repeat expansion, high consensus accuracy, and uniform coverage, these previously inaccessible regions are now accessible. This method is completely amplification-free, therefore removing any PCR errors and biases from the experiment. Furthermore, this technique also preserves native DNA molecules, allowing for direct detection and characterization of epigenetic signatures. The No-Amp method is a two-day protocol that is compatible with multiplexing of multiple targets and multiple samples in a single reaction, using as little as 1 µg of genomic DNA input per sample. We have successfully targeted a number of repeat expansion disorder loci, including HTT, FMR1, and C9orf72, and have built an ataxia panel consisting of 15 different disease-causing repeat expansion regions. Using the No-Amp method we have isolated hundreds of individual on-target molecules, allowing for reliable repeat size estimation, mosaicism detection and identification of interruption sequences in alleles longer than 2700 repeat units (>13 kb). In addition to multiplexing several targets, we have also multiplexed at least 20 samples in one experiment, making the No-Amp Targeted Sequencing method a cost-effective option.
Combining the CRISPR-Cas9 enrichment method with Single Molecule, Real-Time Sequencing provided us with base-level resolution of previously inaccessible regions of the genome, like disease-causing repeat expansions. No-Amp Targeted Sequencing captures, in one experiment, many aspects of repeat expansion disorders which are important for better understanding the underlying disease mechanisms.
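To make the idea of per-molecule repeat size estimation concrete, the sketch below counts the longest uninterrupted run of a repeat unit in each read. It is a toy under stated assumptions: exact string matching with no tolerance for sequencing errors or interruption sequences, a hypothetical helper named `longest_repeat_run`, and invented reads. It does not represent the No-Amp analysis software.

```python
def longest_repeat_run(read, unit):
    """Largest number of back-to-back exact copies of `unit` found in `read`."""
    k = len(unit)
    if k == 0:
        raise ValueError("repeat unit must not be empty")
    # runs[i] = number of consecutive copies of `unit` starting at position i.
    runs = [0] * (len(read) + k)
    best = 0
    # Scan right to left so runs[i + k] is already known when we reach i.
    for i in range(len(read) - k, -1, -1):
        if read[i:i + k] == unit:
            runs[i] = 1 + runs[i + k]
            best = max(best, runs[i])
    return best


# Invented reads: short flanks around CAG tracts of different sizes,
# standing in for molecules from alleles of different lengths.
reads = [
    "TTG" + "CAG" * 20 + "CCA",
    "TTG" + "CAG" * 21 + "CCA",
    "TTG" + "CAG" * 45 + "CCA",
]
sizes = [longest_repeat_run(r, "CAG") for r in reads]
print(sizes)
```

Collecting one size per molecule, rather than a single consensus, is what allows a spread of sizes within one sample to be seen, which is the kind of signal a mosaicism analysis would look for.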
AGBT Virtual Poster: Using the PacBio Iso-Seq method to search for novel colorectal cancer biomarkers
Early detection of colorectal cancer (CRC) and its precursor lesions (adenomas) is crucial to reduce mortality rates. The fecal immunochemical test (FIT) is a non-invasive CRC screening test that detects…
In this PacBio User Group Meeting presentation, PacBio scientist Kristin Mars speaks about recent updates, such as the single-day library prep that’s now possible with the Iso-Seq Express workflow. She…
In this presentation, Emily Hatas of PacBio offers a look at how SMRT Sequencing has changed over the years as well as the most common applications in human genome analysis:…
Video Poster: A new approach to Thalassemia and Ataxia carrier screening panels using CRISPR-Cas9 enrichment and long-read sequencing
Although PCR is a cost-effective way to enrich for genomic regions of interest for DNA sequencing, amplifying regions with extreme GC-content and long stretches of short tandem repeat (STR) sequences…