With the introduction of P6-C4 chemistry, PacBio has made significant strides with Single Molecule, Real-Time (SMRT) Sequencing. Read lengths averaging between 10 and 15 kb can now be achieved, with the longest reads in the distribution exceeding 60 kb. The chemistry attains a consensus accuracy of 99.999% (QV50) at 30x coverage, which, coupled with increased throughput from the PacBio RS II platform (500 Mb – 1 Gb per SMRT Cell), makes larger genome projects more tractable. These combined advances deliver results that rival the quality of Sanger “clone-by-clone” sequencing efforts, resulting in closed microbial genomes and highly contiguous de novo assemblies of complex eukaryotes on a multi-Gbase scale using SMRT Sequencing as the standalone technology. We present here guidelines and best practices to achieve optimal results when employing PacBio-only whole genome shotgun sequencing strategies. Specific sequencing examples for plant and animal genomes are discussed, along with SMRTbell library preparation and purification methods for obtaining long-insert libraries to generate optimal sequencing results. The benefits of long reads are demonstrated by highly contiguous assemblies yielding contig N50s of over 5 Mb, compared to similar assemblies using next-generation short-read approaches. Finally, guidelines will be presented for planning projects for the de novo assembly of large genomes.
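The relationship between the quoted accuracy and QV figures, and the project-planning arithmetic implied by the per-cell throughput, can be sketched as follows. This is a minimal illustration, not a PacBio tool: the function names are hypothetical, the QV conversion is the standard Phred formula, and the 3 Gb genome size in the usage example is an assumed value chosen only for illustration.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a fractional accuracy (e.g. 0.99999) to a Phred-scaled QV."""
    return -10.0 * math.log10(1.0 - accuracy)


def smrt_cells_needed(genome_size_bp: int, coverage: int, yield_per_cell_bp: int) -> int:
    """Smallest whole number of SMRT Cells whose combined yield reaches the target coverage.

    Assumes every cell delivers exactly yield_per_cell_bp, which is a simplification.
    """
    total_bases = genome_size_bp * coverage
    return -(-total_bases // yield_per_cell_bp)  # ceiling division on integers


if __name__ == "__main__":
    genome = 3_000_000_000  # hypothetical 3 Gb genome, not taken from the abstract
    for per_cell in (500_000_000, 1_000_000_000):  # the 500 Mb to 1 Gb range quoted above
        print(per_cell, smrt_cells_needed(genome, 30, per_cell))
    print(round(accuracy_to_qv(0.99999), 3))
```

Under these assumptions, the per-cell yield range translates directly into a twofold range in the number of cells to budget for a given coverage target.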
A large number of distinct HIV-1 genomes can be present in a single clinical sample from a patient chronically infected with HIV-1. We examined samples containing complex mixtures of near-full-length HIV-1 genomes. Single molecules were sequenced as near-full-length (9.6 kb) amplicons directly from PCR products without shearing. Mathematical analysis techniques deconvolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. For mixtures where the truth was known exactly from independent sequencing methods, we correctly estimated the originating genomes to single-base resolution, along with their relative abundances. Correct estimates were made even when genomes diverged by a single base. Minor abundances of 5% were reliably detected. SMRT Sequencing data contained near-full-length continuous reads for each sample, including some runs with more than 10,000 near-full-length-genome reads in a three-hour collection time. SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. The single-molecule, full-length nature of the sequencing method allows us to estimate variant subspecies and relative abundances even from samples containing complex mixtures of genomes that differ by single bases. These results open the possibility of cost-effective full-genome sequencing of HIV-1 in mixed populations for applications such as incorporated-HIV-1 screening. In screening, genomes can differ by one to many thousands of bases, and the ability to measure them can help scientifically inform treatment strategies.
Multiplexing human HLA class I & II genotyping with DNA barcode adapters for high throughput research.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DP and -DQ, play a critical role in the immune system as major factors responsible for organ transplant rejection. They have a direct or linkage-based association with several diseases, including cancer and autoimmune diseases, and are important targets for clinical and drug sensitivity research. HLA genes are also highly polymorphic, and their diversity originates from exonic combinations as well as recombination events. A large number of new alleles are expected to be encountered if these genes are sequenced through the UTRs. Thus allele-level resolution is strongly preferred when sequencing HLA genes. Pacific Biosciences has developed a method to sequence the HLA genes in their entirety within the span of a single read, taking advantage of the long read lengths (average >10 kb) facilitated by SMRT technology. A highly accurate consensus sequence (99.999% or QV50 demonstrated) is generated for each allele in a de novo fashion by our SMRT Analysis software. In the present work, we have combined this imputation-free, fully phased, allele-specific consensus sequence generation workflow with a newly developed DNA-barcode-tagged SMRTbell sample preparation approach to multiplex 96 individual samples for sequencing all of the HLA class I and II genes. Full-length HLA class I genes and relevant exons of class II genes were amplified with commercially available NGS-go reagents for high-resolution HLA sequencing. The 96 samples included 72 that are part of the UCLA reference panel and had pre-typing information available for 2 fields, based on gold standard SBT methods. SMRTbell adapters with 16 bp barcode tags were ligated to long amplicons in symmetric pairing. PacBio sequencing was highly effective in generating accurate, phased sequences of full-length alleles of HLA genes.
In this work we demonstrate the scalability of HLA sequencing using off-the-shelf assays for research applications to find biological significance in full-length sequencing.
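The symmetric barcode pairing described above can be illustrated with a small demultiplexing sketch. This is not the SMRT Analysis barcoding workflow: it assumes exact matches only (no tolerance for sequencing errors), the function names are hypothetical, and the two 16 bp barcode sequences are invented for the example.

```python
from typing import Dict, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_sample(read: str, barcodes: Dict[str, str]) -> Optional[str]:
    """Return the sample whose barcode flanks the read symmetrically, else None.

    With symmetric pairing the same barcode sits on both ends of the insert, so a
    read looks like barcode + insert + revcomp(barcode). The reverse-complemented
    read has the same outer structure, so one check covers both strands.
    """
    for sample, barcode in barcodes.items():
        if read.startswith(barcode) and read.endswith(revcomp(barcode)):
            return sample
    return None


# Hypothetical 16 bp barcodes, invented for illustration only.
BARCODES = {"sample_1": "ACGTACGTACGTACGA", "sample_2": "TTGCATTGCATTGCAA"}

if __name__ == "__main__":
    insert = "GGGCCCATAT"
    read = BARCODES["sample_1"] + insert + revcomp(BARCODES["sample_1"])
    print(assign_sample(read, BARCODES))
```

A read carrying two different barcodes at its ends is left unassigned here, which is one simple way to discard mis-ligated or chimeric molecules.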
In 2012, NIST convened the Genome in a Bottle Consortium to develop the metrology infrastructure needed to enable confidence in human whole genome variant calls.
Whole genome sequencing can provide comprehensive information important for determining the biochemical and genetic nature of all elements inside a genome. The high-quality genome references produced from past genome projects and advances in short-read sequencing technologies have enabled quick and cheap analysis for simple variants. However, even with the focus on genome-wide resequencing for SNPs, the heritability of more than 50% of human diseases remains elusive. For non-human organisms, high-contiguity references are deficient, limiting the analysis of genomic features. The long and unbiased reads from Single Molecule, Real-Time (SMRT) Sequencing and new de novo assembly approaches have demonstrated the ability to detect more complicated variants and achieve chromosome-level phasing. Moreover, with recent advances in bioinformatics algorithms and tools, the computational tasks for completing high-quality de novo assembly of large genomes have become feasible with commodity hardware. Ongoing development in sequencing technologies and bioinformatics will likely lead to routine generation of high-quality reference assemblies in the future. We discuss the current state of the art and the challenges in bioinformatics toward such a goal. More specifically, explicit examples of pragmatic computational requirements for assembling mammalian-size genomes and algorithms suitable for processing diploid genomes are discussed.
AGBT 2015 Workshop Presentation Slides: Dick McCombie from Cold Spring Harbor Laboratory (CSHL) described the use of SMRT Sequencing to analyze a breast cancer cell line with complex genomic events. Still ongoing, the project has already uncovered structural variants missed by other sequencers.
The comprehensive characterization of cancer genomes and epigenomes for understanding drug resistance remains an important challenge in the field of oncology. For example, PC-9, a non-small cell lung cancer (NSCLC) cell line, contains a deletion mutation in exon 19 (DelE746A750) of EGFR that renders it sensitive to erlotinib, an EGFR inhibitor. However, sustained treatment of these cells with erlotinib leads to drug-tolerant cell populations that grow in the presence of erlotinib. These resistant cells can nonetheless be resensitized to erlotinib upon treatment with methyltransferase inhibitors, suggesting a role of epigenetic modification in the development of drug resistance. We have characterized for the first time the cancer genomes of both drug-sensitive and drug-resistant PC-9 cells using long-read PacBio sequencing. The PacBio data allowed us to generate a high-quality, de novo assembly of this cancer genome, enabling the detection of genomic variation at all size scales, including SNPs, structural variations, copy number alterations, gene fusions, and translocations. The data simultaneously provide a global view of epigenetic DNA modifications such as methylation. We will present findings on large-scale changes in the methylation status across the cancer genome as a function of drug sensitivity.
Complete resequencing of extended genomic regions using fosmid target capture and single molecule real-time (SMRT) long read sequencing technology.
A longstanding goal of genomic analysis is the identification of causal genetic factors contributing to disease. While the common disease/common variant hypothesis has been tested in many genome-wide association studies, few advancements in identifying causal variation have been realized, and instead recent findings point away from common variants towards aggregate rare variants as causal. A challenge is obtaining complete phased genomic sequences over extended genomic regions from sufficient numbers of cases and controls to identify all potential variation causal of a disease. To address this, we modified methods for targeted DNA isolation using fosmid technology and single-molecule, long-sequence-read generation that combine for complete, haplotype-resolved resequencing across extended genomic subregions. As proof of principle, we validated the approach by resequencing four 800 kbp segments that span a major histocompatibility complex (MHC) common extended haplotype (CEH) associated with disease. The data revealed the extent of conservation, exposing a near identity among four DR4 CEHs over conserved regions, detailing rare variation and measuring sequence accuracy. In a second test, we sequenced the complete KIR haplotypes from 8 individuals within a specific timeframe and cost. Single molecule long-read sequencing technology generated contiguous full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with very little data processing. All of the sequences produced from these projects were contiguous and phased, with accuracy above 99.99%. The results demonstrated that cost-effective scale-up is possible to generate scores to hundreds of phased chromosomal sequences of extended lengths that can encompass genomic regions associated with disease.
Un-zipping diploid genomes – revealing all kinds of heterozygous variants from comprehensive haplotig assemblies
Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy-aware assembler, FALCON (https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from SMRT Sequencing data. We apply the algorithms and the prototype software to (1) a highly repetitive diploid fungal genome (Clavicorona pyxidata) and (2) an F1 hybrid from two inbred Arabidopsis strains: CVI-0 and COL-0. For the fungal genome, we achieved an N50 of 1.53 Mb (of the 1n assembly contigs) for the ~42 Mb 1n genome and a haplotig N50 of 872 kb from a 95X dataset with a read length N50 of ~16 kb. We found that ~45% of the genome was highly heterozygous and ~55% of the genome was highly homozygous. We developed methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from the Illumina platform. For the Arabidopsis F1 hybrid genome, we found that 80% of the genome could be separated into haplotigs. The long-range accuracy of phasing haplotigs was evaluated by comparing them to the assemblies from the two inbred parental lines. We show that a more complete view of all haplotypes could provide useful biological insights through improved annotation, characterization of heterozygous variants of all sizes, and resolution of differential allele expression. Finally, we applied this method to WGS human data sets to demonstrate the potential for resolving complicated, medically-relevant genomic regions.
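The N50 statistic used above for contigs, haplotigs, and read lengths can be computed as in the following minimal sketch. It uses the common definition (the length L such that sequences of length L or longer contain at least half of the total bases); the function name and the toy lengths are illustrative and are not taken from FALCON.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until the
    running total reaches half of the summed length; the length at which that
    happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # toy values, not real assembly data
    print(n50(toy_contigs))
```

The same function applies whether the input is a list of contig lengths, haplotig lengths, or read lengths.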
Profiling the microbiome in fecal microbiota transplantation using circular consensus and Single Molecule, Real-Time Sequencing
There are many sequencing-based approaches to understanding complex metagenomic communities spanning targeted amplification to whole-sample shotgun sequencing. While targeted approaches provide valuable data at low sequencing depth, they are limited by primer design and PCR. Whole-sample shotgun experiments generally use short-read sequencing, which results in data processing difficulties. For example, reads less than 500bp in length will rarely cover a complete gene or region of interest, and will require assembly. This not only introduces the possibility of incorrectly combining sequence from different community members, it requires a high depth of coverage. As such, rare community members may not be represented in the resulting assembly. Circular-consensus, single molecule, real-time (SMRT®) Sequencing reads in the 1-3kb range, with >99% accuracy can be efficiently generated for low amounts of input DNA. 10 ng of input DNA sequenced in 4 SMRT Cells on the PacBio RS II would generate >100,000 such reads. While throughput is lower compared to short-read sequencing methods, the reads are a true random sampling of the underlying community since SMRT Sequencing has been shown to have very low sequence-context bias. With reads >1 kb at >99% accuracy it is reasonable to expect a high percentage of reads include gene fragments useful for analysis without the need for de novo assembly. Here we present the results of circular consensus sequencing for an individual’s microbiome, before and after undergoing fecal microbiota transplantation (FMT) in order to treat a chronic Clostridium difficile infection. We show that even with relatively low sequencing depth, the long-read, assembly-free, random sampling allows us to profile low abundance community members at the species level. 
We also show that using shotgun sampling with long reads allows a level of functional insight not possible with classic targeted 16S, or short read sequencing, due to entire genes being covered in single reads.
Microbial genome sequencing can be done quickly, easily, and efficiently with the PacBio sequencing instruments, resulting in complete de novo assemblies. Alternative protocols have been developed to reduce the amount of purified DNA required for SMRT Sequencing, to broaden applicability to lower-abundance samples. If 50-100 ng of microbial DNA is available, a 10-20 kb SMRTbell library can be made. The resulting library can be loaded onto multiple SMRT Cells, yielding more than enough data for complete assembly of microbial genomes using the SMRT Portal assembly program HGAP, plus base modification analysis. The entire process can be done in less than 3 days by standard laboratory personnel. This approach is particularly important for analysis of metagenomic communities, in which genomic DNA is often limited. From these samples, full-length 16S amplicons can be generated, prepped with the standard SMRTbell library prep protocol, and sequenced. Alternatively, a 2 kb sheared library, made from a few ng of input DNA, can also be used to elucidate the microbial composition of a community, and may provide information about biochemical pathways present in the sample. In both these cases, 1-2 kb reads with >99.9% accuracy can be obtained from Circular Consensus Sequencing.
The constituents and intra-communal interactions of microbial populations have garnered increasing interest in areas such as water remediation, agriculture and human health. One popular, efficient method of profiling communities is to amplify and sequence the evolutionarily conserved 16S rRNA sequence. Currently, most targeted amplification focuses on short, hypervariable regions of the 16S sequence. Distinguishing information not spanned by the targeted region is lost and species-level classification is often not possible. SMRT Sequencing easily spans the entire 1.5 kb 16S gene, and in combination with highly-accurate single-molecule sequences, can improve the identification of individual species in a metapopulation. However, when amplifying a mixture of sequences with close similarities, the products may contain chimeras, or recombinant molecules, at rates as high as 20-30%. These PCR artifacts make it difficult to identify novel species, and reduce the amount of productive sequences. We investigated multiple factors that have been hypothesized to contribute to chimera formation, such as template damage, denaturing time before and during cycling, polymerase extension time, and reaction volume. Of the factors tested, we found two major related contributors to chimera formation: the amount of input template into the PCR reaction and the number of PCR cycles. Sequence errors generated during amplification and sequencing can also confound the analysis of complex populations. Circular Consensus Sequencing (CCS) can generate single-molecule reads with >99% accuracy, and the SMRT Analysis software provides filtering of these reads to >99.99% accuracies. Remaining substitution errors in these highly-filtered reads are likely dominated by mis-incorporations during amplification. Therefore, we compared the impact of several commercially-available high-fidelity PCR kits with full-length 16S amplification. 
We show results of our experiments and describe an optimized protocol for full-length 16S amplification for SMRT Sequencing. These optimizations have broader implications for other applications that use PCR amplification to phase variations across targeted regions and to generate highly accurate reference sequences.
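Filtering consensus reads by accuracy, as mentioned above, can be sketched in a few lines. This is an assumption-laden illustration rather than the SMRT Analysis filter: it assumes each read carries per-base Phred-scaled quality values and defines predicted accuracy as one minus the mean per-base error probability. The function names and the toy reads are hypothetical.

```python
from typing import Dict, List, Sequence


def predicted_accuracy(phred_quals: Sequence[float]) -> float:
    """Predicted read accuracy: 1 minus the mean per-base error probability."""
    if not phred_quals:
        raise ValueError("read has no quality values")
    mean_error = sum(10 ** (-q / 10) for q in phred_quals) / len(phred_quals)
    return 1.0 - mean_error


def filter_reads(reads: Dict[str, Sequence[float]], min_accuracy: float) -> List[str]:
    """Names of reads whose predicted accuracy is at least min_accuracy."""
    return [
        name
        for name, quals in reads.items()
        if predicted_accuracy(quals) >= min_accuracy
    ]


if __name__ == "__main__":
    toy_reads = {"read_a": [50, 50, 50], "read_b": [20, 20, 20]}  # toy QVs
    print(filter_reads(toy_reads, min_accuracy=0.9999))
```

Raising the threshold trades yield for accuracy, so the fraction of reads retained should be reported alongside any filtered result.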
The sensitivity, speed, and reduced cost associated with Next-Generation Sequencing (NGS) technologies have made them indispensable for the molecular profiling of cancer samples. For effective use, it is critical that the NGS methods used are not only robust but can also accurately detect low-frequency somatic mutations. Single Molecule, Real-Time (SMRT) Sequencing offers several advantages, including the ability to sequence single molecules with very high accuracy (>QV40) using the circular consensus sequencing (CCS) approach. The availability of genetically defined, human genomic reference standards provides an industry standard for the development and quality control of molecular assays. Here we characterize SMRT Sequencing for the detection of low-frequency somatic variants using the Quantitative Multiplex DNA Reference Standard from Horizon Diagnostics, combined with amplification of the variants using the Multiplicom Tumor Hotspot MASTR Plus assay. The Horizon Diagnostics reference sample contains precise allelic frequencies from 1% to 24.5% for major oncology targets, verified using digital PCR. It recapitulates the complexity of tumor composition and serves as a well-characterized control. The control sample was amplified using the Multiplicom Tumor Hotspot MASTR Plus assay, which targets 252 amplicons (121-254 bp) from 26 relevant cancer genes and includes all 11 variants in the control sample. The amplicons were sequenced and analyzed using SMRT Sequencing to identify the variants and determine the observed frequency. The random error profile and high-accuracy CCS reads make it possible to accurately detect low-frequency somatic variants.
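Why a random error profile helps with low-frequency variants can be shown with a simple binomial calculation. The sketch below is illustrative and is not the analysis used in this work: it assumes errors at a site are independent across reads, treats the per-read error rate as a single number (1e-4, corresponding to QV40, in the usage example), and uses hypothetical function names and read counts.

```python
import math


def observed_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that carry the variant base."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def prob_at_least(k: int, n: int, error_rate: float) -> float:
    """P(X >= k) for X ~ Binomial(n, error_rate).

    Interpreted here as the chance that sequencing error alone produces k or
    more variant-supporting reads out of n. Terms are computed in log space.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    log_err = math.log(error_rate)
    log_ok = math.log1p(-error_rate)
    total = 0.0
    for i in range(max(k, 0), n + 1):
        log_choose = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        total += math.exp(log_choose + i * log_err + (n - i) * log_ok)
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical site: 10 variant reads out of 1000, i.e. a 1% observed frequency.
    print(observed_frequency(10, 1000))
    print(prob_at_least(10, 1000, 1e-4))
```

Under these assumptions, a 1% variant supported by 10 of 1000 reads is extremely unlikely to arise from random error alone. Systematic, context-dependent errors would break the independence assumption and weaken this conclusion.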
Long-read assembly of the Aedes aegypti Aag2 cell line genome resolves ancient endogenous viral elements
Transmission of arboviruses such as Dengue and Zika viruses by Aedes aegypti causes widespread and debilitating disease across the globe. Disease in humans can include severe acute symptoms such as hemorrhagic fever, organ failure, and encephalitis; and yet, mosquitoes tolerate high titers of virus in a persistent infection. The mechanisms responsible for tolerance to viral infection in mosquitoes are still unclear. Recent publications have highlighted the integration of genetic material from non-retroviral RNA viruses into the genome of the host during infection, which relies upon endogenous retro-transcriptase activity from transposons. These endogenous viral elements (EVEs) found in the genome are predicted to be ancient, and at least some EVEs are under purifying selection, which suggests that they are beneficial to the host. In order to characterize EVE biogenesis in a tractable system, we sequenced the Ae. aegypti cell line, Aag2, to 58X coverage and here present a de novo assembly of the genome. The assembly consists of 1.7 Gb of genomic and 255 Mb of alternative haplotype-specific sequence, made up of contigs with an N50 of 1.4 Mb, a value that is 1-3 orders of magnitude longer than those of other assemblies of the Aedes genus. The Aag2 genome is highly repetitive (70%), most of which is classified as transposable elements (60%). We identify a plethora of EVEs in the genome homologous to a diverse range of extant viruses, many of which cluster in these regions of highly repetitive DNA. The highly contiguous nature of this assembly allows for a more comprehensive identification of the transposable elements and EVEs that are most likely to be lost in assemblies lacking the read length of SMRT Sequencing.
An improved circular consensus algorithm with an application to detection of HIV-1 Drug-Resistance Associated Mutations (DRAMs)
Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger Sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm, dubbed CCS2, for generating circular consensus sequences (CCS) from individual SMRT Sequencing reads. With CCS2, it is possible to reach arbitrarily high quality across longer insert lengths at a lower cost and higher throughput than Sanger Sequencing. We apply CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation, which is both clinically important and challenging due to regional sequence context. A mutation was introduced into the 3rd position of the codon for amino acid 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1,300 bp were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Sequencing data were analyzed using the new CCS2 algorithm, which uses a fully-generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to arbitrarily high accuracy. This result, previously demonstrated for multi-molecule consensus sequences with the Quiver algorithm, is made possible by incorporating per-Zero Mode Waveguide (ZMW) characteristics, thus accounting for the intrinsic changes in the sequencing process that are unique to each ZMW. With CCS2, we are able to achieve a per-read empirical quality of QV30 with 19X coverage. This yields ~5000 1.3 kb consensus sequences with a collective empirical quality of ~QV40. Additionally, we demonstrate a 0% miscall rate in both unmixed samples, and estimate a 48:52% frequency for the K103N mutation in the mixed sample, consistent with data produced by orthogonal platforms.
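Estimating a mixture ratio from single-molecule consensus sequences reduces to counting bases at the site of interest. The sketch below is a minimal illustration and not part of CCS2: it assumes the consensus sequences are already aligned to a common coordinate system, the function name is hypothetical, and the three-base toy sequences stand in for the codon only (wild-type ending in A, mutant ending in C, matching the A>C substitution described above).

```python
from collections import Counter
from typing import Dict, Sequence


def allele_fractions(sequences: Sequence[str], position: int) -> Dict[str, float]:
    """Fraction of sequences carrying each base at a 0-based aligned position.

    Sequences too short to cover the position are ignored.
    """
    counts = Counter(seq[position] for seq in sequences if len(seq) > position)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequence covers the requested position")
    return {base: count / total for base, count in counts.items()}


if __name__ == "__main__":
    # Toy data: 12 wild-type-like and 13 mutant-like consensus sequences.
    toy_sequences = ["AAA"] * 12 + ["AAC"] * 13
    print(allele_fractions(toy_sequences, position=2))
```

Because each consensus sequence derives from one molecule, the counts are direct observations of individual templates, so the fractions need no deconvolution step in this simplified setting.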