In today’s clinical diagnostic laboratories, disease-causing mutations are detected either by genotyping or by Sanger sequencing. Whether done singly or in a multiplex assay, genotyping works only if the exact molecular change is known. Sanger sequencing is the gold-standard method that captures both known and novel molecular changes in the disease gene of interest. Most clinical Sanger sequencing assays involve PCR-amplifying the coding sequences of the disease target gene, followed by bi-directional sequencing of the amplified products. Therefore, for every patient sample, one generates multiple amplicons singly, and each amplicon leads to two separate sequencing reactions. Single Molecule, Real-Time (SMRT) sequencing offers several advantages over Sanger sequencing, including long read lengths, first-in-first-out processing, fast time to result, high levels of multiplexing and substantially reduced costs. For our first proof-of-concept experiment, we queried 3 known disease-associated mutations in de-identified clinical samples. We started with 3 autosomal recessive diseases found at an increased frequency in the Ashkenazi Jewish population: Tay-Sachs disease, Niemann-Pick disease and Canavan disease. The mutated gene in Tay-Sachs is HEXA, in Niemann-Pick is SMPD1 and in Canavan is ASPA. Coding exons were amplified in multiple (6-13) amplicons for each gene from both non-carriers and carriers. Amplicons were purified, concentrations normalized, and combined prior to SMRTbell™ library prep. A single SMRTbell library was sequenced for each gene from each patient using standard Pacific Biosciences C2 chemistry and protocols. Average read lengths of 4,000 bp across samples allowed for high-quality Circular Consensus Sequences (CCS) across all amplicons (all less than 1 kb). This high-quality CCS data permitted the clean partitioning of reads from a patient in the presence of heterozygous events. 
Using non-carrier sequencing as a control, we were able to correctly identify the known events in carrier genes. This suggests the potential utility of SMRT sequencing in a clinical setting, enabling a cost-effective method of replacing targeted mutation detection with sequencing of the entire gene.
The newer hierarchical genome assembly process (HGAP) performs de novo assembly using data from a single PacBio long insert library. To assess the benefits of this method, DNA from several Salmonella enterica serovars was isolated from pure cultures. Genome sequencing was performed using Pacific Biosciences RS sequencing technology. The HGAP process enabled us to close sixteen Salmonella enterica subsp. enterica genomes and their associated mobile elements. The ten serotypes include Salmonella enterica subsp. enterica serovar Enteritidis (S. Enteritidis), S. Bareilly, S. Heidelberg, S. Cubana, S. Javiana, S. Typhimurium, S. Newport, S. Montevideo, S. Agona, and S. Tennessee. In addition, we were able to detect novel methyltransferases (MTases) by using the Pacific Biosciences kinetic score distributions, showing that each serovar appears to have a novel methylation pattern. For example, while all Salmonella serovars examined so far have methylase-specific activity for 5’-GATC-3’/3’-CTAG-5’ and 5’-CAGAG-3’/3’-GTCTC-5’ (underlined base indicates a modification), S. Heidelberg is uniquely specific for 5’-ACCANCC-3’/3’-TGGTNGG-5’, while S. Typhimurium is uniquely specific for 5’-GATCAG-3’/3’-CTAGTC-5’ sites, for the samples examined so far. We believe that this may be due to the unique environments and phages that these serotypes have been exposed to. Furthermore, our analysis identified and closed a variety of plasmids, such as mobilization plasmids, antimicrobial resistance plasmids and IncX plasmids carrying a Type IV secretion system (T4SS). The VirB/D4 T4SS apparatus is important in that it assists with rapid dissemination of antibiotic resistance and virulence determinants. Presently, only limited information exists regarding the genotypic characterization of drug resistance in S. Heidelberg isolates derived from various host species. Here, we characterize two S. Heidelberg isolates from two different outbreaks. 
Both isolates contain an IncX plasmid of approximately 35 kb that carries the genes virB1, virB2, virB3/4, virB5, virB6, virB7, virB8, virB9, virB10, virB11, virD2, and virD4, which are associated with the T4SS. In addition, the outbreak isolate associated with ground turkey carries a 4,473 bp mobilization plasmid and an incompatibility group (Inc) I1 antimicrobial resistance plasmid encoding resistance to gentamicin (aacC2), beta-lactam (bl2b_tem), streptomycin (aadAI) and tetracycline (tetA, tetR), while the outbreak isolate associated with chicken breast carries an IncI1 plasmid encoding resistance to gentamicin (aacC2), streptomycin (aadAI) and sulfisoxazole (sul1). Using this new technology, we explored the genetic elements present in resistant pathogens, which will help achieve a better understanding of the evolution of Salmonella.
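The double-stranded motif notation used above (for example 5’-ACCANCC-3’/3’-TGGTNGG-5’) can be checked mechanically: the bottom strand, read 3’ to 5’, is the base-by-base complement of the top strand, and N matches any base. The following Python sketch illustrates this and locates motif occurrences in a sequence. The function names and the test sequence are hypothetical illustrations, not part of the study's analysis pipeline, which relied on kinetic score distributions.

```python
import re

# Base-by-base complement table; N (any base) pairs with N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(motif):
    """Return the complementary strand written 3'->5' (not reversed)."""
    return motif.upper().translate(COMPLEMENT)


def find_sites(sequence, motif):
    """Return 0-based start positions of a motif, treating N as any base.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    body = "".join("[ACGT]" if base == "N" else base for base in motif.upper())
    pattern = re.compile("(?=" + body + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Motifs quoted in the abstract, top strand only.
for top in ("GATC", "CAGAG", "ACCANCC", "GATCAG"):
    print(f"5'-{top}-3' / 3'-{complement(top)}-5'")

# Hypothetical toy sequence containing two ACCANCC sites.
toy = "TTACCATCCGGACCAGCCAA"
print(find_sites(toy, "ACCANCC"))
```

This only finds where a motif could occur; whether a given site is actually methylated has to come from the sequencing kinetics.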
Background: HIV-1 proviruses in peripheral blood mononuclear cells (PBMCs) are thought to be an important reservoir of HIV-1 infection. Given that this pool represents an archival library, it can be used to study virus evolution and CD4+ T cell survival. Accurate study of this pool is hampered by the difficulty of sequencing a full-length proviral genome, which is typically accomplished by assembling overlapping pieces and imputing the full genome. Methodology: Cryopreserved PBMCs collected from a total of 8 HIV+ patients from 1997-2001 were used for genomic DNA extraction. Patients had been receiving cART for 2-8 years at the time samples were obtained. 7 patients had pVL >50 copies/mL (mean: 312,282, range: 18,372-683,400) and 1 had pVL <50. Genomic DNA was subjected to limiting dilution prior to amplification of near-full-length genomes by a newly developed nested PCR. The predicted size of the PCR product was 9.0 kb, spanning from the 5’ LTR through the 3’ LTR. Single molecules were sequenced as near-full-length amplicons directly from PCR products without shearing, using commercially available P4-C2 reagents and standard protocols on a PacBio RS II instrument. Quality of the genomes was validated by clonal positive controls and synthetic mixtures. Results: Near-full-length provirus genome sequences were successfully obtained from all 8 patients as continuous long reads from single molecules. PacBio sequencing required approximately 10% of the PCR product needed for Sanger sequencing and generated 325 MB per 3-hour run, including 1,800 full-length intact genome reads on average. One patient’s sample was not at a limiting dilution, and analysis revealed multiple subspecies. For 8 near-full-length provirus genomes derived from the other 7 patients, large internal deletions were noted in 2 proviruses; APOBEC-mediated hypermutations were seen in 2 proviruses; and 4 proviruses appeared to be intact genomes. 
All of the defective proviruses showed a complete absence of resistance mutations in either RT or protease, even after 2-8 years of cART. In contrast, all of the intact proviruses contained evidence of ART-resistance-associated mutations, suggesting that they represented relatively recent variants. Conclusions: Combining a novel protocol for full-length limiting dilution amplification of proviruses with PacBio SMRT sequencing allowed for the generation of near-full-length genomes with good quality and an ability to detect minor variants at the 1-10% level. Preliminary data analyses suggest that defective proviruses may represent archival variants that persist long-term in host cells, while intact proviruses within the PBMC pool showing evidence of active virus replication may represent more recent variants.
A comparison of assemblers and strategies for complex, large-genome sequencing with PacBio long reads.
PacBio sequencing holds promise for addressing large-genome complexities, such as long, highly repetitive, low-complexity regions and duplication events that are difficult to resolve with short-read technologies. Several strategies, with varying outcomes, are available for de novo sequencing and assembling of larger genomes. Using a diploid fungal genome, estimated to be ~80 Mb in size, as the basis dataset for comparison, we highlight assembly options when using only PacBio sequencing or a combined strategy leveraging data sets from multiple sequencing technologies. Data generated from SMRT Sequencing was subjected to assembly using different large-genome assemblers, and comparisons of the results will be shown. These include results generated with HGAP, Celera Assembler, MIRA, PBJelly, and other assembly tools currently in development. Improvements observed include a near 50% reduction in the number of contigs coupled with at least a doubling of contig N50 size in genome assemblies incorporating SMRT Sequencing data. We further show how incorporating long reads also highlights new challenges and missed insights of short-read assemblies arising from heterozygosity inherent in multiploid genomes.
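For readers unfamiliar with the contig N50 metric used in this comparison, the following Python sketch shows one common way to compute it: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are hypothetical and are not taken from the fungal assemblies described above.

```python
def n50(contig_lengths):
    """Return the contig N50 for a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Made-up toy assemblies of the same total size (400 units each).
fragmented = [100, 80, 60, 40, 40, 30, 30, 20]  # 8 contigs
improved = [200, 120, 50, 30]                   # 4 contigs

print(len(fragmented), n50(fragmented))
print(len(improved), n50(improved))
```

In this toy case the contig count halves while the N50 more than doubles, which is the kind of change the abstract reports; the actual figures depend on the assembler and data used.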
An interactive workflow for the analysis of contigs from the metagenomic shotgun assembly of SMRT Sequencing data.
The data throughput of next-generation sequencing allows whole microbial communities to be analyzed using a shotgun sequencing approach. Because a key task in taking advantage of these data is the ability to cluster reads that belong to the same member in a community, single-molecule long reads of up to 30 kb from SMRT Sequencing provide a unique capability in identifying those relationships and pave the way towards finished assemblies of community members. Long reads become even more valuable as samples get more complex with lower intra-species variation, a larger number of closely related species, or high intra-species variation. Here we present a collection of tools tailored for PacBio data for the analysis of these fragmented metagenomic assemblies, allowing improvements in the assembly results and greater insight into the communities themselves. Supervised classification is applied to a large set of sequence characteristics, e.g., GC content, raw-read coverage, k-mer frequency, and gene prediction information, allowing the clustering of contigs from single or highly related species. A unique feature of SMRT Sequencing data is the availability of base modification/methylation information, which can be used to further analyze clustered contigs expected to be composed of single or very closely related species. Here we show that base modification information can be used to further study variation, based on differences in the methylated DNA motifs involved in the restriction-modification system. Application of these techniques is demonstrated on a monkey intestinal microbiome sample and an in silico mix of real sequencing data from distinct bacterial samples.
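As an illustration of the sequence characteristics mentioned above, the following Python sketch derives a simple per-contig feature vector from GC content, a supplied coverage value, and normalized k-mer frequencies. It is a minimal sketch under stated assumptions: the function names, the choice of k = 4, and the feature layout are hypothetical, and the actual workflow also uses gene prediction information and a classifier that are not reproduced here.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq, k=4):
    """Normalized frequencies of all 4**k k-mers, in lexicographic order.

    Windows containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def feature_vector(seq, coverage, k=4):
    """Concatenate GC content, coverage and k-mer frequencies for one contig."""
    return [gc_content(seq), float(coverage)] + kmer_frequencies(seq, k)


# Hypothetical contig sequence and coverage value.
features = feature_vector("ACGTACGTGGCCAATTACGT", coverage=35.0)
print(len(features), features[:2])
```

Vectors like this could then be passed to any clustering or classification method; which method performs best on real metagenomic contigs is not something this sketch addresses.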
Unique haplotype structure determination in human genome using Single Molecule, Real-Time (SMRT) Sequencing of targeted full-length fosmids.
Determination of unique individual haplotypes is an essential first step toward understanding how identical genotypes having different phases lead to different biological interpretations of function, phenotype, and disease. Genome-wide methods for identifying individual genetic variation have been limited in their ability to acquire phased, extended, and complete genomic sequences that are long enough to assemble haplotypes with high confidence. We explore a recombineering approach for isolation and sequencing of a tiling of targeted fosmids to capture regions of interest from the human genome. Each individual fosmid contains a large genomic fragment (~35 kb) that is sequenced with long-read SMRT technology to generate contiguous long reads. These long reads can be easily de novo assembled for targeted haplotype resolution within an individual’s genome. The P5-C3 chemistry for SMRT Sequencing generated contiguous, full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with minimal data processing. The phase preserved in fosmid clones spanned at least two heterozygous variant loci, providing the essential detail of precise haplotype structures. We show complete assembly of haplotypes for various targeted loci, including the complex haplotypes of the KIR locus (~150 to 200 kb) and conserved extended haplotypes (CEHs) of the MHC region. This method is easily applicable to other regions of the human genome, as well as other genomes.
The killer immunoglobulin-like receptor (KIR) genes belong to the immunoglobulin superfamily and are widely studied due to the critical role they play in coordinating the innate immune response to infection and disease. Highly accurate, contiguous, long reads, like those generated by SMRT Sequencing, when combined with target-enrichment protocols, provide a straightforward strategy for generating complete de novo assembled KIR haplotypes. We have explored two different methods to capture the KIR region: one using fosmid clones and one using Nimblegen capture.
Human genomic variations range in size from single nucleotide substitutions to large chromosomal rearrangements. Sequencing technologies tend to be optimized for detecting particular variant types and sizes. Short reads excel at detecting SNVs and small indels, while long or linked reads are typically used to detect larger structural variants or phase distant loci. Long reads are more easily mapped to repetitive regions, but tend to have lower per-base accuracy, making it difficult to call short variants. The PacBio Sequel System produces two main data types: long continuous reads (up to 100 kbp), generated by single passes over a long template, and Circular Consensus Sequence (CCS) reads, generated by calculating the consensus of many sequencing passes over a single shorter template (500 bp to 20 kbp). The long-range information in continuous reads is useful for genome assembly and structural variant detection. The higher base accuracy of CCS effectively detects and phases short variants in single molecules. Recent improvements in library preparation protocols and sequencing chemistry have increased the length, accuracy, and throughput of CCS reads. For the human sample HG002, we collected 28-fold coverage 15 kbp high-fidelity CCS reads with an average read quality above Q20 (99% accuracy). The length and accuracy of these reads allow us to detect SNVs, indels, and structural variants not only in the Genome in a Bottle (GIAB) high confidence regions, but also in segmental duplications, HLA loci, and clinically relevant “difficult-to-map” genes. As with continuous long reads, we call structural variants at 90.0% recall compared to the GIAB structural variant benchmark “truth” set, with the added advantages of base pair resolution for variant calls and improved recall at compound heterozygous loci. 
With minimap2 alignments, GATK4 HaplotypeCaller variant calls, and simple variant filtration, we have achieved a SNP F-Score of 99.51% and an INDEL F-Score of 80.10% against the GIAB short variant benchmark “truth” set, in addition to calling variants outside of the high confidence region established by GIAB using previous technologies. With the long-range information available in 15 kbp reads, we applied the read-backed phasing tool WhatsHap to generate phase blocks with a mean length of 65 kbp across the entire genome. Using an alignment-based approach, we typed all major MHC class I and class II genes to at least 3-field precision. This new data type has the potential to expand the GIAB high confidence regions and “truth” benchmark sets to many previously difficult-to-map genes and allow a single sequencing protocol to address both short variants and large structural variants.
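The F-scores quoted above combine precision and recall against the benchmark into a single number. As a hedged illustration, the following Python sketch computes precision, recall and the F1 score (their harmonic mean) from counts of true positives, false positives and false negatives. The counts used in the example are made up for demonstration; the abstract reports only the final F-scores, not the underlying counts.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for a set of variant calls.

    precision = TP / (TP + FP): fraction of calls that are in the truth set
    recall    = TP / (TP + FN): fraction of truth-set variants that were called
    F1        = harmonic mean of precision and recall
    """
    called = true_pos + false_pos
    truth = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the HG002 evaluation.
p, r, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a call set with high recall but many false positives (or the reverse) still scores poorly.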
Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvements in library construction enable average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single-molecule reads. The resulting reads have accuracy comparable to short-read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the Canu assembler, generating draft assemblies of very high contiguity, equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clone 6 and Clone 47 to identify single nucleotide polymorphisms (SNPs) and structural variants (SVs) that are specific to each of the three samples.
De novo assemblies of human genomes from accurate (85-90%), continuous long reads (CLR) now approach the human reference genome in contiguity, but the assembly base pair accuracy is typically below QV40 (99.99%), an order of magnitude lower than the standard for finished references. The base pair errors complicate downstream interpretation, particularly false positive indels that lead to false gene loss through frameshifts. PacBio HiFi sequence data, which are both long (>10 kb) and very accurate (>99.9%) at the individual sequence read level, enable a new paradigm in human genome assembly. Haploid human assemblies using HiFi data achieve similar contiguity to those using CLR data and are highly accurate at the base level [1]. Furthermore, HiFi assemblies resolve more high-identity sequences such as segmental duplications [2]. To enable HiFi assembly in diploid human samples, we have extended the FALCON-Unzip assembler to work directly with HiFi reads. Here we present phased human diploid genome assemblies from HiFi sequencing of HG002, HG005, and the Vertebrate Genome Project (VGP) mHomSap1 trio on the PacBio Sequel II System. The HiFi assemblies all exceed the VGP’s quality guidelines, approaching QV50 (99.999%) accuracy. For HG002, 60% of the genome was haplotype-resolved, with a phase-block N50 of 143 kbp and phasing accuracy of 99.6%. The overall mean base accuracy of the assembly was QV49.7. In conclusion, HiFi data show great promise towards complete, contiguous, and accurate diploid human assemblies.
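The quality values quoted above (QV40 as 99.99%, QV50 as 99.999%) follow the standard Phred scale, in which QV = -10 log10(error probability). The following Python sketch converts between the two representations; the function names are hypothetical and chosen for illustration.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy):
    """Convert per-base accuracy (strictly between 0 and 1) to a Phred QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Quality values mentioned in the abstracts above.
for qv in (20, 40, 49.7, 50):
    error_rate = 10 ** (-qv / 10.0)
    print(f"QV{qv}: accuracy={qv_to_accuracy(qv):.6%}, "
          f"about 1 error per {1 / error_rate:,.0f} bases")
```

On this scale each step of 10 QV corresponds to a tenfold drop in error rate, which is why the gap between QV40 and QV50 is described as an order of magnitude.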
PacBio SMRT Sequencing is fast changing the genomics space with its long reads and high consensus sequence accuracy, providing the most comprehensive view of the genome and transcriptome. In this…
In this PacBio User Group Meeting presentation, Mitchell Vollger of the University of Washington used HiFi reads from SMRT Sequencing to study segmental duplications in the human genome. The technique…
In this presentation, Emily Hatas of PacBio offers a look at how SMRT Sequencing has changed over the years as well as the most common applications in human genome analysis:…
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.