The killer immunoglobulin-like receptor (KIR) genes belong to the immunoglobulin superfamily and are widely studied because of the critical role they play in coordinating the innate immune response to infection and disease. Highly accurate, contiguous long reads, like those generated by SMRT Sequencing, when combined with target-enrichment protocols, provide a straightforward strategy for generating complete de novo assembled KIR haplotypes. We have explored two methods to capture the KIR region: one using fosmid clones and one using NimbleGen capture.
Increased sequencing throughput creates a need for multiplexing in several applications. Here we detail different barcoding strategies for microbial sequencing, targeted sequencing, Iso-Seq full-length isoform sequencing, and Roche NimbleGen’s target enrichment method.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enable high-quality de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important in understanding the genetic basis of human disease and difficult to access via other technologies. To demonstrate this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
The constituents and intra-communal interactions of microbial populations have garnered increasing interest in areas such as water remediation, agriculture and human health. Amplification and sequencing of the evolutionarily conserved 16S rRNA gene is an efficient method of profiling communities. Currently, most targeted amplification focuses on short, hypervariable regions of the 16S sequence. Distinguishing information not spanned by the targeted region is lost, and species-level classification is often not possible. PacBio SMRT Sequencing easily spans the entire 1.5 kb 16S gene in a single read, producing highly accurate single-molecule sequences that can improve the identification of individual species in a metapopulation. However, this process still relies upon PCR amplification from a mixture of similar sequences, which may result in chimeras, or recombinant molecules, at rates upwards of 20%. These PCR artifacts make it difficult to identify novel species and reduce the number of informative sequences. We investigated multiple factors that may contribute to chimera formation, such as template damage, denaturation time before and during thermocycling, polymerase extension time, and reaction volume. We found two related factors that contribute to chimera formation: the amount of template input into the PCR, and the number of PCR cycles. A second problem that can confound analysis is sequence errors generated during amplification and sequencing. With the updated algorithm for circular consensus sequencing (CCS2), single-molecule reads can be filtered to 99.99% predicted accuracy. Substitution errors in these highly filtered reads may be dominated by mis-incorporations during amplification. Sequence differences in full-length 16S amplicons from several commercial high-fidelity PCR kits were compared. We show results of our experiments and describe our optimized protocol for full-length 16S amplification for SMRT Sequencing.
These optimizations have broader implications for other applications that use PCR amplification to phase variations across targeted regions and generate highly accurate reference sequences.
An improved circular consensus algorithm with an application to detect HIV-1 Drug Resistance Associated Mutations (DRAMs)
Scientists who require confident resolution of heterogeneous populations across complex regions have been unable to transition to short-read sequencing methods. They continue to depend on Sanger sequencing despite its cost and time inefficiencies. Here we present a redesigned algorithm that allows the generation of circular consensus sequences (CCS) from individual SMRT Sequencing reads. With this new algorithm, dubbed CCS2, it is possible to reach high quality across longer insert lengths at a lower cost and higher throughput than Sanger sequencing. We applied CCS2 to the characterization of the HIV-1 K103N drug-resistance associated mutation in both clonal and patient samples. This particular DRAM has previously proved to be clinically relevant, but challenging to characterize due to regional sequence context. First, a mutation was introduced into the third position of codon 103 (A>C substitution) of the RT gene on a pNL4-3 backbone by site-directed mutagenesis. Regions spanning ~1.3 kb were PCR amplified from both the non-mutated and mutant (K103N) plasmids, and were sequenced individually and as a 50:50 mixture. Additionally, the proviral reservoir of a subject with known dates of virologic failure of an Efavirenz-based regimen and with documented emergence of drug-resistant (K103N) viremia was sequenced at several time points as a proof-of-concept study to determine the kinetics of retention and decay of K103N. Sequencing data were analyzed using the new CCS2 algorithm, which uses a fully generative probabilistic model of our SMRT Sequencing process to polish consensus sequences to high accuracy. With CCS2, we are able to achieve a per-read empirical quality of QV30 (99.9% accuracy) at 19X coverage. A total of ~5000 1.3 kb consensus sequences with a collective empirical quality of ~QV40 (99.99%) were obtained for each sample.
We demonstrate a 0% miscall rate in both unmixed control samples, and estimate a 48:52 frequency for the K103N mutation in the mixed (50:50) plasmid sample, consistent with data produced by orthogonal platforms. Additionally, the K103N escape variant was only detected in proviral samples from time points subsequent (19%) to the emergence of drug-resistant viremia. This tool might be used to monitor the HIV reservoir for stable evolutionary changes throughout infection.
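The QV figures quoted above (QV30 for 99.9%, QV40 for 99.99%) follow the standard Phred convention, in which a quality value is -10 times the base-10 logarithm of the error probability. The following is a minimal sketch of that conversion only; the function names are illustrative and are not part of CCS2 or SMRT Analysis.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to accuracy (1 - error probability)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1); a perfect read has no finite QV")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    for qv in (20, 30, 40):
        print(f"QV{qv} -> {qv_to_accuracy(qv):.4%} accuracy")
```

Each additional 10 QV units corresponds to a tenfold reduction in the expected error rate, which is why moving from QV30 to QV40 takes the expected errors in a 1.3 kb consensus from roughly 1.3 to roughly 0.13.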
Workflow for processing high-throughput, Single Molecule, Real-Time Sequencing data for analyzing the microbiome of patients undergoing fecal microbiota transplantation
There are many sequencing-based approaches to understanding complex metagenomic communities, ranging from targeted amplification to whole-sample shotgun sequencing. While targeted approaches provide valuable data at low sequencing depth, they are limited by primer design and PCR. Whole-sample shotgun experiments generally use short-read sequencing, which results in data processing difficulties. For example, reads less than 500 bp in length will rarely cover a complete gene or region of interest, and will require assembly. This not only introduces the possibility of incorrectly combining sequence from different community members, it also requires a high depth of coverage. As such, rare community members may not be represented in the resulting assembly. Circular-consensus, Single Molecule, Real-Time (SMRT) Sequencing reads in the 1-3 kb range with >99% accuracy can be generated using the previous-generation PacBio RS II or, in much higher throughput, using the new Sequel System. While throughput is lower compared to short-read sequencing methods, the reads are a true random sampling of the underlying community since SMRT Sequencing has been shown to have very low sequence-context bias. With single-molecule reads >1 kb at >99% consensus accuracy, it is reasonable to expect a high percentage of reads to include genes or gene fragments useful for analysis without the need for de novo assembly. Here we present the results of circular consensus sequencing for an individual’s microbiome, before and after undergoing fecal microbiota transplantation (FMT) to treat a chronic Clostridium difficile infection. We show that even with relatively low sequencing depth, the long-read, assembly-free, random sampling allows us to profile low-abundance community members at the species level.
We also show that using shotgun sampling with long reads allows a level of functional insight not possible with classic targeted 16S, or short read sequencing, due to entire genes being covered in single reads.
As the throughput of the PacBio Systems continues to increase, so does the desire to fully utilize SMRT Cell sequencing capacity to multiplex microbes for whole genome sequencing. Multiplexing is readily achieved by incorporating a unique barcode for each microbe into the SMRTbell adapters and using a streamlined library preparation process. Incorporating barcodes without PCR amplification prevents the loss of epigenetic information and the generation of chimeric sequences, while eliminating the need to generate separate SMRTbell libraries. We multiplexed the genomes of up to 8 unique strains of H. pylori. Each genome was sheared and processed through adapter ligation in a single, addition-only reaction. The barcoded samples were pooled in equimolar quantities and a single SMRTbell library was prepared. We demonstrate successful de novo microbial assembly from all multiplexes tested (2- through 8-plex) using data generated from a single SMRTbell library, run on a single SMRT Cell with the PacBio RS II, and analyzed with standard SMRT Analysis assembly methods. This strategy was successful using both small (1.6 Mb, H. pylori) and medium (5 Mb, E. coli) genomes. This protocol facilitates the sequencing of multiple microbial genomes in a single run, greatly increasing throughput and reducing costs per genome.
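To illustrate the idea behind pooling barcoded samples, the sketch below assigns a read to a sample by comparing its leading bases against a set of known barcodes, allowing a small number of mismatches. It is a simplified illustration, not the SMRT Analysis demultiplexing method: the barcode sequences and sample names are hypothetical, only the start of the read is inspected, and insertions or deletions are not handled.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read: str, barcodes: Dict[str, str],
                   max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose barcode best matches the read prefix.

    Returns None when no barcode is within max_mismatches, or when two
    barcodes tie for the best match (an ambiguous assignment).
    """
    best_name: Optional[str] = None
    best_dist = max_mismatches + 1
    tie = False
    for name, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than the barcode
        dist = hamming(prefix, barcode)
        if dist < best_dist:
            best_name, best_dist, tie = name, dist, False
        elif dist == best_dist and best_name is not None:
            tie = True
    return None if tie else best_name


if __name__ == "__main__":
    # Hypothetical barcodes, for illustration only.
    example_barcodes = {"strain_1": "ACGTACGT", "strain_2": "TTGGCCAA"}
    print(assign_barcode("ACGTACGAGGGGTTTT", example_barcodes))
```

Rejecting ties rather than picking one arbitrarily keeps ambiguous reads out of every per-strain assembly, at the cost of discarding some data.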
Fecal samples were obtained from human subjects in the first blinded, placebo-controlled trial to evaluate the efficacy and safety of fecal microbiota transplant (FMT) for treatment of recurrent C. difficile infection. Samples included pre- and post-FMT, post-placebo transplant, and the donor control; samples were taken at 2 and 8 weeks post-FMT. Sequencing was done on the PacBio Sequel System, with the goal of obtaining high-quality sequences covering whole genes or gene clusters, which will be used to better understand the relationship between the composition and functional capabilities of intestinal microbiomes and patient health. Methods: Samples were randomly sheared to 2-3 kb fragments, a sufficient length to cover most genes, and SMRTbell libraries were prepared using standard protocols. Libraries were run on the Sequel System, which has a throughput of hundreds of thousands of reads per SMRT Cell, adequate yield to sample the complex microbiomes of post-transplant and donor samples. Results: Here we characterize samples, describe library prep methods and detail Sequel System operation, including run conditions. Descriptive statistics of data output (primary analysis) are presented, along with SMRT Analysis reports on circular consensus sequence (CCS) reads generated using an updated algorithm (CCS2). Final sequencing yields are filtered at various levels of predicted accuracy from 90% to 99.9%. Previous studies done using the PacBio RS II System demonstrated the ability to profile at the species level, and in some cases the strain level, and provided functional insight. Conclusions: These results demonstrate that the Sequel System is well-suited for characterization of complex microbial communities, with the ability for high-throughput generation of extremely accurate single-molecule sequences, each several kilobases in length.
The entire process from shearing and library prep through sequencing and CCS analysis can be completed in less than 48 hours.
Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome using long-read sequencing
Sequence-based estimation of genetic diversity of Plasmodium falciparum, the most lethal malarial parasite, has proved challenging due to the lack of a complete genomic assembly. The skewed AT-richness (~80.6% (A+T)) of its genome and the lack of technology to assemble highly polymorphic sub-telomeric regions that contain clonally variant, multigene virulence families (i.e. var and rifin) have confounded attempts using short-read NGS technologies. Using single molecule, real-time (SMRT) sequencing, we successfully compiled all 14 nuclear chromosomes of the P. falciparum genome from telomere-to-telomere in single contigs. Specifically, amplification-free sequencing generated reads of average length 12 kb, with ≥50% of the reads between 15.5 and 50 kb in length. A hierarchical genome assembly process (HGAP) was used to assemble the P. falciparum genome de novo. This assembly accurately resolved centromeres (~90-99% (A+T)) and sub-telomeric regions, and identified large insertions and duplications in the genome that added extra genes to the var and rifin virulence families, along with smaller structural variants such as homopolymer tract expansions. These regions can be used as markers for genetic diversity during comparative genome analyses. Moreover, identifying the polymorphic and repetitive sub-telomeric sequences of parasite populations from endemic areas might inform the link between structural variation and phenotypes such as virulence, drug resistance and disease transmission.
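The (A+T) percentages quoted above are simple base-composition fractions. As a minimal sketch (the function names and window sizes are illustrative and not taken from the study), the fraction can be computed over a whole sequence or in windows along a contig, which is one way extremely AT-rich stretches such as centromeres would stand out against a genome-wide background:

```python
from typing import List, Tuple


def at_fraction(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are A or T; other symbols are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("A") + seq.count("T")) / called


def windowed_at(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, AT fraction) for each full window along the sequence."""
    return [
        (start, at_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "AATTAATTGC" * 3 + "ATATATATAT" * 2
    print(f"overall A+T: {at_fraction(toy):.1%}")
    for start, frac in windowed_at(toy, window=10, step=10):
        print(start, f"{frac:.0%}")
```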
Multiplex target enrichment using barcoded multi-kilobase fragments and probe-based capture technologies
Target enrichment capture methods allow scientists to rapidly interrogate important genomic regions of interest for variant discovery, including SNPs, gene isoforms, and structural variation. Custom targeted sequencing panels are important for characterizing heterogeneous, complex diseases and uncovering the genetic basis of inherited traits with more uniform coverage when compared to PCR-based strategies. With the increasing availability of high-quality reference genomes, customized gene panels are readily designed with high specificity to capture genomic regions of interest, thus enabling scientists to expand their research scope from a single individual to larger cohort studies or population-wide investigations. Coupled with PacBio® long-read sequencing, these technologies can capture 5 kb fragments of genomic DNA (gDNA), which are useful for interrogating intronic, exonic, and regulatory regions, characterizing complex structural variations, distinguishing between gene duplications and pseudogenes, and interpreting variant haplotypes. In addition, SMRT® Sequencing offers the lowest GC-bias and can sequence through repetitive regions. We demonstrate the additional insights possible by using in-depth long-read capture sequencing for key immunology, drug-metabolizing, and disease-causing genes such as HLA, filaggrin, and cancer-associated genes.
A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual risk variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430). Long-read sequencing together with hybrid-capture targeting technologies provides a powerful combination to target candidate genes/transcripts of interest. Shearing the genomic DNA to ~5 kb fragments and then capturing with probes that span the whole gene(s) of interest can provide uniform coverage across the entire region, identifying variants and allowing for phasing into two haplotypes. Furthermore, capturing full-length cDNA from the same sample using the same capture probes can also provide an understanding of the isoforms that are generated and allow them to be assigned to their corresponding haplotype. Here we present a method for capturing genomic DNA and cDNA from an AD sample using a panel of probes targeting approximately 20 late-onset AD candidate genes, including CLU, ABCA7, CD33, TREM2, TOMM40, PSEN2, APH1 and BIN1. By combining xGen® Lockdown® probes with SMRT Sequencing, we provide completely sequenced candidate genes as well as their corresponding transcripts. In addition, we are also able to evaluate structural variants that, due to their size, repetitive nature, or low sequence complexity, have been un-sequenceable using short-read technologies.
Genes associated with several neurological disorders have been shown to be highly polymorphic. Targeted sequencing of these genes using NGS technologies is a powerful way to increase the cost-effectiveness of variant discovery and detection. However, for a comprehensive view of these target genes, it is necessary to have complete and uniform coverage across regions of interest. Unfortunately, short-read sequencing technologies are not ideal for these types of studies as they are prone to mis-mapping and often fail to span repetitive regions. Targeted sequencing with PacBio long reads provides the unique advantage of single-molecule observations of complex genomic regions. PacBio long reads not only provide continuous sequence data through polymorphic or repetitive regions, but also have no GC bias. Here we describe the characterization of the poly-T locus in TOMM40, a gene known to be associated with progression to Alzheimer’s, using PacBio long reads. Probes were designed to capture a 20 kb region comprising the TOMM40 and ApoE genes. Target regions were captured in multiple cell lines and sequencing libraries were made using standard sample preparation methods. We will present our results on the poly-T structural variants that we observed in TOMM40 in these cell lines. We will also present our results on probe design optimization and barcoding strategies for a cost-effective solution.
Nucleotide repeat expansions are a major cause of neurological and neuromuscular disease in humans; however, the nature of these genomic regions makes characterizing them extremely challenging. Accurate DNA sequencing of repeat expansions using short-read sequencing technologies is difficult, as short-read technologies often cannot read through regions of low sequence complexity. Additionally, these short reads do not span the entire region of interest and therefore sequence assembly is required. Lastly, most target enrichment methods rely on amplification, which adds the caveat of PCR bias. We have developed a novel, amplification-free enrichment technique that employs the CRISPR/Cas9 system for specific targeting of individual human genes. This method, in conjunction with PacBio’s long reads and uniform coverage, enables sequencing of complex genomic regions that cannot be investigated with other technologies. Using human genomic DNA samples and this strategy, we have successfully targeted the loci of Huntington’s Disease (HTT; CAG repeat), Fragile X (FMR1; CGG repeat), ALS (C9orf72; GGGGCC repeat), and Spinocerebellar ataxia type 10 (SCA10; variable ATTCT repeat) for examination. With these data, we demonstrate the ability to isolate hundreds of individual on-target molecules in a single SMRT Cell and accurately sequence through long repeat stretches, regardless of the extreme GC-content. The method is compatible with multiplexing of multiple targets and multiple samples in a single reaction. This technique also captures native DNA molecules for sequencing, allowing for the possibility of direct detection and characterization of epigenetic signatures.
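Once a read spans an entire repeat locus, the repeat length can be read off the single molecule directly. The sketch below counts the longest uninterrupted tandem run of a repeat unit (for example CAG for HTT or GGGGCC for C9orf72) in a read. It is a simplified illustration rather than the analysis used in this work: it requires exact matches, so interruptions or residual sequencing errors would split a run, and the function name is hypothetical.

```python
import re


def longest_repeat_run(seq: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` found in `seq`.

    A lookahead is used so that every start position is tried, which keeps the
    result correct even for repeat units that can overlap themselves.
    """
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    pattern = re.compile(f"(?=((?:{re.escape(unit.upper())})+))")
    best = 0
    for match in pattern.finditer(seq.upper()):
        best = max(best, len(match.group(1)) // len(unit))
    return best


if __name__ == "__main__":
    # Toy read: flanking sequence around 20 CAG units (not real HTT sequence).
    toy_read = "GATTACA" + "CAG" * 20 + "CCGCCA"
    print(longest_repeat_run(toy_read, "CAG"))
```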
In recent years, human genomic research has focused on comparing short-read data sets to a single human reference genome. However, it is becoming increasingly clear that significant structural variations present in individual human genomes are missed or ignored by this approach. Additionally, remapping short-read data limits the phasing of variation among individual chromosomes. This reduces the newly sequenced genome to a table of single nucleotide polymorphisms (SNPs) with little to no information as to the co-linearity (phasing) of these variants, resulting in a “mosaic” reference representing neither of the parental chromosomes. The variation between the homologous chromosomes is lost in this representation, including allelic variations, structural variations, or even genes present in only one chromosome, leading to lost information regarding allele-specific gene expression and function. To address these limitations, we have made significant progress integrating haplotype information directly into the genome assembly process with long reads. The FALCON-Unzip algorithm leverages a string graph assembly approach to facilitate identification and separation of heterozygosity during the assembly process to produce a highly contiguous assembly with phased haplotypes representing the genome in its diploid state. The outputs of the assembler are pairs of sequences (haplotigs) containing the allelic differences, including SNPs and structural variations, present in the two sets of chromosomes. The development and testing of our de novo diploid assembler was facilitated and carefully validated using inbred reference model organisms and F1 progeny, which allowed us to ascertain the accuracy and concordance of haplotigs relative to the two inbred parental assemblies. Examination of the results confirmed that our haplotype-resolved assemblies are “Gold Level” reference genomes having a quality similar to that of Sanger sequencing, BAC-based assembly approaches.
We further sequenced and assembled two well-characterized human samples into their respective phased diploid genomes with gap-free contig N50 sizes greater than 23 Mb and haplotig N50 sizes greater than 380 kb. Results of these assemblies and a comparison between the haplotype sets are presented.
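The contig and haplotig N50 values above summarize assembly contiguity: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. A minimal sketch of that standard calculation follows; the function name and example lengths are illustrative only.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or haplotig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # arbitrary units
    print(n50(toy_contigs))
```

Because N50 is weighted by length, a handful of very long contigs can dominate it, which is why contig N50 and haplotig N50 are reported separately.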
Effect of coverage depth and haplotype phasing on structural variant detection with PacBio long reads
Each human genome has thousands of structural variants compared to the reference assembly, up to 85% of which are difficult or impossible to detect with Illumina short reads and are only visible with long, multi-kilobase reads. The PacBio RS II and Sequel single molecule, real-time (SMRT) sequencing platforms have made it practical to generate long reads at high throughput. These platforms enable the discovery of structural variants just as short-read platforms did for single nucleotide variants. Numerous software algorithms call structural variants effectively from PacBio long reads, but algorithm sensitivity is lower for insertion variants and all heterozygous variants. Furthermore, the impact of coverage depth and read lengths on sensitivity is not fully characterized. To quantify how zygosity, coverage depth, and read lengths impact the sensitivity of structural variant detection, we obtained high coverage PacBio sequences for three human samples: haploid CHM1, diploid NA12878, and diploid SK-BR-3. For each dataset, reads were randomly subsampled to titrate coverage from 0.5- to 50-fold. The structural variants detected at each coverage were compared to the set at “full” 50-fold coverage. For the diploid samples, additional titrations were performed with reads first partitioned by phase using single nucleotide variants for essentially haploid structural variant discovery. Even at low coverages (1- to 5-fold), PacBio long reads reveal hundreds of structural variants that are not seen in deep 50-fold Illumina whole genome sequences. At moderate 10-fold PacBio coverage, a majority of structural variants are detected. Sensitivity begins to level off at around 40-fold coverage, though it does not fully saturate before 50-fold. Phasing improves sensitivity for all variant types, especially at moderate 10- to 20-fold coverage. Long reads are an effective tool to identify and phase structural variants in the human genome. 
The majority of variants are detected at moderate 10-fold coverage, and even extremely low long-read coverage (1- to 5-fold) reveals variants that are invisible to short-read sequencing. Performance will continue to improve with better software and longer reads, which will empower studies to connect structural variants to health and disease traits in the human population.
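The coverage titration described above can be pictured as drawing reads at random until their summed length reaches the target fold coverage of the genome. The sketch below shows one way to do this; it is an assumption about the general approach, not the exact procedure used in the study, and the function name, seed handling, and toy numbers are illustrative.

```python
import random
from typing import List, Sequence


def subsample_to_coverage(read_lengths: Sequence[int], genome_size: int,
                          target_coverage: float, seed: int = 0) -> List[int]:
    """Return indices of randomly chosen reads totalling about target_coverage-fold.

    Fold coverage is taken as total read bases divided by genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    full_coverage = sum(read_lengths) / genome_size
    if target_coverage > full_coverage:
        raise ValueError(
            f"target {target_coverage}x exceeds available {full_coverage:.2f}x"
        )
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    needed_bases = target_coverage * genome_size
    chosen: List[int] = []
    bases = 0
    for index in order:
        if bases >= needed_bases:
            break
        chosen.append(index)
        bases += read_lengths[index]
    return chosen


if __name__ == "__main__":
    # Toy data: 100 reads of 10 kb over a 100 kb "genome" gives 10-fold coverage.
    lengths = [10_000] * 100
    for fold in (0.5, 1, 5, 10):
        picked = subsample_to_coverage(lengths, 100_000, fold)
        print(fold, len(picked))
```

Sensitivity at each titration point would then be the fraction of the full-coverage call set that is recovered from the subsample.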