The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LDs). This level of complexity and inherent biases of short-read sequencing make it challenging for extracting immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS based genome and exome sequencing and SNP arrays have become a routine for population studies, numerous efforts are being made for developing software to extract and or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases, has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of genes comprised within them. All the genotype information is detected at allele- level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. De novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
Enrichment of unamplified DNA and long-read SMRT Sequencing in unlocking the underlying biological disease mechanisms of repeat expansion disorders
For many of the repeat expansion disorders, the disease gene has been discovered, however the underlying biological mechanisms have not yet been fully understood. This is mainly due to technological limitations that do not allow for the needed base-pair resolution of the long, repetitive genomic regions. We have developed a novel, amplification-free enrichment technique that uses the CRISPR/Cas9 system to target large repeat expansions. This method, in conjunction with PacBio’s long reads and uniform coverage, enables sequencing of these complex genomic regions. By using a PCR-free amplification method, we are able to access not only the repetitive elements and interruption sequences accurately, but also the epigenetic information.
We have developed several candidate gene screening applications for both Neuromuscular and Neurological disorders. The power behind these applications comes from the use of long-read sequencing. It allows us to access previously unresolvable and even unsequencable genomic regions. SMRT Sequencing offers uniform coverage, a lack of sequence context bias, and very high accuracy. In addition, it is also possible to directly detect epigenetic signatures and characterize full-length gene transcripts through assembly-free isoform sequencing. In addition to calling the bases, SMRT Sequencing uses the kinetic information from each nucleotide to distinguish between modified and native bases.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enables high quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important in understanding the genetic basis for human disease, and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
Multiplex target enrichment using barcoded multi-kilobase fragments and probe-based capture technologies
Target enrichment capture methods allow scientists to rapidly interrogate important genomic regions of interest for variant discovery, including SNPs, gene isoforms, and structural variation. Custom targeted sequencing panels are important for characterizing heterogeneous, complex diseases and uncovering the genetic basis of inherited traits with more uniform coverage when compared to PCR-based strategies. With the increasing availability of high-quality reference genomes, customized gene panels are readily designed with high specificity to capture genomic regions of interest, thus enabling scientists to expand their research scope from a single individual to larger cohort studies or population-wide investigations. Coupled with PacBio® long-read sequencing, these technologies can capture 5 kb fragments of genomic DNA (gDNA), which are useful for interrogating intronic, exonic, and regulatory regions, characterizing complex structural variations, distinguishing between gene duplications and pseudogenes, and interpreting variant haplotyes. In addition, SMRT® Sequencing offers the lowest GC-bias and can sequence through repetitive regions. We demonstrate the additional insights possible by using in-depth long read capture sequencing for key immunology, drug metabolizing, and disease causing genes such as HLA, filaggrin, and cancer associated genes.
A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual risk variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430). Long-read sequencing together with hybrid-capture targeting technologies provides a powerful combination to target candidate genes/transcripts of interest. Shearing the genomic DNA to ~5 kb fragments and then capturing with probes that span the whole gene(s) of interest can provide uniform coverage across the entire region, identifying variants and allowing for phasing into two haplotypes. Furthermore, capturing full-length cDNA from the same sample using the same capture probes can also provide an understanding of isoforms that are generated and allow them to be assigned to their corresponding haplotype. Here we present a method for capturing genomic DNA and cDNA from an AD sample using a panel of probes targeting approximately 20 late-onset AD candidate genes which includes CLU, ABCA7, CD33, TREM2, TOMM40, PSEN2, APH1 and BIN1. By combining xGen® Lockdown® probes with SMRT Sequencing, we provide completely sequenced candidate genes as well as their corresponding transcripts. In addition, we are also able to evaluate structural variants that due to their size, repetitive nature, or low sequence complexity have been un-sequenceable using short-read technologies.
Genes associated with several neurological disorders have been shown to be highly polymorphic. Targeted sequencing of these genes using NGS technologies is a powerful way to increase the cost-effectiveness of variant discovery and detection. However, for a comprehensive view of these target genes, it is necessary to have complete and uniform coverage across regions of interest. Unfortunately, short-read sequencing technologies are not ideal for these types of studies as they are prone to mis-mapping and often fail to span repetitive regions. Targeted sequencing with PacBio long reads provides the unique advantage of single-molecule observations of complex genomic regions. PacBio long reads not only provide continuous sequence data though polymorphic or repetitive regions, but also have no GC bias. Here we describe the characterization of the poly-T locus in TOMM40, a gene known to be associated with progression to Alzheimer’s, using PacBio long reads. Probes were designed to capture a 20 kb region comprising the TOMM40 and ApoE genes. Target regions were captured in multiple cell lines and sequencing libraries made using standard sample preparation methods. We will present our results on the poly-T structural variants that we observed in TOMM40 in these cell lines. We will also present our results on probe design optimization and barcoding strategies for a cost-effective solution.
Nucleotide repeat expansions are a major cause of neurological and neuromuscular disease in humans, however, the nature of these genomic regions makes characterizing them extremely challenging. Accurate DNA sequencing of repeat expansions using short-read sequencing technologies is difficult, as short-read technologies often cannot read through regions of low sequence complexity. Additionally, these short reads do not span the entire region of interest and therefore sequence assembly is required. Lastly, most target enrichment methods are reliant upon amplification which adds the additional caveat of PCR bias. We have developed a novel, amplification-free enrichment technique that employs the CRISPR/Cas9 system for specific targeting of individual human genes. This method, in conjunction with PacBio’s long reads and uniform coverage, enables sequencing of complex genomic regions that cannot be investigated with other technologies. Using human genomic DNA samples and this strategy, we have successfully targeted the loci of Huntington’s Disease (HTT; CAG repeat), Fragile X (FMR1; CGG repeat), ALS (C9orf72; GGGGCC repeat), and Spinocerebellar ataxia type 10 (SCA10; variable ATTCT repeat) for examination. With this data, we demonstrate the ability to isolate hundreds of individual on-target molecules in a single SMRT Cell and accurately sequence through long repeat stretches, regardless of the extreme GC-content. The method is compatible with multiplexing of multiple targets and multiple samples in a single reaction. This technique also captures native DNA molecules for sequencing, allowing for the possibility of direct detection and characterization of epigenetic signatures.
Target enrichment using a neurology panel for 12 barcoded genomic DNA samples on the PacBio SMRT Sequencing platform
Target enrichment is a powerful tool for studies involved in understanding polymorphic SNPs with phasing, tandem repeats, and structural variations. With increasing availability of reference genomes, researchers can easily design a cost-effective targeted investigation with custom probes specific to regions of interest. Using PacBio long-read technology in conjunction with probe capture, we were able to sequence multi-kilobase enriched regions to fully investigate intronic and exonic regions, distinguish haplotypes, and characterize structural variations. Furthermore, we demonstrate this approach is advantageous for studying complex genomic regions previously inaccessible through other sequencing platforms. In the present work, 12 barcoded genomic DNA (gDNA) samples were sheared to 6 kb for target enrichment analysis using the Neurology panel provided by Roche NimbleGen. Probe-captured DNA was used to make SMRTbell libraries for SMRT Sequencing on the PacBio RS II. Our results demonstrate the ability to multiplex 12 samples and achieve 1300x enrichment of targeted regions. In addition, we achieved an even representation of on-target rate of 70% across the 12 barcoded genomic DNA samples.
Targeted SMRT Sequencing of difficult regions of the genome using a Cas9, non-amplification based method
Targeted sequencing has proven to be an economical means of obtaining sequence information for one or more defined regions of a larger genome. However, most target enrichment methods are reliant upon some form of amplification. Amplification removes the epigenetic marks present in native DNA, and some genomic regions, such as those with extreme GC content and repetitive sequences, are recalcitrant to faithful amplification. Yet, a large number of genetic disorders are caused by expansions of repeat sequences. Furthermore, for some disorders, methylation status has been shown to be a key factor in the mechanism of disease. We have developed a novel, amplification-free enrichment technique that employs the CRISPR/Cas9 system for specific targeting of individual human genes. This method, in conjunction with SMRT Sequencing’s long reads, high consensus accuracy, and uniform coverage, allows the sequencing of complex genomic regions that cannot be investigated with other technologies.
Targeted enrichment without amplification and SMRT Sequencing of repeat-expansion disease causative genomic regions
Targeted sequencing has proven to be an economical means of obtaining sequence information for one or more defined regions of a larger genome. However, most target enrichment methods are reliant upon some form of amplification. Amplification removes the epigenetic marks present in native DNA, and some genomic regions, such as those with extreme GC content and repetitive sequences, are recalcitrant to faithful amplification. Yet, a large number of genetic disorders are caused by expansions of repeat sequences. Furthermore, for some disorders, methylation status has been shown to be a key factor in the mechanism of disease. We have developed a novel, amplification-free enrichment technique that employs the CRISPR/Cas9 system for specific targeting of individual human genes. This method, in conjunction with SMRT Sequencing’s long reads, high consensus accuracy, and uniform coverage, allows the sequencing of complex genomic regions that cannot be investigated with other technologies. Using human genomic DNA samples and this strategy, we have successfully targeted the loci of a number of repeat expansion disorders (HTT, FMR1, ATXN10, C9orf72). With this data, we demonstrate the ability to isolate hundreds of individual on-target molecules and accurately sequence through long repeat stretches, regardless of the extreme GC-content, followed by accurate sequencing on a single PacBio RS II SMRT Cell or Sequel SMRT Cell 1M. The method is compatible with multiplexing of multiple targets and multiple samples in a single reaction. Furthermore, this technique also preserves native DNA molecules for sequencing, allowing for the possibility of direct detection and characterization of epigenetic signatures. We demonstrate detection of 5-mC in human promoter sequences and CpG islands.
Structural variants (genomic differences =50 base pairs) contribute to the evolution of traits and disease. Most structural variants (SVs) are too small to detect with array comparative genomic hybridization and too large to reliably discover with short-read DNA sequencing.
High-quality de novo genome assembly and intra-individual mitochondrial instability in the critically endangered kakapo
The kakapo (Strigops habroptila) is a large, flightless parrot endemic to New Zealand. It is highly endangered with only ~150 individuals remaining, and intensive conservation efforts are underway to save this iconic species from extinction. These include genetic studies to understand critical genes relevant to fertility, adaptation and disease resistance, and genetic diversity across the remaining population for future breeding program decisions. To aid with these efforts, we have generated a high-quality de novo genome assembly using PacBio long-read sequencing. Using the new diploid-aware FALCON-Unzip assembler, the resulting genome of 1.06 Gb has a contig N50 of 5.6 Mb (largest contig 29.3 Mb), >350-times more contiguous compared to a recent short-read assembly of a closely related parrot (kea) species. We highlight the benefits of the higher contiguity and greater completeness of the kakapo genome assembly through examples of fully resolved genes important in wildlife conservation (contrasted with fragmented and incomplete gene resolution in short-read assemblies), in some cases even providing sequence for regions orthologous to gaps of missing sequence in the chicken reference genome. We also highlight the complete resolution of the kakapo mitochondrial genome, fully containing the mitochondrial control region which is missing from the previous dedicated kakapomitochondrial genome NCBI entry. For this region, we observed a marked heterogeneity in the number of tandem repeats in different mtDNAmolecules from a single bird tissue, highlighting the enhanced molecular resolution uniquely afforded by long-read, single-molecule PacBio sequencing.
Targeted sequencing has proven to be an economical means of obtaining sequence information for one or more defined regions of a larger genome. However, most target enrichment methods are reliant upon some form of amplification. Amplification removes the epigenetic marks present in native DNA, and some genomic regions, such as those with extreme GC content and repetitive sequences, are recalcitrant to faithful amplification. Yet, a large number of genetic disorders are caused by expansions of repeat sequences. Furthermore, for some disorders, methylation status has been shown to be a key factor in the mechanism of disease.
Plant and animal whole genome sequencing has proven to be challenging, particularly due to genome size, high density of repetitive elements and heterozygosity. The Sequel System delivers long reads, high consensus accuracy and uniform coverage, enabling more complete, accurate, and contiguous assemblies of these large complex genomes. The latest Sequel chemistry increases yield up to 8 Gb per SMRT Cell for long insert libraries >20 kb and up to 10 Gb per SMRT Cell for libraries >40 kb. In addition, the recently released SMRTbell Express Template Prep Kit reduces the time (~3 hours) and DNA input (~3 µg), making the workflow easy to use for multi- SMRT Cell projects. Here, we recommend the best practices for whole genome sequencing and de novo assembly of complex plant and animal genomes. Guidelines for constructing large-insert SMRTbell libraries (>30 kb) to generate optimal read lengths and yields using the latest Sequel chemistry are presented. We also describe ways to maximize library yield per preparation from as littles as 3 µg of sheared genomic DNA. The combination of these advances makes plant and animal whole genome sequencing a practical application of the Sequel System.