June 1, 2021

Structural variant in the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene found to be linked to bipolar disorder

Bipolar disorder (BD) is a phenotypically and genetically complex neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin and adoption studies supporting the involvement of a genetic predisposition, with estimated heritability of up to ~80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BD susceptibility; however, no disease genes have yet been identified. Here, we have fully characterized a ~12 Mb significantly linked (LOD score = 3.54) genomic region on chromosome Xq24-q27 in an extended family from a genetic isolate using long-read single molecule, real-time (SMRT) sequencing. The family segregates BD in at least 4 generations, with 16 of 61 individuals affected. Thus, this family shows a highly elevated recurrence risk compared to the general population. Genetic complexity is expected to be reduced in isolated populations, even for genetically complex disorders such as BD, as in the case of this extended family. We selected 16 key individuals from the X-chromosomally linked family to be sequenced. These selected individuals either carried the disease haplotype, were non-carriers of the disease haplotype, or served as married-in controls. We designed a Nimblegen capture array enriching for 5-9 kb fragments spanning the entire 12 Mb region, which were then sequenced using long-read SMRT sequencing to screen for causative structural variants (SVs) explaining the increased risk for BD in this extended family. Altogether, 192 SVs were detected in the critically linked region; however, most of these represented common variants that could be seen across many of the family members regardless of disease status. One SV stood out, showing perfect segregation among all affected individuals who were carriers of the disease haplotype.
This was a 330 bp Alu deletion in intron 4 of the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene, which has previously been shown to play a central role in brain development and function. Moreover, Alu elements in general have previously been associated with at least 37 neurological and neurodegenerative disorders. To validate the finding and the functionality of the identified SV, further studies such as isoform characterization are warranted.
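
The segregation screen described in this abstract can be illustrated with a small filter. This is a minimal sketch, not the authors' pipeline: the `Sample` fields, the sample names, and the SV identifiers are hypothetical, and the rule used here (present in every affected haplotype carrier, absent from every sample without the haplotype) is an assumed reading of "perfect segregation".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str
    affected: bool
    carries_haplotype: bool


def segregating_svs(samples, sv_calls):
    """Return SV ids present in every affected haplotype carrier and
    absent from every sample that does not carry the haplotype.

    sv_calls maps an SV id to the set of sample names in which it was called.
    """
    must_have = {s.name for s in samples if s.affected and s.carries_haplotype}
    must_lack = {s.name for s in samples if not s.carries_haplotype}
    return sorted(
        sv for sv, present in sv_calls.items()
        if must_have <= present and not (must_lack & present)
    )


# Toy data (hypothetical): two affected carriers, one non-carrier, one married-in control.
samples = [
    Sample("A1", True, True),
    Sample("A2", True, True),
    Sample("U1", False, False),
    Sample("M1", False, False),
]
sv_calls = {
    "sv_common": {"A1", "A2", "U1", "M1"},   # seen in everyone, so filtered out
    "sv_partial": {"A1"},                    # missing from one affected carrier
    "sv_candidate": {"A1", "A2"},            # segregates with the haplotype
}
print(segregating_svs(samples, sv_calls))
```

A filter of this kind only narrows the candidate list; as the abstract notes, functional follow-up is still needed.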


June 1, 2021

Amplification-free protocol for targeted enrichment of repeat expansion genomic regions and SMRT Sequencing

Many genetic disorders are associated with repeat sequence expansions. Obtaining accurate DNA sequence information from these regions will help researchers further establish the relationship between these genetic disorders and their underlying disease mechanisms. Moreover, repeat interruptions have been shown to act as phenotypic modifiers in some disorders. Targeted sequencing is an economical way to obtain sequence information from one or more defined regions in a genome. However, most targeted enrichment and sequencing methods require some form of DNA amplification. Amplifying large regions with extreme GC content, as seen in repeat expansion disorders, is challenging and prone to introducing sequence artifacts. DNA amplification also removes any epigenetic signatures present in native DNA. The amplification-free protocol, by contrast, preserves native DNA molecules, leaving open the possibility of direct characterization of epigenetic signatures.


June 1, 2021

A complete solution for full-length transcript sequencing using the PacBio Sequel II System

Long-read mRNA sequencing methods such as PacBio’s Iso-Seq method offer high-throughput transcriptome profiling in prokaryotic and eukaryotic cells. By avoiding the transcript assembly problem and instead sequencing full-length cDNA, Iso-Seq has emerged as the most reliable technology for annotating isoforms and, in turn, improving proteome predictions in a wide variety of organisms. Improvements in library preparation, sequencing throughput, and bioinformatics have enabled the Iso-Seq method to be a complete solution for transcript characterization. The Iso-Seq Express kit is a one-day library prep requiring 60-300 ng of total RNA. The PacBio Sequel II system produces 4-5 million full-length reads, sufficient to profile a whole human transcriptome. Finally, the SQANTI2 software is a powerful tool for categorizing complex isoforms against reference annotations, while also incorporating orthogonal information such as CAGE peak data, public RNA-seq junction data, and ORF predictions.


June 1, 2021

New advances in SMRT Sequencing facilitate multiplexing for de novo and structural variant studies

The latest advancements in Sequel II SMRT Sequencing have increased average read lengths by up to 50% compared to Sequel II chemistry 1.0. In a single SMRT Cell 8M, this allows multiplexing of 2-3 small organisms (<500 Mb) such as insects and worms for producing reference-quality assemblies, calling structural variants for up to 2 samples with ~3 Gb genomes, analysis of 48 microbial genomes, and metagenomic profiling of up to 8 communities. With the improved processivity of the new Sequel II sequencing polymerase, more SMRTbell molecules reach rolling circle mode, resulting in longer overall read lengths and thus allowing efficient detection of barcodes (up to 80%) in the SMRTbell templates. Multiplexing of genomes larger than those of microbial organisms is now achievable. In collaboration with the Wellcome Sanger Institute, we have developed a workflow for multiplexing two individual Anopheles coluzzii using as little as 150 ng of genomic DNA per individual. The resulting assemblies had high contiguity (contig N50s over 3 Mb) and completeness (>98% of conserved genes) for both individuals. For microbial multiplexing, we multiplexed 48 microbes with varying complexities and sizes ranging from 1.6 to 8.0 Mb in a single SMRT Cell 8M. Using a new end-to-end analysis (Microbial Assembly Analysis, SMRT Link 8.0), assemblies resulted in complete circularized genomes (>200-fold coverage) and efficient detection of >3-200 kb plasmids. Finally, the long read lengths (>90 kb) allow detection of barcodes in large-insert SMRTbell templates (>15 kb), thus facilitating multiplexing of two human samples in 1 SMRT Cell 8M for detecting SVs, indels and CNVs. Here, we present results and describe workflows for multiplexing samples for specific applications of SMRT Sequencing.
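
Contiguity in this abstract is reported as contig N50. As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 definition; the contig lengths in the example are made up for illustration and are not results from this work.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    if not lengths:
        raise ValueError("need at least one length")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths in bases.
contigs = [5_000_000, 3_200_000, 3_000_000, 1_000_000, 800_000]
print(n50(contigs))
```

The same function applies to phase-block lengths when an N50 is quoted for haplotype blocks.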


June 1, 2021

Copy-number variant detection with PacBio long reads

Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a healthy human genome. Most of these variants are too large to detect with short reads and too small for array comparative genomic hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH. Standard algorithms rely either on reference-based mapping of reads that fully span a variant or on de novo assembly, but copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies. To comprehensively detect large variants in human genomes, we extended pbsv, a structural variant caller for long reads, to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Together with insertion, deletion, inversion, duplication, and translocation calling from spanning reads, this allows pbsv to comprehensively detect large variants from a single data type.
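
To make the idea of a read-depth signature concrete, the sketch below flags runs of fixed-size bins whose depth departs strongly from the sample median. It is a toy illustration only and is not pbsv's algorithm: it ignores read clipping, and the thresholds, bin size, function name, and depth values are all assumptions chosen for the example.

```python
from statistics import median


def call_depth_cnvs(depths, bin_size, gain=1.5, loss=0.5, min_bins=3):
    """Return (start, end, kind) for runs of at least min_bins consecutive
    bins whose depth / median ratio is >= gain or <= loss."""
    base = median(depths)
    if base <= 0:
        raise ValueError("median depth must be positive")

    def state(depth):
        ratio = depth / base
        if ratio >= gain:
            return "gain"
        if ratio <= loss:
            return "loss"
        return None

    calls = []
    run_start, run_state = 0, None
    # The trailing None is a sentinel that closes a run reaching the last bin.
    for i, s in enumerate([state(d) for d in depths] + [None]):
        if s != run_state:
            if run_state is not None and i - run_start >= min_bins:
                calls.append((run_start * bin_size, i * bin_size, run_state))
            run_start, run_state = i, s
    return calls


# Hypothetical per-bin mean depths for 10 kb bins.
depths = [30, 31, 29, 30, 60, 62, 59, 61, 30, 29, 3, 2, 4, 30, 31]
print(call_depth_cnvs(depths, bin_size=10_000))
```

A production caller would also need to normalize for GC content and mappability and to refine breakpoints, for example with the clipping signal mentioned in the abstract.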


June 1, 2021

Comprehensive variant detection in a human genome with highly accurate long reads

Introduction: Long-read sequencing has been applied successfully to assemble genomes and detect structural variants. However, due to high raw-read error rates (10-15%), it has remained difficult to call small variants from long reads. Recent improvements in library preparation and sequencing chemistry have increased length, accuracy, and throughput of PacBio circular consensus sequencing (CCS) reads, resulting in 15-20kb reads with average read quality above 99%. Materials and Methods: We sequenced a library from human reference sample HG002 to 18-fold coverage on the PacBio Sequel II with two SMRT Cells 8M. The CCS algorithm was used to generate highly accurate (average 99.9%) 12.9kb reads, which were mapped to the hg19 reference with pbmm2. We detected small variants using Google DeepVariant with a model trained for CCS and phased the variants using WhatsHap. Structural variants were detected with pbsv. Variant calls were evaluated against Genome in a Bottle (GIAB) benchmarks. Results: With these reads, DeepVariant achieves SNP and Indel F1 scores of 99.70% and 96.59% against the GIAB truth set, and pbsv achieves 97.72% recall on structural variants longer than 50bp. Using WhatsHap, small variants were phased into haplotype blocks with 145kb N50. The improved mappability of long reads allows us to align to and detect variants in medically relevant genes such as CYP2D6 and PMS2 that have proven “difficult-to-map” with short reads. Conclusions: These highly accurate long reads combine the mappability and ability to detect structural variants of long reads with the accuracy and ability to detect small variants of short reads.
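
The F1 scores quoted above combine precision and recall against a truth set. A minimal sketch of that calculation follows; the counts in the example are invented for illustration and are not taken from the GIAB evaluation.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from true positive,
    false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 9,970 matching calls, 30 extra calls, 30 missed variants.
print(round(f1_score(tp=9_970, fp=30, fn=30), 4))
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why it is a common single-number summary for variant-calling benchmarks.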


June 1, 2021

A workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads

PacBio HiFi reads (minimum 99% accuracy, 15-25 kb read length) have emerged as a powerful data type for comprehensive variant detection in human genomes. The HiFi read length extends confident mapping and variant calling to repetitive regions of the genome that are not accessible with short reads. Read length also improves detection of structural variants (SVs), with recall exceeding that of short reads by over 30%. High read quality allows for accurate single nucleotide variant and small indel detection, with precision and recall matching that of short reads. While many tools have been developed to take advantage of these qualities of HiFi reads, there is no end-to-end workflow for the filtering and prioritization of variants uniquely detected with long reads for rare and undiagnosed disease research. We have developed a flexible, modular workflow and web portal for variant analysis from HiFi reads and applied it to a set of rare disease cases unsolved by short-read whole genome sequencing. We expect that broad application of long-read variant detection workflows will solve many more rare disease cases. We have made these tools available at https://github.com/williamrowell/pbRUGD-workflow, and we hope they serve as a starting point for developing a robust analysis framework for long-read variant detection in rare diseases.


June 1, 2021

Comprehensive variant detection in a human genome with highly accurate long reads

Introduction: Long-read sequencing has revealed more than 20,000 structural variants spanning over 12 Mb in a healthy human genome. Short-read sequencing fails to detect most structural variants but has remained the more effective approach for small variants, due to 10-15% error rates in long reads, and for copy-number variants (CNVs), due to the lack of effective long-read variant callers. The development of PacBio highly accurate long reads (HiFi reads) with read lengths of 10-25 kb and quality >99% presents the opportunity to capture all classes of variation with one approach. Methods: We sequence the Genome in a Bottle benchmark sample HG002 and an individual with a presumed Mendelian disease with HiFi reads. We call SNVs and indels with DeepVariant and extend the structural variant caller pbsv to call CNVs using read depth and clipping signatures. Results: For 18-fold coverage with 13 kb HiFi reads, variant calling in HG002 achieves an F1 score of 99.7% for SNVs, 96.6% for indels, and 96.4% for structural variants. Additionally, we detect more than 300 CNVs spanning around 10 Mb. For the Mendelian disease case, HiFi reads reveal thousands of variants that were overlooked by short-read sequencing, including a candidate causative structural variant. Conclusions: These results illustrate the ability of HiFi reads to comprehensively detect variants, including those associated with human disease.


June 1, 2021

Amplification-free targeted enrichment powered by CRISPR-Cas9 and long-read Single Molecule Real-Time (SMRT) Sequencing can efficiently and accurately sequence challenging repeat expansion disorders

Genomic regions with extreme base composition bias and repetitive sequences have long proven challenging for targeted enrichment methods, as they rely upon some form of amplification. Similarly, most DNA sequencing technologies struggle to faithfully sequence regions of low complexity. This has been especially trying for repeat expansion disorders such as Fragile-X disease, Huntington disease and various ataxias, where the repetitive elements range from several hundred bases to tens of kilobases. We have developed a robust, amplification-free targeted enrichment technique, called No-Amp Targeted Sequencing, that employs the CRISPR-Cas9 system. In conjunction with SMRT Sequencing, which delivers long reads spanning the entire repeat expansion, high consensus accuracy, and uniform coverage, these previously inaccessible regions are now accessible. This method is completely amplification-free, therefore removing any PCR errors and biases from the experiment. Furthermore, this technique preserves native DNA molecules, allowing for direct detection and characterization of epigenetic signatures. The No-Amp method is a two-day protocol that is compatible with multiplexing of multiple targets and multiple samples in a single reaction, using as little as 1 µg of genomic DNA input per sample. We have successfully targeted a number of repeat expansion disorder loci including HTT, FMR1, and C9orf72, and have built an ataxia panel that consists of 15 different disease-causing repeat expansion regions. Using the No-Amp method we have isolated hundreds of individual on-target molecules, allowing for reliable repeat size estimation, mosaicism detection and identification of interruption sequences, with alleles longer than 2700 repeat units (>13 kb). In addition to multiplexing several targets, we have also multiplexed at least 20 samples in one experiment, making the No-Amp Targeted Sequencing method a cost-effective option.
Combining the CRISPR-Cas9 enrichment method with Single Molecule, Real-Time Sequencing provided us with base-level resolution of previously inaccessible regions of the genome, like disease-causing repeat expansions. No-Amp Targeted Sequencing captures, in one experiment, many aspects of repeat expansion disorders which are important for better understanding the underlying disease mechanisms.
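
Repeat size estimation and interruption detection from a read that spans the whole expansion can be illustrated with a simple motif scan. This is a toy sketch rather than the analysis used for No-Amp data: it assumes an error-free read and an exactly known repeat unit, and the function name and example sequence are hypothetical.

```python
import re


def longest_repeat_run(read, motif):
    """Return (unit_count, start) for the longest uninterrupted tandem run
    of motif in read, or (0, -1) if the motif never occurs."""
    if not motif:
        raise ValueError("motif must be non-empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best_units, best_start = 0, -1
    for match in pattern.finditer(read):
        units = len(match.group()) // len(motif)
        if units > best_units:
            best_units, best_start = units, match.start()
    return best_units, best_start


# Hypothetical read: 5 CAG units, a CAA interruption, then 12 CAG units.
read = "ACGT" + "CAG" * 5 + "CAA" + "CAG" * 12 + "TTGA"
print(longest_repeat_run(read, "CAG"))
```

In this example the interruption splits the tract into two runs, so the scan reports the longer 12-unit run; real reads would need alignment-tolerant matching to handle sequencing errors.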


June 1, 2021

Targeting Clinically Significant Dark Regions of the Human Genome with High-Accuracy, Long-Read Sequencing

Introduction: There are many clinically important genes in “dark” regions of the human genome. These regions are characterized as dark due to a paucity of NGS coverage as a result of short-read sequencing or mapping difficulties. Low NGS sequencing yield can arise in these regions due to the presence of various repeat elements or biased base composition while inaccurate mapping is attributable to segmental duplications. Long-read sequencing coupled with an optimized, robust enrichment method has the potential to illuminate these dark regions.

Materials and Methods: Using PacBio highly accurate long-read (HiFi) Sequencing, coupled with a long-PCR targeted enrichment method, we investigated two important dark region genes that are challenging to accurately type with short-read sequencing due to associated pseudogenes: CYP21A2, responsible for congenital adrenal hyperplasia, and GBA, responsible for Gaucher disease. For each gene, our aim was to cover regions of pathogenic mutations in a single contiguous sequence or set of sequences that can be assayed in a single reaction. CYP21A2 and an associated pseudogene CYP21A1P were co-amplified in a single long-range PCR reaction generating a 10.2 kb and 8.9 kb amplicon, respectively. Similarly, GBA and an associated pseudogene GBAP1 were co-amplified in a single long-range PCR reaction generating a 12.6 kb and 16.0 kb amplicon, respectively. Seven Coriell samples for the CYP21A2 target region and 13 Coriell samples for the GBA target region containing known pathogenic mutations were studied in replicate. SMRTbell libraries were generated from pooled amplicons for each target gene and sequenced on a PacBio Sequel II System. Accounting for replicates, each library contained a multiplex of 24 samples. A new PacBio sequence clustering algorithm, pbAA, designed for rapid analysis of HiFi reads from amplicons was used in variant typing.

Results: All pathogenic CYP21A2 and GBA variants were accurately called in the test samples. These variants included whole-gene deletions, gene duplication, gene fusions, and recombinant exons. Additionally, phasing of complex heterozygotes was achieved.

Conclusion: We demonstrate that long-read HiFi Sequencing provides new opportunities for sequencing clinically relevant but previously dark regions of the human genome that are underrepresented in short-read sequencing. Accurate long reads provide important phasing information, identify structural variations, and avoid potential confusion with pseudogenes. SMRT Sequencing of these regions enables a better understanding of the relationship between genetic factors and personal health and has the potential to ultimately help guide health-related decisions.


June 1, 2021

Resolving Complex Pathogenic Alleles using HiFi Long-range Amplicon Data and a New Clustering Algorithm

Many genetic diseases are mapped to structurally complex loci. These regions contain highly similar paralogous alleles (>99% identity) that span kilobases within the human genome. Comprehensive screening for pathogenic variants amongst paralogous sequences is incomplete and labor intensive using short reads or optical mapping. In contrast, long-range targeted amplification and PacBio HiFi sequencing fully and directly resolve and phase a wide range of pathogenic variants without assembly or inference. To capitalize on the accuracy of HiFi amplicon data we designed a new amplicon analysis tool, pbAA. pbAA uses a new sequence clustering algorithm to rapidly deconvolve (separate) a mixture of haplotypes, enabling precise diplotyping and disease allele classification. In this experiment, we analyzed two sets of gene-pseudogene systems, GBA and CYP, that are the second and eighth most common carrier disease alleles, respectively. Samples known to have pathogenic variants that are troublesome to test for with standard short-read assays were selected from the Coriell catalog. Co-amplified long-range PCR amplicons were generated for GBA (12kb)/GBAP1 (15kb), responsible for Gaucher disease, as well as CYP21A2 (10kb)/CYP21A1P (8kb), responsible for congenital adrenal hyperplasia. We obtained 7 samples to test the CYP21A2 region and 13 separate samples for GBA. HiFi reads were then generated from the amplicon libraries on both Sequel and Sequel II Systems, with replicated samples, to achieve a 24-sample multiplex for each target. Consensus amplicons were produced using pbAA, and variants were determined using minimap2 alignments along with a custom SQL database for characterizing and reporting results. From these data we were able to accurately call all pathogenic variants in the test samples for all replicates, including whole-gene deletions, gene duplication, gene fusions, recombinant exons, and phased complex heterozygotes.
In one trio affected by adrenal hyperplasia, three large structural variants were correctly and independently attributed to the parents and proband, including a duplication of CYP21A1P and a CYP21A1P-CYP21A2 gene fusion in the mother and a CYP21A2 deletion in the father. This experiment demonstrates how PacBio HiFi data, analyzed with pbAA, simplifies targeted disease allele identification.


June 1, 2021

Full-Length Sequencing of CYP2D6 Variants with PacBio HiFi Reads

CYP2D6 is a highly polymorphic gene with more than 130 named variants, including deletions, duplications, single nucleotide polymorphisms, and other types of variation (Butler, 2018; Black et al., 2011). These variants affect the rate at which individuals metabolize approximately 25% of common prescription drugs (Owen et al., 2019). PacBio SMRT sequencing is a proven tool for the interrogation of CYP2D6 variants (Qiao et al., 2016; Buermans et al., 2017). Now, with HiFi sequencing, we have developed a streamlined end-to-end workflow for more accurate detection of highly polymorphic CYP2D6 loci. This study also evaluates the advantage of HiFi reads for the sequencing of full-length CYP2D6 genes with variants previously annotated by other technologies.

Twenty-two Coriell pharmacogenomic samples containing variant CYP2D6 alleles were amplified using long-range PCR. The primer pairs for the amplification of upstream CYP2D6 gene duplications and the downstream CYP2D6 genes were adapted from a publication in Pharmacogenomics (Qiao et al., 2019). A 2-step PCR method was used to add a unique barcode to each sample, allowing pooling of multiple samples for SMRTbell library prep. The resulting SMRTbell library was then sequenced on the PacBio Sequel II/IIe system for 20 hours. HiFi reads (>QV20) were demultiplexed in SMRT Link and clustered into haplotypes. The consensus reads of each haplotype were produced using the “pbaa” amplicon analysis and then mapped to the human reference genome GRCh38 for the assignment of CYP2D6 types.

More than 700,000 full-length HiFi reads were generated with an average read length of 8.2 kb and a mean accuracy of 99.9%. Nearly all (>99%) demultiplexed reads were on target to the CYP2D6 locus. Genotyping of the CYP2D6 region with PacBio HiFi reads identified all expected upstream duplications and downstream CYP2D6 alleles, including single nucleotide variants, except for the *5 allele, which is a complete deletion. For 21 of 22 samples, the types from HiFi reads matched the diplotypes identified from microarrays and qPCR, while providing full resolution of each allele. One sample was identified as being mistyped by microarray as *1/*41; HiFi sequencing produced the correct type of *33/*41. In addition, for 4/21 samples HiFi sequencing identified duplications missed by microarray or real-time PCR.

The PCR and sequencing assay we have presented here for the detection of CYP2D6 variants is robust and specific. Assignment of new alleles or duplications on pharmacogenomic samples from HiFi reads suggests that PacBio sequencing technology can reveal new diplotypes that were not characterized accurately by other technologies. This study demonstrates that HiFi sequencing provides much higher resolution than either microarray or real-time PCR for the detection of polymorphic genes, while maintaining sensitivity and accuracy.

