As a cost-effective alternative to whole genome human sequencing, targeted sequencing of specific regions, such as exomes or panels of relevant genes, has become increasingly common. These methods typically include direct PCR amplification of the genomic DNA of interest, or the capture of these targets via probe-based hybridization. Commonly, these approaches are designed to amplify or capture exonic regions and thereby result in amplicons or fragments that are a few hundred base pairs in length, a length that is well-addressed with short-read sequencing technologies. These approaches typically provide very good coverage and can identify SNPs in the targeted region, but are unable to haplotype these variants. Here we describe a targeted sequencing workflow that combines Roche NimbleGen’s SeqCap EZ enrichment technology with Pacific Biosciences’ SMRT Sequencing to provide a more comprehensive view of variants and haplotype information over multi-kilobase regions. While the SeqCap EZ technology is typically used to capture 200 bp fragments, we demonstrate that 6 kb fragments can also be utilized to enrich for long fragments that extend beyond the targeted capture site and well into (and often across) the flanking intronic regions. When combined with the long reads of SMRT Sequencing, multi-kilobase regions of the human genome can be phased and variants detected in exons, introns and intergenic regions.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enables high quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid aware de novo assembly of Craig Venter’s well-studied genome.
Full-length gene capture solutions offer opportunities to screen and characterize structural variations and genetic diversity to understand key traits in plants and animals. Through a combined Roche NimbleGen probe capture and SMRT Sequencing strategy, we demonstrate the capability to resolve complex gene structures often observed in plant defense and developmental genes spanning multiple kilobases. The custom panel includes members of the WRKY plant-defense-signaling family, members of the NB-LRR disease-resistance family, and developmental genes important for flowering. The presence of repetitive structures and low-complexity regions makes short-read sequencing of these genes difficult, yet this approach allows researchers to obtain complete sequences for unambiguous resolution of gene models. This strategy has been applied to genomic DNA samples from soybean coupled with barcoding for multiplexing.
Over the last few years, several advances were implemented in the PacBio RS II System to maximize throughput and efficiency while reducing the cost per sample. The number of useable bases per SMRT Cell now exceeds 1 Gb with the latest P6-C4 chemistry and 6-hour movies. For applications such as microbial sequencing, targeted sequencing, Iso-Seq (full-length isoform sequencing) and Nimblegen’s target enrichment method, current SMRT Cell yields could be an excess relative to project requirements. To this end, barcoding is a viable option for multiplexing samples. For microbial sequencing, multiplexing can be accomplished by tagging sheared genomic DNA during library construction with modified SMRTbell adapters. We studied the performance of 2- to 8-plex microbial sequencing. For full-length amplicon sequencing such as HLA typing, amplicons as large as 5 kb may be barcoded during amplification using barcoded locus-specific primers. Alternatively, amplicons may be barcoded during SMRTbell library construction using barcoded SMRTbell adapters. The preferred barcoding strategy depends on the user’s existing workflow and flexibility to changing and/or updating existing workflows. Using barcoded adapters, five Class I and II genes (3.3 – 5.8 kb) x 96 patients can be multiplexed and typed. For Iso-Seq full-length cDNA sequencing, barcodes are incorporated during 1st-strand synthesis and are enabled by tailing the oligo-dT primer with any PacBio published 16-bp barcode sequences. RNA samples from 6 maize tissues were multiplexed to generate barcoded cDNA libraries. The NimbleGen SeqCap Target Enrichment method, combined with PacBio’s long-read sequencing, provides comprehensive view of multi-kilobase contiguous regions, both exonic and intronic regions. To make this cost effective, we recommend barcoding samples for pooling prior to target enrichment and capture. Here, we present specific examples of strategies and best practices for multiplexing samples for different applications for SMRT Sequencing. Additionally, we describe recommendations for analyzing barcoded samples.
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LDs). This level of complexity and inherent biases of short-read sequencing make it challenging for extracting immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS based genome and exome sequencing and SNP arrays have become a routine for population studies, numerous efforts are being made for developing software to extract and or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases, has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of genes comprised within them. All the genotype information is detected at allele- level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. De novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
The increased sequencing throughput creates a need for multiplexing for several applications. We are here detailing different barcoding strategies for microbial sequencing, targeted sequencing, Iso-Seq full-length isoform sequencing, and Roche NimbleGen’s target enrichment method.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enables high quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important in understanding the genetic basis for human disease, and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
Structural variant in the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene found to be linked to bipolar disorder
Bipolar disorder (BD) is a phenotypically and genetically complex neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin and adoption studies supporting the involvement of a genetic predisposition with estimated heritability up to ~ 80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BD susceptibility, however no disease genes have yet to be identified. Here, we have fully characterized a ~12 Mb significantly linked (lod score=3.54) genomic region on chromosome Xq24-q27 in an extended family from a genetic isolate that was using long-read single molecule, real-time (SMRT) sequencing. The family segregates BD in at least 4 generations with 16 individuals out of 61 affected. Thus, this family portrays a highly elevated reoccurrence risk compared to the general population. It is expected that the genetic complexity would be reduced in isolated populations, even in genetically complex disorders such as BD, as in the case of this extended family. We selected 16 key individuals from the X-chromosomally linked family to be sequenced. These selected individuals either carried the disease haplotype, were non-carriers of the disease haplotype, or served as married-in controls. We designed a Nimblegen capture array enriching for 5-9 kb fragments spanning the entire 12 Mb region that were then sequenced using long-read SMRT sequencing to screen for causative structural variants (SVs) explaining the increased risk for BD in this extended family. Altogether, 192 SVs were detected in the critically linked region however most of these represented common variants that could be seen across many of the family members regardless of the disease status. One SV stood out that showed perfect segregation among all affected individuals that were carriers of the disease haplotype. This was a 330bp Alu deletion in intron 4 of the RNA Binding Motif Protein, X-Linked 2 (RBMX2) gene that has previously been shown to play a central role in brain development and function. Moreover, Alu elements in general have also previously been associated with at least 37 neurological and neurodegenerative disorders. In order to validate the finding and the functionality of the identified SV further studies like isoform characterization are warranted.