In today’s clinical diagnostic laboratories, the detection of the disease causing mutations is either done through genotyping or Sanger sequencing. Whether done singly or in a multiplex assay, genotyping works only if the exact molecular change is known. Sanger sequencing is the gold standard method that captures both known and novel molecular changes in the disease gene of interest. Most clinical Sanger sequencing assays involve PCR-amplifying the coding sequences of the disease target gene followed by bi-directional sequencing of the amplified products. Therefore for every patient sample, one generates multiple amplicons singly and each amplicon leads to two separate sequencing reactions. Single Molecule, Real-Time (SMRT) sequencing offers several advantages to Sanger sequencing including long read lengths, first-in-first-out processing, fast time to result, high-levels of multiplexing and substantially reduced costs. For our first proof-of-concept experiment, we queried 3 known disease-associated mutations in de-identified clinical samples. We started off with 3 autosomal recessive diseases found at an increased frequency in the Ashkenazi Jewish population: Tay Sachs disease, Niemann-Pick disease and Canavan disease. The mutated gene in Tays Sachs is HEXA, Niemann-Pick is SMPD1 and Canavan is ASPA. Coding exons were amplified in multiple (6-13) amplicons for each gene from both non-carrier and carriers. Amplicons were purified, concentrations normalized, and combined prior to SMRTbell™ Library prep. A single SMRTbell library was sequenced for each gene from each patient using standard Pacific Biosciences C2 chemistry and protocols. Average read lengths of 4,000 bp across samples allowed for high-quality Circular Consensus Sequences (CCS) across all amplicons (all less than 1 kb). This high quality CCS data permitted the clean partitioning of reads from a patient in the presence of heterozygous events. Using non-carrier sequencing as a control, we were able to correctly identify the known events in carrier genes. This suggests the potential utility of SMRT sequencing in a clinical setting, enabling a cost-effective method of replacing targeted mutation detection with sequencing of the entire gene.
Advances in sequence consensus and clustering algorithms for effective de novo assembly and haplotyping applications.
One of the major applications of DNA sequencing technology is to bring together information that is distant in sequence space so that understanding genome structure and function becomes easier on a large scale. The Single Molecule Real Time (SMRT) Sequencing platform provides direct sequencing data that can span several thousand bases to tens of thousands of bases in a high-throughput fashion. In contrast to solving genomic puzzles by patching together smaller piece of information, long sequence reads can decrease potential computation complexity by reducing combinatorial factors significantly. We demonstrate algorithmic approaches to construct accurate consensus when the differences between reads are dominated by insertions and deletions. High-performance implementations of such algorithms allow more efficient de novo assembly with a pre-assembly step that generates highly accurate, consensus-based reads which can be used as input for existing genome assemblers. In contrast to recent hybrid assembly approach, only a single ~10 kb or longer SMRTbell library is necessary for the hierarchical genome assembly process (HGAP). Meanwhile, with a sensitive read-clustering algorithm with the consensus algorithms, one is able to discern haplotypes that differ by less than 1% different from each other over a large region. One of the related applications is to generate accurate haplotype sequences for HLA loci. Long sequence reads that can cover the whole 3 kb to 4 kb diploid genomic regions will simplify the haplotyping process. These algorithms can also be applied to resolve individual populations within mixed pools of DNA molecules that are similar to each, e.g., by sequencing viral quasi-species samples.
Allele-level sequencing and phasing of full-length HLA class I and II genes using SMRT Sequencing technology
The three classes of genes that comprise the MHC gene family are actively involved in determining donor-recipient compatibility for organ transplant, as well as susceptibility to autoimmune diseases via cross-reacting immunization. Specifically, Class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DQ and -DP are considered medically important for genetic analysis to determine histocompatibility. They are highly polymorphic and have thousands of alleles implicated in disease resistance and susceptibility. The importance of full-length HLA gene sequencing for genotyping, detection of null alleles, and phasing is now widely acknowledged. While DNA-sequencing-based HLA genotyping has become routine, only 7% of the HLA genes have been characterized by allele-level sequencing, while 93% are still defined by partial sequences. The gold-standard Sanger sequencing technology is being quickly replaced by second-generation, high- throughput sequencing methods due to its inability to generate unambiguous phased reads from heterozygous alleles. However, although these short, high-throughput, clonal sequencing methods are better at heterozygous allele detection, they are inadequate at generating full-length haploid gene sequences. Thus, full-length gene sequencing from an enhancer-promoter region to a 3’UTR that includes phasing information without the need for imputation still remains a technological challenge. The best way to overcome these challenges is to sequence these genes with a technology that is clonal in nature and has the longest possible read lengths. We have employed Single Molecule Real-Time (SMRT) sequencing technology from Pacific Biosciences for sequencing full-length HLA class I and II genes.
Isoform sequencing: Unveiling the complex landscape of the eukaryotic transcriptome on the PacBio RS II.
Alternative splicing of RNA is an important mechanism that increases protein diversity and is pervasive in the most complex biological functions. While advances in RNA sequencing methods have accelerated our understanding of the transcriptome, isoform discovery remains computationally challenging due to short read lengths. Here, we describe the Isoform Sequencing (Iso-Seq) method using long reads generated by the PacBio RS II. We sequenced rat heart and lung RNA using the Clontech® SMARTer® cDNA preparation kit followed by size selection using agarose gel. Additionally, we tested the BluePippin™ device from Sage Science for efficiently extracting longer transcripts = 3 kb. Post-sequencing, we developed a novel isoform-level clustering algorithm to generate high-quality transcript consensus sequences. We show that our method recovered alternative splice forms as well as alternative stop sites, antisense transcription, and retained introns. To conclude, the Iso-Seq method provides a new opportunity for researchers to study the complex eukaryotic transcriptome even in the absence of reference genomes or annotated transcripts.
A novel analytical pipeline for de novo haplotype phasing and amplicon analysis using SMRT Sequencing technology.
While the identification of individual SNPs has been readily available for some time, the ability to accurately phase SNPs and structural variation across a haplotype has been a challenge. With individual reads of an average length of 9 kb (P5-C3), and individual reads beyond 30 kb in length, SMRT Sequencing technology allows the identification of mutation combinations such as microdeletions, insertions, and substitutions without any predetermined reference sequence. Long- amplicon analysis is a novel protocol that identifies and reports the abundance of differing clusters of sequencing reads within a single library. Graphs generated via hierarchical clustering of individual sequencing reads are used to generate Markov models representing the consensus sequence of individual clusters found to be significantly different. Long-amplicon analysis is capable of differentiating between underlying sequences that are 99.9% similar, which is suitable for haplotyping and differentiating pseudogenes from coding transcripts. This protocol allows for the identification of structural variation in the MUC5AC gene sequence, despite the presence of a gap in the current genome assembly, and can also be used for HLA haplotyping. Clustering can also been applied to identify full length transcripts for the purpose of estimating consensus sequences and enumerating isoform types. Long-amplicon analysis allows for the elucidation of complex regions otherwise missed by other sequencing technologies, which may contribute to the diagnosis and understanding of otherwise complex diseases.
Isoform sequencing: Unveiling the complex landscape in eukaryotic transcriptome on the PacBio RS II.
Advances in RNA sequencing have accelerated our understanding of the transcriptome, however isoform discovery remains challenging due to short read lengths. The Iso-Seq Application provides a new alternative to sequence full-length cDNA libraries using long reads from the PacBio RS II. Identification of long and often rare isoforms is demonstrated with rat heart and lung RNA prepared using the Clontech® SMARTer® cDNA preparation kit, followed by agarose-gel size selection in fractions of 1-2 kb, 2-3 kb and 3-6 kb. For each tissue, 1.8 and 1.2 million reads were obtained from 32 and 26 SMRT Cells, respectively. Filtering for reads with both adapters and polyA tail signals yielded >50% putative full-length transcripts. To improve consensus accuracy, we developed an isoform-level clustering algorithm ICE (Iterative Clustering for Error Correction), and polished full-length consensus sequences from ICE using Quiver. This method generated full-length transcripts up to 4.5 kb with = 99% post-correction accuracy. Compared with known rat genes, the Iso-Seq method not only recovered the majority of currently annotated isoforms, but also several unannotated novel isoforms with identified homologs in the RefSeq database. Additionally, alternative stop sites, extended UTRs, and retained introns were detected.
Long Amplicon Analysis: Highly accurate, full-length, phased, allele-resolved gene sequences from multiplexed SMRT Sequencing data.
The correct phasing of genetic variations is a key challenge for many applications of DNA sequencing. Allele-level resolution is strongly preferred for histocompatibility sequencing where recombined genes can exhibit different compatibilities than their parents. In other contexts, gene complementation can provide protection if deleterious mutations are found on only one allele of a gene. These problems are especially pronounced in immunological domains given the high levels of genetic diversity and recombination seen in regions like the Major Histocompatibility Complex. A new tool for analyzing Single Molecule, Real-Time (SMRT) Sequencing data – Long Amplicon Analysis (LAA) – can generate highly accurate, phased and full-length consensus sequences for multiple genes in a single sequencing run.
Unique haplotype structure determination in human genome using Single Molecule, Real-Time (SMRT) Sequencing of targeted full-length fosmids.
Determination of unique individual haplotypes is an essential first step toward understanding how identical genotypes having different phases lead to different biological interpretations of function, phenotype, and disease. Genome-wide methods for identifying individual genetic variation have been limited in their ability to acquire phased, extended, and complete genomic sequences that are long enough to assemble haplotypes with high confidence. We explore a recombineering approach for isolation and sequencing of a tiling of targeted fosmids to capture interesting regions from human genome. Each individual fosmid contains large genomic fragments (~35?kb) that are sequenced with long-read SMRT technology to generate contiguous long reads. These long reads can be easily de novo assembled for targeted haplotype resolution within an individual’s genomes. The P5-C3 chemistry for SMRT Sequencing generated contiguous, full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with minimal data processing. The phase preserved in fosmid clones spanned at least two heterozygous variant loci, providing the essential detail of precise haplotype structures. We show complete assembly of haplotypes for various targeted loci, including the complex haplotypes of the KIR locus (~150 to 200 kb) and conserved extended haplotypes (CEHs) of the MHC region. This method is easily applicable to other regions of the human genome, as well as other genomes.
Fully phased allele-level sequencing of highly polymorphic HLA genes is greatly facilitated by SMRT Sequencing technology. In the present work, we have evaluated multiple DNA barcoding strategies for multiplexing several loci from multiple individuals, using three different tagging methods. Specifically MHC class I genes HLA-A, -B, and –C were indexed via DNA Barcodes by either tailed primers or barcoded SMRTbell adapters. Eight different 16-bp barcode sequences were used in symmetric & asymmetric pairing. Eight DNA barcoded adapters in symmetric pairing were independently ligated to a pool of HLA-A, -B and –C for eight different individuals, one at a time and pooled for sequencing on a single SMRT Cell. Amplicons generated from barcoded primers were pooled upfront for library generation. Eight symmetric barcoded primers were generated for HLA class I genes. These primers facilitated multiplexing of 8 samples and also allowed generation of unique asymmetric pairings for simultaneous amplification from 28 reference genomic DNA samples. The data generated from all 3 methods was analyzed using LAA protocol in SMRT analysis V2.3. Consensus sequences generated were typed using GenDx NGS engine HLA-typing software.
Goat is an important source of milk, meat, and fiber, especially in developing countries. An advantage of goats as livestock is the low maintenance requirements and high adaptability compared to other milk producers. The global population of domestic goats exceeds 800 million. In Africa, goat production is characterized by low productivity levels, and attempts to introduce more productive breeds have met with poor success due in part to nutritional constraints. It has been suggested that incorporation of selective breeding within the herds adapted for survival could represent one approach to improving food security across Africa. A recently produced genome assembly of a Chinese Yunnan breed goat, based on 192 Gb of short reads across a range of insert sizes from 180 bp to 20 kb, reported a contig N50 of 18.7 kb. The scaffold N50 was improved from 2.2 Mb to 3.1 Mb by addition of fosmid end sequence, with an estimated 140 million Ns in gaps and 91% coverage. The assembly has proven somewhat problematic for pursuing genome-wide association analysis with SNP arrays, apparently due in part to errors in ordering of markers using the draft genome. In order to provide a higher quality assembly, we sequenced a highly inbred, San Clemente breed goat genome using 458 SMRT cells on the Pacific Biosciences platform. These cells generated 193.5 Gbases of sequence after processing into subreads, with mean 5110 bases and max subread length of 40.5 kb. This sequence data generated an assembly using the recently reported MHAP error correction approach and Celera Assembler v8.2. The contig N50 was 2.5 Mb, with the largest contig spanning 19.5 Mb. Additional characteristics of the assembly will be presented.
For comprehensive metabolic reconstructions and a resulting understanding of the pathways leading to natural products, it is desirable to obtain complete information about the genetic blueprint of the organisms used. Traditional Sanger and next-generation, short-read sequencing technologies have shortcomings with respect to read lengths and DNA-sequence context bias, leading to fragmented and incomplete genome information. The development of long-read, single molecule, real-time (SMRT) DNA sequencing from Pacific Biosciences, with >10,000 bp average read lengths and a lack of sequence context bias, now allows for the generation of complete genomes in a fully automated workflow. In addition to the genome sequence, DNA methylation is characterized in the process of sequencing. PacBio® sequencing has also been applied to microbial transcriptomes. Long reads enable sequencing of full-length cDNAs allowing for identification of complete gene and operon sequences without the need for transcript assembly. We will highlight several examples where these capabilities have been leveraged in the areas of industrial microbiology, including biocommodities, biofuels, bioremediation, new bacteria with potential commercial applications, antibiotic discovery, and livestock/plant microbiome interactions.
As a cost-effective alternative to whole genome human sequencing, targeted sequencing of specific regions, such as exomes or panels of relevant genes, has become increasingly common. These methods typically include direct PCR amplification of the genomic DNA of interest, or the capture of these targets via probe-based hybridization. Commonly, these approaches are designed to amplify or capture exonic regions and thereby result in amplicons or fragments that are a few hundred base pairs in length, a length that is well-addressed with short-read sequencing technologies. These approaches typically provide very good coverage and can identify SNPs in the targeted region, but are unable to haplotype these variants. Here we describe a targeted sequencing workflow that combines Roche NimbleGen’s SeqCap EZ enrichment technology with Pacific Biosciences’ SMRT Sequencing to provide a more comprehensive view of variants and haplotype information over multi-kilobase regions. While the SeqCap EZ technology is typically used to capture 200 bp fragments, we demonstrate that 6 kb fragments can also be utilized to enrich for long fragments that extend beyond the targeted capture site and well into (and often across) the flanking intronic regions. When combined with the long reads of SMRT Sequencing, multi-kilobase regions of the human genome can be phased and variants detected in exons, introns and intergenic regions.
Multiplexing human HLA class I & II genotyping with DNA barcode adapters for high throughput research.
Human MHC class I genes HLA-A, -B, -C, and class II genes HLA-DR, -DP and -DQ, play a critical role in the immune system as major factors responsible for organ transplant rejection. The have a direct or linkage-based association with several diseases, including cancer and autoimmune diseases, and are important targets for clinical and drug sensitivity research. HLA genes are also highly polymorphic and their diversity originates from exonic combinations as well as recombination events. A large number of new alleles are expected to be encountered if these genes are sequenced through the UTRs. Thus allele-level resolution is strongly preferred when sequencing HLA genes. Pacific Biosciences has developed a method to sequence the HLA genes in their entirety within the span of a single read taking advantage of long read lengths (average >10 kb) facilitated by SMRT technology. A highly accurate consensus sequence (=99.999 or QV50 demonstrated) is generated for each allele in a de novo fashion by our SMRT Analysis software. In the present work, we have combined this imputation-free, fully phased, allele-specific consensus sequence generation workflow and a newly developed DNA-barcode-tagged SMRTbell sample preparation approach to multiplex 96 individual samples for sequencing all of the HLA class I and II genes. Commercially available NGS-go reagents for full-length HLA class I and relevant exons of class II genes were amplified for hi-resolution HLA sequencing. The 96 samples included 72 that are part of UCLA reference panel and had pre-typing information available for 2 fields, based on gold standard SBT methods. SMRTbell adapters with 16 bp barcode tags were ligated to long amplicons in symmetric pairing. PacBio sequencing was highly effective in generating accurate, phased sequences of full-length alleles of HLA genes. In this work we demonstrate scalability of HLA sequencing using off the shelf assays for research applications to find biological significance in full-length sequencing.
Assembly of complete KIR haplotypes from a diploid individual by the direct sequencing of full-length fosmids.
We show that linearizing and directly sequencing full-length fosmids simplifies the assembly problem such that it is possible to unambiguously assemble individual haplotypes for the highly repetitive 100-200 kb killer Ig-like receptor (KIR) gene loci of chromosome 19. A tiling of targeted fosmids can be used to clone extended lengths of genomic DNA, 100s of kb in length, but repeat complexity in regions of particular interest, such as the KIR locus, means that sequence assembly of pooled samples into complete haplotypes is difficult and in many cases impossible. The current maximum read length generated by SMRT Sequencing exceeds the length of a 40 kb fosmid; it is therefore possible to span an entire fosmid in one sequencing read. Shearing, sequencing and assembling fosmids in a shotgun approach is prone to errors when the underlying sequence is highly repetitive. We show that it is possible to directly sequence linearized fosmids and generate a high-quality consensus by simple alignment, removing the need for an error-prone assembly step. The high-quality sequence of complete fosmids can then be tiled into full haplotypes. We demonstrate the method on DNA samples from a number of individuals and fully recover the sequence of both haplotypes from a pool of KIR fosmids. The ability to haplotype and sequence complex immunogenetic regions will bring exciting opportunities to explore the evolution of disease associations of the immune sub-genome. This simple and robust approach can be scaled-up allowing a complex genomic region to be sequenced at a population level. We expect such sequencing to be valuable in disease association research.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enables high quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid aware de novo assembly of Craig Venter’s well-studied genome.