AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technology.
A comparison of 454 GS FLX Ti and PacBio RS in the context of characterizing HIV-1 intra-host diversity.
PacBio 2013 User Group Meeting Presentation Slides: Lance Hepler from UC San Diego’s Center for AIDS Research used the PacBio RS to study intra-host diversity in HIV-1. He compared PacBio’s performance to that of the 454® sequencer, the platform he and his team previously used. Hepler noted that, in general, there was strong agreement between the platforms; where results differed, he said that PacBio data had significantly better reproducibility and accuracy. “PacBio does not suffer from local coverage loss post-processing, whereas 454 has homopolymer problems,” he noted. Hepler said they are moving away from using 454 in favor of the PacBio system.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed because of the coverage biases associated with short-read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obviate the need for assembly. Direct sequencing of DNA in real time has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb).
This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
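The consensus accuracies quoted above (QV60 and QV50) are Phred-scaled quality values. As a minimal sketch, assuming the standard Phred definition QV = -10 log10(error probability), the conversion between a quality value and an expected accuracy can be written as follows. The function names are illustrative and are not part of any PacBio software.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to expected per-base accuracy."""
    # Phred definition: error probability = 10 ** (-QV / 10)
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an expected per-base accuracy (0 <= accuracy < 1) to a Phred QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    for qv in (50, 60):
        print(f"QV{qv}: expected accuracy = {qv_to_accuracy(qv):.6%}")
```

Under this definition, QV60 corresponds to one expected error per million bases, which matches the 99.9999% figure quoted above, and QV50 corresponds to one expected error per 100,000 bases.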
Single Molecule, Real-Time (SMRT) Sequencing provides efficient, streamlined solutions to address new frontiers in plant genomes and transcriptomes. Inherent challenges presented by highly repetitive, low-complexity regions and duplication events are directly addressed with multi-kilobase read lengths exceeding 8.5 kb on average, with many exceeding 20 kb. Differentiating between transcript isoforms that are difficult to resolve with short-read technologies is also now possible. We present solutions available for both reference genome and transcriptome research that best leverage long reads in several plant projects including algae, Arabidopsis, rice, and spinach using only the PacBio platform. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. We will share highlights from our genome projects using the latest P5-C3 chemistry to generate high-quality reference genomes with the highest contiguity, contig N50 exceeding 1 Mb, and average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq protocol will be presented for full transcriptome characterization and targeted surveys of genes with complex structures. PacBio provides the most comprehensive assembly with annotation when combining offerings for both genome and transcriptome research efforts. For more focused investigation, PacBio also offers researchers opportunities to easily investigate and survey genes with complex structures.
Arabica coffee, revered for its taste and aroma, has a complex genome. It is an allotetraploid (2n=4x=44) with a genome size of approximately 1.3 Gb, derived from the recent (< 0.6 Mya) hybridization of two diploid progenitors (2n=2x=22), C. canephora (710 Mb) and C. eugenioides (670 Mb). Both parental species diverged recently (< 4.2 Mya) and their genomes are highly homologous. To facilitate assembly, a dihaploid plant was chosen for sequencing. Initial genome assembly attempts with short-read data produced an assembly covering 1,031 Mb of the C. arabica genome with a contig L50 of 9 kb. Using PacBio long reads at greater than 50x coverage and cutting-edge PacBio software, a de novo PacBio-only genome assembly was constructed that covers 1,042 Mb of the genome with an L50 of 267 kb. The two assemblies were assessed and compared to determine gene content, chimeric regions, and the ability to separate the parental genomes. A genetic map that contains 600 SSRs is being used for anchoring the contigs and improving the sub-genome differentiation, together with the search for sub-genome-specific SNPs. PacBio transcriptome sequencing is currently being added to finalize gene annotation of the polished assembly. The finished genome assembly will be used to guide re-sequencing assemblies of parental genomes (C. canephora and C. eugenioides) as well as a template for GBS analysis and whole genome re-sequencing of a set of C. arabica accessions representative of the species diversity. The obtained data will provide powerful genomic tools to enable more efficient coffee breeding strategies for this crop, which is highly susceptible to climate change and is the main source of income for millions of small farmers in producing countries.
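Contiguity figures such as the 9 kb and 267 kb values above summarize the distribution of contig lengths. The following is a minimal sketch of how such a statistic can be computed from a list of contig lengths. Naming conventions vary: this sketch assumes the common convention in which N50 is a length and L50 is a contig count, whereas the abstract above reports a length under the name L50. The function name and the example lengths are illustrative only and are not taken from the assemblies described.

```python
from typing import Iterable, Tuple


def n50_l50(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a set of contig lengths.

    N50: length of the contig at which the running total, taken over
         contigs sorted from longest to shortest, first reaches half of
         the total assembly size.
    L50: number of contigs needed to reach that point.
    """
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, for illustration only.
    example = [80, 70, 50, 40, 30, 20]
    print(n50_l50(example))
```

Because the statistic depends only on sorted lengths and the total assembly size, two assemblies of similar total size (here 1,031 Mb and 1,042 Mb) can be compared directly on this measure.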
The comprehensive characterization of cancer genomes and epigenomes for understanding drug resistance remains an important challenge in the field of oncology. For example, PC-9, a non-small cell lung cancer (NSCLC) cell line, contains a deletion mutation in exon 19 (DelE746A750) of EGFR that renders it sensitive to erlotinib, an EGFR inhibitor. However, sustained treatment of these cells with erlotinib leads to drug-tolerant cell populations that grow in the presence of erlotinib. The resistant cells can in turn be resensitized to erlotinib upon treatment with methyltransferase inhibitors, suggesting a role of epigenetic modification in the development of drug resistance. We have characterized for the first time cancer genomes of both drug-sensitive and drug-resistant PC-9 cells using long-read PacBio sequencing. The PacBio data allowed us to generate a high-quality, de novo assembly of this cancer genome, enabling the detection of genomic variations at all size scales, including SNPs, structural variations, copy number alterations, gene fusions, and translocations. The data simultaneously provide a global view of epigenetic DNA modifications such as methylation. We will present findings on large-scale changes in the methylation status across the cancer genome as a function of drug sensitivity.
Comprehensive genome and transcriptome structural analysis of a breast cancer cell line using PacBio long read sequencing
Genomic instability is one of the hallmarks of cancer, leading to widespread copy number variations, chromosomal fusions, and other structural variations. The breast cancer cell line SK-BR-3 is an important model for HER2+ breast cancers, which are among the most aggressive forms of the disease and affect one in five cases. Through short read sequencing, copy number arrays, and other technologies, the genome of SK-BR-3 is known to be highly rearranged with many copy number variations, including an approximately twenty-fold amplification of the HER2 oncogene. However, these technologies cannot precisely characterize the nature and context of the identified genomic events and other important mutations may be missed altogether because of repeats, multi-mapping reads, and the failure to reliably anchor alignments to both sides of a variation. To address these challenges, we have sequenced SK-BR-3 using PacBio long read technology. Using the new P6-C4 chemistry, we generated more than 70X coverage of the genome with average read lengths of 9-13kb (max: 71kb). Using Lumpy for split-read alignment analysis, as well as our novel assembly-based algorithms for finding complex variants, we have developed a detailed map of structural variations in this cell line. Taking advantage of the newly identified breakpoints and combining these with copy number assignments, we have developed an algorithm to reconstruct the mutational history of this cancer genome. From this we have discovered a complex series of nested duplications and translocations between chr17 and chr8, two of the most frequent translocation partners in primary breast cancers, resulting in amplification of HER2. We have also carried out full-length transcriptome sequencing using PacBio’s Iso-Seq technology, which has revealed a number of previously unrecognized gene fusions and isoforms. 
Combining long-read genome and transcriptome sequencing technologies enables an in-depth analysis of how changes in the genome affect the transcriptome, including how gene fusions are created across multiple chromosomes. This analysis has established the most complete cancer reference genome available to date, and is already opening the door to applying long-read sequencing to patient samples with complex genome structures.
Reference genome assemblies provide important context in genetics by standardizing the order of genes and providing a universal set of coordinates for individual nucleotides. Because of the high complexity of genic regions and the higher copy number of genes involved in immune function, immunity-related genes are often misassembled in current reference assemblies. This problem is particularly common in the reference genomes of non-model organisms, as they often do not receive the years of curation necessary to resolve annotation and assembly errors. In this study, we reassemble a reference genome of the goat (Capra hircus) using modern PacBio technology in tandem with BioNano Genomics Irys optical maps and Lachesis clustering in order to provide a high-quality reference assembly without the need for extensive filtering. Initial PacBio assemblies using P5C4 chemistry achieved contig N50s of 4 megabases and a BUSCO completion score of 84.0%, which is comparable to several finished model organism reference assemblies. We used BioNano Genomics’ Irys platform to generate 336 scaffolds from this data with a scaffold N50 of 24 megabases and total genome coverage of 98%. Lachesis interaction maps were used with a clustering algorithm to associate Irys scaffolds into the expected 30 chromosome physical maps. Comparisons of the initial hybrid scaffolds generated from the long-read contigs and optical map information to a previously generated RH map revealed that the entirety of the goat autosome 20 physical map was contained within one scaffold. Additionally, the BioNano scaffolding resolved several difficult regions that contained genes related to innate immunity, which were problem regions in previous reference genome assemblies.
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LD). This level of complexity and the inherent biases of short-read sequencing make it challenging to extract immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software to extract or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity, and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of the genes contained within them. All the genotype information is detected at allele level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
Multiplex target enrichment using barcoded multi-kilobase fragments and probe-based capture technologies
Target enrichment capture methods allow scientists to rapidly interrogate important genomic regions of interest for variant discovery, including SNPs, gene isoforms, and structural variation. Custom targeted sequencing panels are important for characterizing heterogeneous, complex diseases and uncovering the genetic basis of inherited traits with more uniform coverage when compared to PCR-based strategies. With the increasing availability of high-quality reference genomes, customized gene panels are readily designed with high specificity to capture genomic regions of interest, thus enabling scientists to expand their research scope from a single individual to larger cohort studies or population-wide investigations. Coupled with PacBio® long-read sequencing, these technologies can capture 5 kb fragments of genomic DNA (gDNA), which are useful for interrogating intronic, exonic, and regulatory regions, characterizing complex structural variations, distinguishing between gene duplications and pseudogenes, and interpreting variant haplotypes. In addition, SMRT® Sequencing offers the lowest GC-bias and can sequence through repetitive regions. We demonstrate the additional insights possible by using in-depth long-read capture sequencing for key immunology, drug-metabolizing, and disease-causing genes such as HLA, filaggrin, and cancer-associated genes.
Characterizing haplotype diversity at the immunoglobulin heavy chain locus across human populations using novel long-read sequencing and assembly approaches
The human immunoglobulin heavy chain locus (IGH) remains among the most understudied regions of the human genome. Recent efforts have shown that haplotype diversity within IGH is elevated and exhibits population specific patterns; for example, our re-sequencing of the locus from only a single chromosome uncovered >100 Kb of novel sequence, including descriptions of six novel alleles, and four previously unmapped genes. Historically, this complex locus architecture has hindered the characterization of IGH germline single nucleotide, copy number, and structural variants (SNVs; CNVs; SVs), and as a result, there remains little known about the role of IGH polymorphisms in inter-individual antibody repertoire variability and disease. To remedy this, we are taking a multi-faceted approach to improving existing genomic resources in the human IGH region. First, from whole-genome and fosmid-based datasets, we are building the largest and most ethnically diverse set of IGH reference assemblies to date, by employing PacBio long-read sequencing combined with novel algorithms for phased haplotype assembly. In total, our effort will result in the characterization of >15 phased haplotypes from individuals of Asian, African, and European descent, to be used as a representative reference set by the genomics and immunogenetics community. Second, we are utilizing this more comprehensive sequence catalogue to inform the design and analysis of novel targeted IGH genotyping assays. Standard targeted DNA enrichment methods (e.g., exome capture) are currently optimized for the capture of only very short (100’s of bp) DNA segments. Our platform uses a modified bench protocol to pair existing capture-array technologies with the enrichment of longer fragments of DNA, enabling the use of PacBio sequencing of DNA segments up to 7 Kb. 
This substantial increase in contiguity disambiguates many of the complex repeated structures inherent to the locus, while yielding the base pair fidelity required to call SNVs. Together these resources will establish a stronger framework for further characterizing IGH genetic diversity and facilitate IGH genomic profiling in the clinical and research settings, which will be key to fully understanding the role of IGH germline variation in antibody repertoire development and disease.
Structural variants (SVs), genomic differences of ≥50 base pairs, are few by count compared to single nucleotide variants (SNVs) and indels but include most of the base pairs that differ between two humans.
Structural variant detection with long read sequencing reveals driver and passenger mutations in a melanoma cell line
Past large-scale cancer genome sequencing efforts, including The Cancer Genome Atlas and the International Cancer Genome Consortium, have utilized short-read sequencing, which is well-suited for detecting single nucleotide variants (SNVs) but far less reliable for detecting variants larger than 20 base pairs, including insertions, deletions, duplications, inversions and translocations. Recent same-sample comparisons of short- and long-read human reference genome data have revealed that short-read resequencing typically uncovers only ~4,000 structural variants (SVs, ≥50 bp) per genome and is biased towards deletions, whereas sequencing with PacBio long reads consistently finds ~20,000 SVs, evenly balanced between insertions and deletions. This discovery has important implications for cancer research, as it is clear that SVs are both common and biologically important in many cancer subtypes, including colorectal, breast and ovarian cancer. Without confident and comprehensive detection of structural variants, it is unlikely we have a sufficiently complete picture of all the genomic changes that impact cancer development, disease progression, treatment response, drug resistance, and relapse. To begin to address this unmet need, we have sequenced the COLO829 tumor and matched normal lymphoblastoid cell lines to 49- and 51-fold coverage, respectively, with PacBio SMRT Sequencing, with the goal of developing a high-confidence structural variant call set that can be used to empirically evaluate cost-effective experimental designs for larger scale studies and develop structural variation calling software suitable for cancer genomics. Structural variant calling revealed over 21,000 deletions and 19,500 insertions larger than 20 bp, nearly four times the number of events detected with short-read sequencing. The vast majority of events are shared between the tumor and normal, with about 100 putative somatic deletions and 400 insertions, primarily in microsatellites.
A further 40 rearrangements were detected, nearly exclusively in the tumor. One rearrangement, t(5;X), is shared between the tumor and normal; it disrupts the mismatch repair gene MSH3 and is likely a driver mutation. Generating high-confidence call sets that cover the entire size spectrum of somatic variants from a range of cancer model systems is the first step in determining the best approach for addressing an ongoing blind spot in our current understanding of cancer genomes. Here the application of PacBio sequencing to a melanoma cancer cell line revealed thousands of previously overlooked variants, including a mutation likely involved in tumorigenesis.
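The tumor/normal comparison described above amounts to filtering calls by size and then separating events shared with the matched normal from putative somatic, tumor-only events. The sketch below illustrates that idea under strong simplifying assumptions: variants are matched only when chromosome, position, type, and length are identical, whereas real comparisons generally tolerate small differences in breakpoints and lengths. The `Variant` type, the function names, and the example calls are hypothetical and do not represent the software or data used in the study.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    svtype: str  # e.g. "INS" or "DEL"
    length: int  # variant size in bp


def filter_by_size(calls: Iterable[Variant], min_len: int = 50) -> List[Variant]:
    """Keep calls at or above a size threshold (the text uses both 20 bp and 50 bp)."""
    return [v for v in calls if v.length >= min_len]


def split_shared_and_somatic(
    tumor: Iterable[Variant], normal: Iterable[Variant]
) -> Tuple[List[Variant], List[Variant]]:
    """Split tumor calls into (shared with normal, putative somatic).

    Assumption: exact matching on all fields; this is a simplification.
    """
    normal_set = set(normal)
    shared, somatic = [], []
    for v in tumor:
        (shared if v in normal_set else somatic).append(v)
    return shared, somatic


if __name__ == "__main__":
    # Hypothetical calls, for illustration only.
    normal_calls = [Variant("chr1", 1000, "DEL", 320), Variant("chr2", 500, "INS", 75)]
    tumor_calls = normal_calls + [Variant("chr3", 42000, "DEL", 1500), Variant("chr4", 10, "INS", 12)]
    shared, somatic = split_shared_and_somatic(filter_by_size(tumor_calls), normal_calls)
    print(len(shared), "shared;", len(somatic), "putative somatic")
```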
To comprehensively detect large variants in human genomes, we have extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a typical human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH and short reads. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies.
Long-read sequencing of diverse humans has revealed more than 20,000 insertion, deletion, and inversion structural variants spanning more than 12 Mb in a healthy human genome. Most of these variants are too large to detect with short reads and too small for array comparative genome hybridization (aCGH). While the standard approaches to calling structural variants with long reads thrive in the 50 bp to 10 kb size range, they tend to miss exactly the large (>50 kb) copy-number variants that are called more readily with aCGH. Standard algorithms rely on reference-based mapping of reads that fully span a variant or on de novo assembly; and copy-number variants are often too large to be spanned by a single read and frequently involve segmentally duplicated sequence that is not yet included in most de novo assemblies. To comprehensively detect large variants in human genomes, we extended pbsv – a structural variant caller for long reads – to call copy-number variants (CNVs) from read-clipping and read-depth signatures. In human germline benchmark samples, we detect more than 300 CNVs spanning around 10 Mb, and we call hundreds of additional events in re-arranged cancer samples. Together with insertion, deletion, inversion, duplication, and translocation calling from spanning reads, this allows pbsv to comprehensively detect large variants from a single data type.
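To illustrate what a read-depth signature means in general, the sketch below flags runs of fixed-size genomic bins whose depth departs from the sample median and merges adjacent bins with the same state into segments. This is not pbsv's algorithm, which the abstract does not describe in detail; the function names, the log2-ratio thresholds, and the example depths are all assumptions chosen for illustration.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def classify_bin(depth: float, baseline: float,
                 gain_log2: float = 0.4, loss_log2: float = -0.6) -> str:
    """Label one bin as DUP, DEL, or REF from its log2 depth ratio.

    Thresholds are hypothetical; a heterozygous duplication in a diploid
    genome gives a ratio near log2(3/2) and a heterozygous deletion near
    log2(1/2).
    """
    ratio = math.log2(max(depth, 1e-3) / baseline)  # floor avoids log2(0)
    if ratio >= gain_log2:
        return "DUP"
    if ratio <= loss_log2:
        return "DEL"
    return "REF"


def call_cnv_segments(bin_depths: Sequence[float], bin_size: int,
                      gain_log2: float = 0.4,
                      loss_log2: float = -0.6) -> List[Tuple[str, int, int]]:
    """Return (state, start, end) segments for consecutive non-REF bins."""
    baseline = median(bin_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    segments: List[Tuple[str, int, int]] = []
    state, start = None, 0
    # A trailing None sentinel closes any segment that is still open.
    for i, depth in enumerate(list(bin_depths) + [None]):
        new_state = None if depth is None else classify_bin(
            depth, baseline, gain_log2, loss_log2)
        if new_state != state:
            if state in ("DUP", "DEL"):
                segments.append((state, start * bin_size, i * bin_size))
            state, start = new_state, i
    return segments


if __name__ == "__main__":
    # Hypothetical mean depths for ten consecutive 10 kb bins.
    depths = [30, 31, 29, 45, 46, 44, 30, 15, 14, 30]
    for segment in call_cnv_segments(depths, bin_size=10_000):
        print(segment)
```

A depth-based approach of this kind does not require any single read to span the event, which is why read depth can complement spanning-read signals for variants larger than the read length.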