Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods that provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA Sequencing with short reads (SMRT circular consensus sequencing (CCS) or second-generation reads), wherein the short reads are used to error-correct the long reads, which are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which SMRT sequencing reads from a single long insert library are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT Sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. With relatively short sequencing run times and automated analysis pipelines, it is possible to go from an unknown DNA sample to its complete de novo genome and epigenome in about a day.
Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods that provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA sequencing with short, high-accuracy reads (SMRT circular consensus sequencing (CCS) or second-generation reads) to generate long, highly accurate reads that are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which long SMRT sequencing reads (average read lengths >5,000 bases) are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. The process has been automated and requires less than 1 day from an unknown DNA sample to its complete de novo genome and epigenome.
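The QV50 figure quoted above follows the standard Phred scale, in which a quality value Q corresponds to an error probability of 10^(-Q/10). As a minimal sketch of that arithmetic (the function names here are illustrative and are not part of any PacBio software):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to the expected per-base accuracy."""
    error_probability = 10 ** (-qv / 10)
    return 1.0 - error_probability


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy (strictly below 1.0) to a Phred-scaled quality value."""
    error_probability = 1.0 - accuracy
    return -10 * math.log10(error_probability)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    print(f"Expected errors in a 5 Mb genome at QV50: {5_000_000 * 10 ** (-50 / 10):.0f}")
```

On this scale, a closed 5 Mb bacterial genome at QV50 would be expected to contain on the order of 50 erroneous bases.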
Heterozygous and highly polymorphic diploid (2n) and polyploid (n > 2) genomes have proven to be very difficult to assemble. One key to the successful assembly and phasing of polymorphic genomes is the very long read length (9-40 kb) provided by the PacBio RS II system. We recently released software and methods that facilitate the assembly and phasing of genomes with ploidy levels equal to or greater than 2n. In an effort to collaborate and spur on algorithm development for assembly and phasing of heterozygous polymorphic genomes, we have recently released sequencing datasets that can be used to test and develop assembly and phasing algorithms for highly polymorphic diploid and polyploid genomes. These data sets include multiple species and ecotypes of Arabidopsis that can be combined to create synthetic in-silico F1 hybrids with varying levels of heterozygosity. Because the sequence of each individual line was generated independently, the data set provides a ‘ground truth’ answer for the expected results, allowing the evaluation of assembly algorithms. The sequencing data, assemblies of inbred and in-silico heterozygous samples (n ≥ 2), and phasing statistics will be presented. The raw and processed data have been made available to aid other groups in the development of phasing and assembly algorithms.
While advances in RNA sequencing methods have accelerated our understanding of the human transcriptome, isoform discovery remains a challenge because short read lengths require complicated assembly algorithms to infer the contiguity of full-length transcripts. With PacBio’s long reads, one can now sequence full-length transcript isoforms up to 10 kb. The PacBio Iso-Seq protocol produces reads that originate from independent observations of single molecules, meaning no assembly is needed. Here, we sequenced the transcriptome of the human MCF-7 breast cancer cell line using the Clontech SMARTer® cDNA preparation kit and the PacBio RS II. Using PacBio Iso-Seq bioinformatics software, we obtained 55,770 unique, full-length, high-quality transcript sequences that were subsequently mapped back to the human genome with ≥ 99% accuracy. In addition, we identified both known and novel fusion transcripts. To assess our results, we compared the predicted ORFs from the PacBio data against a published mass spectrometry dataset from the same cell line. Of the proteins identified with the UniProt protein database, 84% were recovered by the PacBio predictions. Notably, 251 peptides matched solely to the PacBio-generated ORFs and were entirely novel, including abundant cases of single amino acid polymorphisms, cassette exon splicing, and potential alternative protein coding frames.
Making the most of long reads: towards efficient assemblers for reference quality, de novo reconstructions
2015 SMRT Informatics Developers Conference Presentation Slides: Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute delivered the keynote presentation. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.
Full-length sequencing of HLA class I genes of more than 1000 samples provides deep insights into sequence variability
Aim: The vast majority of donor typing relies on sequencing exons 2 and 3 of HLA class I genes (HLA-A, -B, -C). With such an approach, certain allele combinations do not result in the anticipated “high resolution” (G-code) typing, due to the lack of exon-phasing information. To resolve ambiguous typing results for a haplotype frequency project, we established a whole gene sequencing approach for HLA class I, which also facilitates an estimation of the degree of sequence variability outside the commonly sequenced exons. Methods: Primers were developed flanking the UTR regions, resulting in similar amplicon lengths of 4.2-4.4 kb. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene-specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e., annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48-72 amplicons were sequenced full length and phased in single runs on a Pacific Biosciences RS II instrument. Demultiplexing was achieved using the SMRT portal. Sequence analysis was performed using NGSengine software (GenDx). Results: We successfully performed full-length gene sequencing of 1003 samples, harboring ambiguous typings of either HLA-A (n=46), HLA-B (n=304) or HLA-C (n=653). Despite the high per-read raw error rates typical for SMRT sequencing (~15%), the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. Unambiguous allelic resolution was achieved for all samples. We observed novel intronic, exonic, and UTR sequence variations for many of the alleles covered by our data set. This included sequences of 600 individuals with the HLA-C*07:01/C*07:02 genotype, revealing the extent of sequence variation outside exons 2 and 3.
Conclusion: Here we present a whole gene amplification and sequencing approach for HLA class I genes. The maturity of this approach was demonstrated by sequencing more than 1000 samples, achieving fully phased allelic sequences. Extensive sequencing of one common allele combination hints at the yet-to-be-discovered diversity of the HLA system outside the commonly analyzed exons.
Purpose: Clinical laboratories, research laboratories and technology developers all need DNA samples with reliably known genotypes in order to help validate and improve their methods. The Genome in a Bottle Consortium (genomeinabottle.org) has been developing Reference Materials with high-accuracy whole genome sequences to support these efforts. Methodology: Our pilot reference material is based on Coriell sample NA12878 and was released in May 2015 as NIST RM 8398 (tinyurl.com/giabpilot). To minimize bias and improve accuracy, 11 whole-genome and 3 exome data sets produced using 5 different technologies were integrated using a systematic arbitration method. The Genome in a Bottle Analysis Group is adapting these methods and developing new methods to characterize 2 families, one Asian and one Ashkenazi Jewish, from the Personal Genome Project, which are consented for public release of sequencing and phenotype data. We have generated a larger and even more diverse data set on these samples, including high-depth Illumina paired-end and mate-pair, Complete Genomics, and Ion Torrent short-read data, as well as Moleculo, 10X, Oxford Nanopore, PacBio, and BioNano Genomics long-read data. We are analyzing these data to provide an accurate assessment of not just small variants but also large structural variants (SVs) in both “easy” regions of the genome and in some “hard” repetitive regions. We have also made all of the input data sources publicly available for download, analysis, and publication. Results: Our arbitration method produced a reference data set of 2,787,291 single nucleotide variants (SNVs), 365,135 indels, 2,744 SVs, and 2.2 billion homozygous reference calls for our pilot genome. We found that our call set is highly sensitive and specific in comparison to independent reference data sets.
We have also generated preliminary assemblies and structural variant calls for the next 2 trios from long read data and are currently integrating and validating these. Discussion: We combined the strengths of each of our input datasets to develop a comprehensive and accurate benchmark call set. In the short time it has been available, over 20 published or submitted papers have used our data. Many challenges exist in comparing to our benchmark calls, and thus we have worked with the Global Alliance for Genomics and Health to develop standardized methods, performance metrics, and software to assist in its use. Zook et al., Nat Biotech. 2014.
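To illustrate how a call set is typically scored against a benchmark such as the one described above, here is a minimal Python sketch. It assumes variants are already normalized and can be matched exactly as (chromosome, position, ref, alt) tuples. That assumption sidesteps the representation differences that make real comparisons challenging, and this is not the consortium's or the Global Alliance's software. It reports sensitivity and precision, the latter being the usual stand-in for specificity in variant benchmarking.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def score_calls(benchmark: Set[Variant], calls: Set[Variant]) -> Dict[str, float]:
    """Score a call set against a benchmark by exact matching of variant keys."""
    true_positives = len(benchmark & calls)
    false_negatives = len(benchmark - calls)
    false_positives = len(calls - benchmark)
    called = true_positives + false_positives
    expected = true_positives + false_negatives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        # Fraction of benchmark variants that were recovered.
        "sensitivity": true_positives / expected if expected else 0.0,
        # Fraction of reported variants that are in the benchmark.
        "precision": true_positives / called if called else 0.0,
    }


if __name__ == "__main__":
    benchmark = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
    calls = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr3", 10, "T", "C")}
    print(score_calls(benchmark, calls))
```

In practice, scoring should be restricted to the benchmark's high-confidence regions, since calls outside them cannot be judged either way.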
Structural Variants (SVs), which include deletions, insertions, duplications, inversions and chromosomal rearrangements, have been shown to affect organism phenotypes, including changing gene expression, increasing disease risk, and playing an important role in cancer development. Still, it remains challenging to detect all types of SVs from high-throughput sequencing data, and it is even harder to detect more complex SVs such as a duplication nested within an inversion. To overcome these challenges, we developed algorithms for SV analysis using longer third-generation sequencing reads. The increased read lengths allow us to span more complex SVs and accurately assess SVs in repetitive regions, two of the major limitations when using short Illumina data. Our enhanced open-source analysis method Sniffles accurately detects structural variants based on split read mapping and assessment of the alignments. Sniffles uses a self-balancing interval tree in combination with a plane sweep algorithm to manage and assess the identified SVs. Central to its high accuracy is its advanced scoring model that can distinguish erroneous alignments from true breakpoints flanking SVs. In experiments with simulated and real genomes (e.g., human breast cancer), we find that Sniffles outperforms all other SV analysis approaches in both the sensitivity of finding events and the specificity of those events. Sniffles is available at: https://github.com/fritzsedlazeck/Sniffles
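The general idea of a plane sweep over candidate intervals can be sketched in a few lines of Python. This is an illustrative simplification and not Sniffles' implementation: it assumes each read contributes one candidate SV interval on a single chromosome, processes the intervals in sorted order, and merges overlapping candidates into clusters while counting supporting reads. The function name and the half-open interval convention are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]        # (start, end), half-open coordinates
Cluster = Tuple[int, int, int]    # (start, end, number of supporting intervals)


def sweep_clusters(intervals: List[Interval]) -> List[Cluster]:
    """Merge overlapping candidate intervals in a single left-to-right sweep."""
    clusters: List[Cluster] = []
    for start, end in sorted(intervals):
        if clusters and start < clusters[-1][1]:
            # The interval overlaps the open cluster: extend it and add support.
            c_start, c_end, support = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), support + 1)
        else:
            # No overlap with the previous cluster: start a new one.
            clusters.append((start, end, 1))
    return clusters


if __name__ == "__main__":
    candidates = [(100, 200), (150, 250), (400, 500), (240, 260)]
    for start, end, support in sweep_clusters(candidates):
        print(f"candidate SV {start}-{end} supported by {support} read(s)")
```

A production caller would additionally keep SV type and breakpoint uncertainty per candidate, and would use an interval tree when candidates must be queried and updated as alignments stream in rather than sorted up front.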
MaSuRCA Mega-Reads Assembly Technique for haplotype resolved genome assembly of hybrid PacBio and Illumina Data
The developments in DNA sequencing technology over the past several years have enabled a large number of scientists to obtain sequences for the genomes of their interest at a fairly low cost. Illumina sequencing was the dominant whole genome sequencing technology over the past few years due to its low cost. The Illumina reads are short (up to 300 bp), and thus most of the draft genomes produced from Illumina data are very fragmented, which limits their usability in practical scenarios. Longer reads are needed for more contiguous genomes. Recently, PacBio made significant advances in developing cost-effective long-read (>10,000 bp) sequencing technology, and its data, although several times more expensive than Illumina data, can be used to produce high-quality genomes. PacBio data can be used for de novo assembly; however, due to its high error rate, high coverage of the genome is required, thus raising the cost barrier. A solution for cost-effective genomes is to combine PacBio and Illumina data, leveraging the low error rates of the short Illumina reads and the length of the PacBio reads. We have developed the MaSuRCA mega-reads assembler for efficient assembly of hybrid data sets, and we demonstrate that it performs well compared to the other published hybrid techniques. Another important benefit of the long reads is their ability to link the haplotype differences. The mega-reads approach corrects each PacBio read independently, and thus haplotype differences are preserved. Thus, leveraging the accuracy of the Illumina data and the length of the PacBio reads, MaSuRCA mega-reads can produce haplotype-resolved genome assemblies, where each contig has sequence from a single haplotype. We present preliminary results on haplotype-resolved genome assemblies of faux (proof-of-concept) and real data.
Scalability and reliability improvements to the Iso-Seq analysis pipeline enables higher throughput sequencing of full-length cancer transcripts
The characterization of gene expression profiles via transcriptome sequencing has proven to be an important tool for characterizing how genomic rearrangements in cancer affect the biological pathways involved in cancer progression and treatment response. More recently, better resolution of transcript isoforms has shown that this additional level of information may be useful in stratifying patients into cancer subtypes with different outcomes and responses to treatment [1]. The Iso-Seq protocol developed at PacBio is uniquely able to deliver full-length, high-quality cDNA sequences, allowing the unambiguous determination of splice variants, identifying potential biomarkers and yielding new insights into gene fusion events. Recent improvements to the Iso-Seq bioinformatics pipeline increase the speed and scalability of data analysis while boosting the reliability of isoform detection and cross-platform usability. Here we report an evaluation of Sequel Iso-Seq runs of human UHRR samples with spiked-in synthetic RNA controls and show that the new pipeline is more CPU efficient and recovers more human and synthetic isoforms while reducing the number of false positives. We also share the results of sequencing the well-characterized HCC-1954 breast cancer and normal breast cell lines, which will be made publicly available. Combined with the recent simplification of the Iso-Seq sample preparation [2], the new analysis pipeline completes a streamlined workflow for revealing the most comprehensive picture of transcriptomes at the throughput needed to characterize cancer samples.
FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)
Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. The study of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach (FALCON-Unzip) does not require parental information and separates PacBio reads during genome assembly using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual we scaffolded the FALCON-Phase contigs, and re-phased the contigs creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
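The read-partitioning step described for the trio-based approach can be illustrated with a small Python sketch. This is a toy simplification and not TrioCanu's code: it assumes error-free parental sequences given as plain strings, uses a very short k for readability, and assigns each child read to whichever parent contributes more parent-specific k-mers. All function and parameter names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(child_reads: Iterable[str],
              maternal_seqs: Iterable[str],
              paternal_seqs: Iterable[str],
              k: int = 5) -> Dict[str, List[str]]:
    """Partition child reads into maternal, paternal, and unassigned bins."""
    maternal = set().union(*(kmers(s, k) for s in maternal_seqs))
    paternal = set().union(*(kmers(s, k) for s in paternal_seqs))
    # Only k-mers seen in exactly one parent are informative for binning.
    maternal_only = maternal - paternal
    paternal_only = paternal - maternal

    bins: Dict[str, List[str]] = {"maternal": [], "paternal": [], "unassigned": []}
    for read in child_reads:
        read_kmers = kmers(read, k)
        maternal_hits = len(read_kmers & maternal_only)
        paternal_hits = len(read_kmers & paternal_only)
        if maternal_hits > paternal_hits:
            bins["maternal"].append(read)
        elif paternal_hits > maternal_hits:
            bins["paternal"].append(read)
        else:
            # Ties (including reads with no informative k-mers) stay unassigned.
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    result = bin_reads(
        child_reads=["ACGTACG", "GTTCGTA", "CGTAA"],
        maternal_seqs=["ACGTACGTAA"],
        paternal_seqs=["ACGTTCGTAA"],
    )
    print(result)
```

Real data would require longer k-mers, filtering of k-mers that arise from sequencing errors, and tolerance for the error rate of the long reads themselves.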
Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with short-read technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and lack of GC bias in SMRT Sequencing have enabled better gene finding. However, since long-read assembly requires high coverage for error correction, the benefits of unbiased coverage have in the past been lost for low-abundance species. SMRT Sequencing performance improvements and the introduction of the Sequel II System have enabled a new, high-throughput data type uniquely suited to metagenome characterization: HiFi reads. HiFi reads combine high accuracy with read lengths up to 15 kb, eliminating the need for assembly for most microbiome applications, including functional profiling, gene discovery, and metabolic pathway reconstruction. Here we present the application of the HiFi data type to enable a new method of analyzing metagenomes that does not require assembly.
Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
Recent work comparing metagenomic sequencing methods indicates that a comprehensive picture of the taxonomic and functional diversity of complex communities will be difficult to achieve with one sequencing technology alone. While the lower cost of short reads has enabled greater sequencing depth, the greater contiguity of long-read assemblies and lack of GC bias in SMRT Sequencing have enabled better gene finding. However, since long-read assembly typically requires high coverage for error correction, these benefits have in the past been lost for low-abundance species. The introduction of the Sequel II System has enabled a new, higher-throughput, assembly-optional data type that addresses these challenges: HiFi reads. HiFi reads combine QV20 accuracy with long read lengths, eliminating the need for assembly for most metagenome applications, including gene discovery and metabolic pathway reconstruction. In fact, the read lengths and accuracy of HiFi data match or outperform the quality metrics of most metagenome assemblies, enabling cost-effective recovery of intact genes and operons while omitting the resource-intensive and data-inefficient assembly step. Here we present the application of HiFi sequencing to both mock and human fecal samples using full-length 16S and shotgun methods. This proof-of-concept work demonstrates the unique strengths of the HiFi method. First, the high correspondence between the expected community composition and the 16S and shotgun profiling data reflects low context bias. In addition, every HiFi read yields ~5-8 predicted genes, without assembly, using standard tools. If assembly is desired, excellent results can be achieved with Canu and contig binning tools. In summary, HiFi sequencing is a new, cost-effective option for high-resolution functional profiling of metagenomes that complements existing short-read workflows.
PacBio SMRT Sequencing is fast changing the genomics space with its long reads and high consensus sequence accuracy, providing the most comprehensive view of the genome and transcriptome. In this…
User Group Meeting: Unbiased characterization of metagenome composition and function using HiFi sequencing on the PacBio Sequel II System
In this PacBio User Group Meeting presentation, PacBio scientist Meredith Ashby shared several examples of analysis — from full-length 16S sequencing to shotgun sequencing — showing how SMRT Sequencing enables…