Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods which provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA Sequencing with short reads (SMRT CCS (circular consensus) or second-generation reads), wherein the short reads are used to error-correct the long reads which are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which SMRT sequencing reads from a single long insert library are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT Sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. With relatively short sequencing run times and automated analysis pipelines, it is possible to go from an unknown DNA sample to its complete de novo genome and epigenome in about a day.
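The QV50 figure quoted in this abstract is a Phred-scaled quality value, where QV = -10 · log10(error probability). The short Python sketch below shows the conversion between QV and per-base accuracy. The function names and the 5 Mb genome length are illustrative assumptions, not values from the study.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to the fraction of correct bases."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a fraction of correct bases (< 1.0) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(genome_length: int, qv: float) -> float:
    """Expected number of erroneous bases in a consensus of the given length."""
    return genome_length * 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    # Hypothetical 5 Mb bacterial genome (assumed size, for illustration only).
    print(f"Expected errors at QV50: {expected_errors(5_000_000, 50):.1f}")
```

Under these assumptions, a closed genome at QV50 would be expected to carry on the order of tens of residual base errors per 5 Mb, which is why the abstract pairs QV50 with ">99.999%".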
Heterozygous and highly polymorphic diploid (2n) and polyploid (n > 2) genomes have proven to be very difficult to assemble. One key to the successful assembly and phasing of polymorphic genomes is the very long read length (9-40 kb) provided by the PacBio RS II system. We recently released software and methods that facilitate the assembly and phasing of genomes with ploidy levels equal to or greater than 2n. In an effort to collaborate and spur algorithm development for assembly and phasing of heterozygous polymorphic genomes, we have also released sequencing datasets that can be used to test and develop highly polymorphic diploid and polyploid assembly and phasing algorithms. These data sets include multiple species and ecotypes of Arabidopsis that can be combined to create synthetic in-silico F1 hybrids with varying levels of heterozygosity. Because the sequence of each individual line was generated independently, the data set provides a ‘ground truth’ answer for the expected results, allowing the evaluation of assembly algorithms. The sequencing data, assembly of inbred and in-silico heterozygous samples (n ≥ 2), and phasing statistics will be presented. The raw and processed data have been made available to aid other groups in the development of phasing and assembly algorithms.
Highly sensitive, non-invasive detection of colorectal cancer mutations using single molecule, third generation sequencing.
Colorectal cancer (CRC) represents one of the most prevalent and lethal malignant neoplasms and every individual of age 50 and above should undergo regular CRC screening. Currently, the most effective procedure to detect adenomas, the precursors to CRC, is colonoscopy, which reduces CRC incidence by 80%. However, it is an invasive approach that is unpleasant for the patient, expensive, and poses some risk of complications such as colon perforation. A non-invasive screening approach with detection rates comparable to those of colonoscopy has not yet been established. The current study applies Pacific Biosciences third generation, single molecule sequencing to the inspection of CRC-driving mutations. Our approach combines the screening power and the extremely high accuracy of circular consensus (CCS) third generation sequencing with the non-invasiveness of using stool DNA to detect CRC-associated mutations present at extremely low frequencies and establishes a foundation for a non-invasive, highly sensitive assay to screen the population for CRC and early stage adenomas. We performed a series of experiments using a pool of fifteen amplicons covering the genes most frequently mutated in CRC (APC, Beta Catenin, KRAS, BRAF, and TP53), ensuring a theoretical screening coverage of over 97% for both CRC and adenomas. The assay was able to detect mutations in DNA isolated from stool samples from patients diagnosed with CRC at frequencies below 0.5 % with no false positives. The mutations were then confirmed by sequencing DNA isolated from the excised tumor samples. Our assay should be sensitive enough to allow the early identification of adenomatous polyps using stool DNA as analyte. In conclusion, we have developed an assay to detect mutations in the genes associated with CRC and adenomas using Pacific Biosciences RS Single Molecule, Real Time Circular Consensus Sequencing (SMRT-CCS). 
With no systematic bias and a much higher raw base-calling quality (CCS) compared to other sequencing methods, the assay was able to detect mutations in stool DNA at frequencies below 0.5 % with no false positives. This level of sensitivity should be sufficient to allow the detection of most adenomatous polyps using stool DNA as analyte, a feature that would make our approach the first non-invasive assay with a sensitivity comparable to that of colonoscopy and a strong candidate for the non-invasive preventive CRC screening of the general population.
Full-length sequencing of HLA class I genes of more than 1000 samples provides deep insights into sequence variability
Aim: The vast majority of donor typing relies on sequencing exons 2 and 3 of HLA class I genes (HLA-A, -B, -C). With such an approach certain allele combinations do not result in the anticipated “high resolution” (G-code) typing, due to the lack of exon-phasing information. To resolve ambiguous typing results for a haplotype frequency project, we established a whole gene sequencing approach for HLA class I, facilitating also an estimation of the degree of sequence variability outside the commonly sequenced exons. Methods: Primers were developed flanking the UTR regions resulting in similar amplicon lengths of 4.2-4.4 kb. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e. annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48-72 amplicons were sequenced full length and phased in single runs on a Pacific Biosciences RSII instrument. Demultiplexing was achieved using the SMRT portal. Sequence analysis was performed using NGSengine software (GenDx). Results: We successfully performed full-length gene sequencing of 1003 samples, harboring ambiguous typings of either HLA-A (n=46), HLA-B (n=304) or HLA-C (n=653). Despite the high per-read raw error rates typical for SMRT sequencing (~15%) the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. Unambiguous allelic resolution was achieved for all samples. We observed novel intronic, exonic as well as UTR sequence variations for many of the alleles covered by our data set. This included sequences of 600 individuals with HLA-C*07:01/C*07:02 genotype revealing the extent of sequence variation outside the exons 2 and 3. 
Conclusion: Here we present a whole gene amplification and sequencing approach for HLA class I genes. The maturity of this approach was demonstrated by sequencing more than 1000 samples, achieving fully phased allelic sequences. Extensive sequencing of one common allele combination hints at the yet-to-be-discovered diversity of the HLA system outside the commonly analyzed exons.
Aim: In contrast to exon-based HLA-typing approaches, whole gene genotyping crucially depends on full-length sequences submitted to the IMGT/HLA Database. Currently, full-length sequences are provided for only 7 out of 520 HLA-DPB1 alleles. Therefore, we developed a fully phased whole-gene sequencing approach for DPB1, to facilitate further exploration of the allelic structure at this locus. Methods: Primers were developed flanking the UTR-regions of DPB1 resulting in a 12 kb amplicon. Using a 4-primer approach, secondary primers containing barcodes were combined with the gene-specific primers to obtain barcoded full-gene amplicons in a single amplification step. Amplicons were pooled, purified, and ligated to SMRT bells (i.e. annealing points for sequencing primers) following standard protocols from Pacific Biosciences. Taking advantage of the SMRT chemistry, pools of 48 amplicons were sequenced full length in single runs on a Pacific Biosciences RSII instrument. Demultiplexing was performed using the SMRT portal. Sequence analysis was performed using the NGSengine software (GenDx). Results: We analyzed a set of 48 randomly picked samples. With 3 exceptions due to PCR failure, all genotype assignments conformed to standard genotyping results based on exons 2 and 3. Allelic proportions for heterozygous positions were evenly distributed (range 0.4 – 0.6) for all samples, suggesting unbiased amplifications. Despite the high per-read raw error rates typical for SMRT sequencing (~15%) the consensus sequence proved highly reliable. All consensus sequences for exons 2 and 3 were in full accordance with their MiSeq-derived sequences. We describe novel intronic sequence variation of the 7 so far genomically defined alleles, as well as 7 whole-length DPB1 alleles with hitherto unknown intronic regions. One of these alleles (HLA-DPB1*131:01) is classified as rare. 
Conclusion: Here we present a whole gene amplification and sequencing workflow for DPB1 alleles utilizing single molecule real-time (SMRT) sequencing from Pacific Biosciences. Validation of consensus sequences against known exonic sequences highlights the reliability of this technology. This workflow will facilitate amending the IMGT/HLA Database for DPB1.
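The abstract above uses allelic proportions at heterozygous positions (range 0.4 to 0.6) as evidence of unbiased amplification. A minimal Python sketch of such a balance check follows. The read counts are made up for illustration, and the function names are hypothetical rather than part of the SMRT portal or NGSengine software.

```python
def allele_fraction(allele_a_reads: int, allele_b_reads: int) -> float:
    """Fraction of reads at a heterozygous position that support allele B."""
    total = allele_a_reads + allele_b_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return allele_b_reads / total


def is_balanced(allele_a_reads: int, allele_b_reads: int,
                low: float = 0.4, high: float = 0.6) -> bool:
    """True if the allele fraction lies within the [low, high] window.

    The default window mirrors the 0.4 to 0.6 range reported in the abstract.
    """
    return low <= allele_fraction(allele_a_reads, allele_b_reads) <= high


if __name__ == "__main__":
    # Hypothetical read counts (allele A, allele B) at three positions.
    for a, b in [(52, 48), (61, 39), (90, 10)]:
        print(a, b, round(allele_fraction(a, b), 2), is_balanced(a, b))
```

A position such as (90, 10) would fall outside the window and could point to preferential amplification of one allele, although the abstract does not describe how such cases were handled.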
Analysis of 37,000 Caucasian samples reveals tight linkage between SNP rs9277534 and high resolution typing of HLA-DPB1
HLA-DPB1 mismatching between patients and unrelated donors is known to increase the risk of acute graft-versus-host-disease (GvHD) after hematopoietic stem cell transplantation. If only HLA-DPB1 mismatched donors are available, the genotype defined by the Single Nucleotide Polymorphism (SNP) rs9277534 can be used to select mismatched donors that are well-tolerated. However, since rs9277534 resides within the 3’ untranslated region (UTR), it usually is not analyzed during DPB1 routine typing.
Structural variants (SVs), genomic differences ≥50 base pairs, are few by count compared to single nucleotide variants (SNVs) and indels but include most of the base pairs that differ between two humans.
Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvement in library construction enable average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single molecule reads. The resulting reads have accuracy comparable to short-read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the Canu assembly algorithm, generating draft assemblies of very high contiguity equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clone 6 and Clone 47 to identify single nucleotide polymorphisms (SNPs) and structural variants (SVs) that are specific to each of the three samples.
In this presentation, Sonja Vernes of the Max Planck Institute shares her work with the Bat1K project, which aims to catalog the genetic diversity of all living bat species. She…
User Group Meeting: Getting a read on one little worm with PacBio’s low DNA input workflow and the Agilent Femto Pulse System
In this PacBio User Group Meeting presentation, Erin Bernberg from the University of Delaware reports on using the Agilent Femto Pulse System for high-resolution, highly sensitive fragment analysis and on…
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.
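The comparison above relies on NG50, the contig length at which the cumulative length of contigs, sorted longest first, reaches half of an assumed genome (or region) size. The Python sketch below illustrates the calculation on toy contig lengths and reproduces the fold change from the two NG50 values quoted in the abstract. The toy lengths and function name are illustrative assumptions.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Return the NG50 of an assembly.

    Contigs are sorted longest first; the NG50 is the length of the contig
    at which the running total first reaches half of genome_size.
    Returns 0 if the assembly covers less than half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths and genome size (not data from the study).
    print(ng50([100, 80, 60, 40, 20], genome_size=400))

    # Fold change between the pericentromeric NG50 values in the abstract.
    hifi_kbp, clr_kbp = 480.6, 191.5
    print(f"fold change: {hifi_kbp / clr_kbp:.2f}")
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, so assemblies of different total length can be compared on the same scale.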
Characterization of Reference Materials for Genetic Testing of CYP2D6 Alleles: A GeT-RM Collaborative Project.
Pharmacogenetic testing increasingly is available from clinical and research laboratories. However, only a limited number of quality control and other reference materials currently are available for the complex rearrangements and rare variants that occur in the CYP2D6 gene. To address this need, the Division of Laboratory Systems, CDC-based Genetic Testing Reference Material Coordination Program, in collaboration with members of the pharmacogenetic testing and research communities and the Coriell Cell Repositories (Camden, NJ), has characterized 179 DNA samples derived from Coriell cell lines. Testing included the recharacterization of 137 genomic DNAs that were genotyped in previous Genetic Testing Reference Material Coordination Program studies and 42 additional samples that had not been characterized previously. DNA samples were distributed to volunteer testing laboratories for genotyping using a variety of commercially available and laboratory-developed tests. These publicly available samples will support the quality-assurance and quality-control programs of clinical laboratories performing CYP2D6 testing. Published by Elsevier Inc.
Biochemical characterization of a novel cold-adapted agarotetraose-producing α-agarase, AgaWS5, from Catenovulum sediminis WS1-A.
Although many β-agarases that hydrolyze the β-1,4 linkages of agarose have been biochemically characterized, only three α-agarases that hydrolyze the α-1,3 linkages have been reported to date. In this study, a new α-agarase, AgaWS5, from Catenovulum sediminis WS1-A, a new agar-degrading marine bacterium, was biochemically characterized. AgaWS5 belongs to the glycoside hydrolase (GH) 96 family. AgaWS5 consists of 1295 amino acids (140 kDa) and has 65% identity to an α-agarase, AgaA33, obtained from the agar-degrading bacterium Thalassomonas agarivorans JAMB-A33. AgaWS5 showed maximum activity at pH 8 and 40 °C. AgaWS5 showed cold tolerance, retaining more than 40% of its maximum enzymatic activity at 10 °C. AgaWS5 is predicted to have several calcium-binding sites. Consistent with this, its activity was slightly enhanced in the presence of Ca2+ and was strongly inhibited by EDTA. The Km and Vmax of AgaWS5 for agarose were 10.6 mg/mL and 714.3 U/mg, respectively. Agarose liquefaction, thin layer chromatography, and mass and NMR spectroscopic analyses demonstrated that AgaWS5 is an endo-type α-agarase that degrades agarose and mainly produces agarotetraose. Thus, in this study, a novel cold-adapted GH96 agarotetraose-producing α-agarase was identified.
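The Km and Vmax values reported above are the two parameters of the Michaelis-Menten rate law, v = Vmax · [S] / (Km + [S]). Assuming AgaWS5 follows simple Michaelis-Menten kinetics (implied by the reported parameters, though the abstract does not show the fit), the Python sketch below evaluates the predicted rate at a few agarose concentrations. The concentrations chosen are illustrative.

```python
KM_MG_PER_ML = 10.6    # Km for agarose, from the abstract
VMAX_U_PER_MG = 714.3  # Vmax, from the abstract


def michaelis_menten_rate(substrate_mg_per_ml: float,
                          km: float = KM_MG_PER_ML,
                          vmax: float = VMAX_U_PER_MG) -> float:
    """Predicted reaction rate (U/mg) at a given substrate concentration."""
    if substrate_mg_per_ml < 0:
        raise ValueError("substrate concentration must be non-negative")
    return vmax * substrate_mg_per_ml / (km + substrate_mg_per_ml)


if __name__ == "__main__":
    # Illustrative agarose concentrations in mg/mL (not from the study).
    for s in (1.0, 10.6, 50.0):
        print(f"[S] = {s:5.1f} mg/mL -> v = {michaelis_menten_rate(s):6.1f} U/mg")
```

By definition, the rate at [S] = Km is half of Vmax, so at 10.6 mg/mL agarose the model predicts roughly 357 U/mg.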
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90 % of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3 %, and genotype concordance with manual curation was >98.7 %. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
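The inclusion rule stated in this abstract (calls of at least 50 bp discovered by at least 2 technologies or 5 callsets) can be expressed as a simple filter. The Python sketch below is only an illustration of that rule, not the GIAB integration pipeline; the `CandidateSV` record, its field names, and the example callset names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSV:
    """Hypothetical record for one merged insertion or deletion call."""
    size_bp: int
    technologies: frozenset  # e.g. {"short-read", "long-read"}
    callsets: frozenset      # names of callsets reporting the variant


def passes_support_rule(sv: CandidateSV,
                        min_size_bp: int = 50,
                        min_technologies: int = 2,
                        min_callsets: int = 5) -> bool:
    """Apply the rule: size >= 50 bp AND (>= 2 technologies OR >= 5 callsets)."""
    if sv.size_bp < min_size_bp:
        return False
    return (len(sv.technologies) >= min_technologies
            or len(sv.callsets) >= min_callsets)


if __name__ == "__main__":
    # Made-up candidates for illustration.
    two_tech = CandidateSV(120, frozenset({"short-read", "long-read"}),
                           frozenset({"cs1", "cs2"}))
    one_tech_few = CandidateSV(300, frozenset({"long-read"}),
                               frozenset({"cs1", "cs2"}))
    too_small = CandidateSV(30, frozenset({"short-read", "long-read"}),
                            frozenset({"cs1", "cs2", "cs3", "cs4", "cs5"}))
    for sv in (two_tech, one_tech_few, too_small):
        print(sv.size_bp, passes_support_rule(sv))
```

The abstract also requires calls to be isolated, sequence-resolved, and genotyped by long reads; those steps are not modeled here.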
Cupriavidus sp. strain Ni-2 resistant to high concentration of nickel and its genes responsible for the tolerance by genome comparison.
The widespread use of metals has led many researchers to examine the relationship between heavy metal toxicity and bacterial resistance. In this study, we inoculated heavy metal-contaminated soil from the Janghang region of South Korea into nickel-containing media (20 mM Ni2+) for enrichment. Among dozens of colonies obtained after several transfers and serial dilutions at the same Ni concentration, strain Ni-2 was chosen for further study. The isolates were assigned phylogenetic affiliations using 16S rRNA gene analysis. Strain Ni-2 was closely related to Cupriavidus metallidurans and was found to be resistant to the antibiotics vancomycin, erythromycin, chloramphenicol, ampicillin, gentamicin, streptomycin, and kanamycin by the disk diffusion method. Of the isolated strains, Ni-2 was selected for whole-genome sequencing because its Ni resistance appeared to be greater than that of the other strains. The genome sequence contained a total of 89 metal-resistance-related genes, including 11 Ni-resistance genes, 41 heavy metal (As, Cd, Zn, Hg, Cu, and Co)-resistance genes, 22 cation-efflux genes, 4 metal pumping ATPase genes, and 11 metal transporter genes.