June 1, 2021  |  

New discoveries from closing Salmonella genomes using Pacific Biosciences continuous long reads.

The newer hierarchical genome assembly process (HGAP) performs de novo assembly using data from a single PacBio long insert library. To assess the benefits of this method, DNA from several Salmonella enterica serovars was isolated from a pure culture. Genome sequencing was performed using Pacific Biosciences RS sequencing technology. The HGAP process enabled us to close sixteen Salmonella subsp. enterica genomes and their associated mobile elements: The ten serotypes include: Salmonella enterica subsp. enterica serovar Enteritidis (S. Enteritidis) S. Bareilly, S. Heidelberg, S. Cubana, S. Javiana and S. Typhimurium, S. Newport, S. Montevideo, S. Agona, and S. Tennessee. In addition, we were able to detect novel methyltransferases (MTases) by using the Pacific Biosciences kinetic score distributions showing that each serovar appears to have a novel methylation pattern. For example while all Salmonella serovars examined so far have methylase specific activity for 5’-GATC-3’/3’-CTAG-5’ and 5’-CAGAG-3’/3’-GTCTC-5’ (underlined base indicates a modification), S. Heidelberg is uniquely specific for 5’-ACCANCC-3’/3’-TGGTNGG-5’, while S. Typhimurium has uniquely methylase specific for 5′-GATCAG-3’/3′- CTAGTC-5′ sites, for the samples examined so far. We believe that this may be due to the unique environments and phages that these serotypes have been exposed to. Furthermore, our analysis identified and closed a variety of plasmids such as mobilization plasmids, antimicrobial resistance plasmids and IncX plasmids carrying a Type IV secretion system (T4SS). The VirB/D4 T4SS apparatus is important in that it assists with rapid dissemination of antibiotic resistance and virulence determinants. Presently, only limited information exists regarding the genotypic characterization of drug resistance in S. Heidelberg isolates derived from various host species. Here, we characterize two S. Heidelberg outbreak isolates from two different outbreaks. Both isolates contain the IncX plasmid of approximately 35 kb, and carried the genes virB1, virB2, virB3/4, virB5, virB6, virB7, virB8, virB9, virB10, virB11, virD2, and virD4, that are associated with the T4SS. In addition, the outbreak isolate associated with ground turkey carries a 4,473 bp mobilization plasmid and an incompatibility group (Inc) I1 antimicrobial resistance plasmid encoding resistance to gentamicin (aacC2), beta-lactam (bl2b_tem), streptomycin (aadAI) and tetracycline (tetA, tetR) while the outbreak isolate associated with chicken breast carries the IncI1 plasmid encoding resistance to gentamicin (aacC2), streptomycin (aadAI) and sulfisoxazole (sul1). Using this new technology we explored the genetic elements present in resistant pathogens which will achieve a better understanding of the evolution of Salmonella.


June 1, 2021  |  

SMRT Sequencing solutions for investigative studies to understand evolutionary processes.

Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed by the coverage biases associated with short- read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obsolete the need for assembly. With direct sequencing of DNA in real-time, this has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.


June 1, 2021  |  

The resurgence of reference quality genome sequence.

Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly have dropped precipitately, which has spurred interest in genome sequencing overall. Unfortunately the contiguity of the NGS assembled sequences, as well as the accuracy of these assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove to be difficult to properly annotate. In some cases the contiguous portions are smaller than the average gene size making the sequence not nearly as useful for biologists as the earlier reference quality genomes including of Human, Mouse, C. elegans, or Drosophila. Recently, new 3rd generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.


June 1, 2021  |  

Access full spectrum of polymorphisms in HLA class I & II genes, without imputation for disease association and evolutionary research.

MHC class I and II genes are critically monitored by high-resolution sequencing for organ transplant decisions due to their role in GVHD. Their direct or linkage-based causal association, have increased their prominence as targets for drug sensitivity, autoimmune, cancer and infectious disease research. Monitoring HLA genes can however be tricky due to their highly polymorphic nature. Allele-level resolution is thus strongly preferred. However, most studies were historically focused on peptide binding domains of the HLA genes, due to technological challenges. As a result knowledge about the functional role of polymorphisms outside of exons 2 and 3 of HLA genes was rather limited. There are also relatively few full-length gene references currently available in the IMGT HLA database. This made it difficult to quickly adopt high-throughput reference-reliant methods for allele-level HLA sequencing. Increasing awareness regarding role of regulatory region polymorphisms of HLA genes in disease association1, nonetheless have brought about a revolution in full-length HLA gene sequencing. Researchers are now exploring ways to obtain complete information for HLA genes and integrate it with the current HLA database so it can be interpreted used by clinical researchers. We have explored advantages of SMRT Sequencing to obtain fully phased, allele-specific sequences of HLA class I and II genes for 96 samples using completely De novo consensus generation approach for imputation-free 4-field typing. With long read lengths (average >10 kb) and consensus accuracy exceeding 99.999% (Q50), a comprehensive snapshot of variants in exons, introns and UTRs could be obtained for spectrum of polymorphisms in phase across SNP-poor regions. Such information can provide invaluable insights in future causality association and population diversity research.


June 1, 2021  |  

A comprehensive lincRNA analysis: From conifers to trees

We have produced an updated annotation of the Norway spruce genome on the basis of an in siliconormalised set of RNA-Seq data obtained from 1,529 samples and comprising 15.5 billion paired-end Illumina HiSeq reads complemented by 18Mbp of PacBio cDNA data (3.2M sequences). In addition to augmenting and refining the previous protein coding gene annotation, here we focus on the addition of long intergenic non-coding RNA (lincRNA) and micro RNA (miRNA) genes. In addition to non-coding loci, our analyses also identified protein coding genes that had been missed by the initial genome annotation and enabled us to update the annotation of existing gene models. In particular, splice variant information, as supported by PacBio sequencing reads, has been added to the current annotation and previously fragmented gene models have been merged by scaffolding disjoint genomic scaffolds on the basis of transcript evidence. Using this refined annotation, a targeted analysis of the lincRNAs enabled their classification as i) deeply conserved, ii) conserved in seed plants iii) gymnosperm/conifer specific. Concurrently, complementary analyses were performed as part of the aspen genome project and the results of a comparative analysis of the lincRNAs conserved in both Norway spruce and Eurasian aspen enabled us to identify conserved and diverged expression profiles. At present, we are delving further into the expression results with the aim to functionally annotate the lincRNA genes, by developing a co-expression network analyses based GO annotation.


June 1, 2021  |  

Multiplex target enrichment using barcoded multi-kilobase fragments and probe-based capture technologies

Target enrichment capture methods allow scientists to rapidly interrogate important genomic regions of interest for variant discovery, including SNPs, gene isoforms, and structural variation. Custom targeted sequencing panels are important for characterizing heterogeneous, complex diseases and uncovering the genetic basis of inherited traits with more uniform coverage when compared to PCR-based strategies. With the increasing availability of high-quality reference genomes, customized gene panels are readily designed with high specificity to capture genomic regions of interest, thus enabling scientists to expand their research scope from a single individual to larger cohort studies or population-wide investigations. Coupled with PacBio® long-read sequencing, these technologies can capture 5 kb fragments of genomic DNA (gDNA), which are useful for interrogating intronic, exonic, and regulatory regions, characterizing complex structural variations, distinguishing between gene duplications and pseudogenes, and interpreting variant haplotyes. In addition, SMRT® Sequencing offers the lowest GC-bias and can sequence through repetitive regions. We demonstrate the additional insights possible by using in-depth long read capture sequencing for key immunology, drug metabolizing, and disease causing genes such as HLA, filaggrin, and cancer associated genes.


June 1, 2021  |  

A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing

Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual risk variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430). Long-read sequencing together with hybrid-capture targeting technologies provides a powerful combination to target candidate genes/transcripts of interest. Shearing the genomic DNA to ~5 kb fragments and then capturing with probes that span the whole gene(s) of interest can provide uniform coverage across the entire region, identifying variants and allowing for phasing into two haplotypes. Furthermore, capturing full-length cDNA from the same sample using the same capture probes can also provide an understanding of isoforms that are generated and allow them to be assigned to their corresponding haplotype. Here we present a method for capturing genomic DNA and cDNA from an AD sample using a panel of probes targeting approximately 20 late-onset AD candidate genes which includes CLU, ABCA7, CD33, TREM2, TOMM40, PSEN2, APH1 and BIN1. By combining xGen® Lockdown® probes with SMRT Sequencing, we provide completely sequenced candidate genes as well as their corresponding transcripts. In addition, we are also able to evaluate structural variants that due to their size, repetitive nature, or low sequence complexity have been un-sequenceable using short-read technologies.


June 1, 2021  |  

Phased diploid genome assembly with single-molecule real-time sequencing

While genome assembly projects have been successful in many haploid and inbred species, the assembly of non-inbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.


June 1, 2021  |  

Screening and characterization of causative structural variants for bipolar disorder in a significantly linked chromosomal region onXq24-q27 in an extended pedigree from a genetic isolate

Bipolar disorder (BD) is a phenotypically and genetically complex and debilitating neurological disorder that affects 1% of the worldwide population. There is compelling evidence from family, twin and adoption studies supporting the involvement of a genetic predisposition in BD with estimated heritability up to ~ 80%. The risk in first-degree relatives is ten times higher than in the general population. Linkage and association studies have implicated multiple putative chromosomal loci for BP susceptibility, however no disease genes have been identified to date.


June 1, 2021  |  

A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing

Alzheimer’s disease (AD) is a devastating neurodegenerative disease that is genetically complex. Although great progress has been made in identifying fully penetrant mutations in genes such as APP, PSEN1 and PSEN2 that cause early-onset AD, these still represent a very small percentage of AD cases. Large-scale, genome-wide association studies (GWAS) have identified at least 20 additional genetic risk loci for the more common form of late-onset AD. However, the identified SNPs are typically not the actual causal variants, but are in linkage disequilibrium with the presumed causative variant (Van Cauwenberghe C, et al., The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet Med 2015;18:421-430).


June 1, 2021  |  

Screening for causative structural variants in neurological disorders using long-read sequencing

Over the past decades neurological disorders have been extensively studied producing a large number of candidate genomic regions and candidate genes. The SNPs identified in these studies rarely represent the true disease-related functional variants. However, more recently a shift in focus from SNPs to larger structural variants has yielded breakthroughs in our understanding of neurological disorders.Here we have developed candidate gene screening methods that combine enrichment of long DNA fragments with long-read sequencing that is optimized for structural variation discovery. We have also developed a novel, amplification-free enrichment technique using the CRISPR/Cas9 system to target genomic regions.We sequenced gDNA and full-length cDNA extracted from the temporal lobe for two Alzheimer’s patients for 35 GWAS candidate genes. The multi-kilobase long reads allowed for phasing across the genes and detection of a broad range of genomic variants including SNPs to multi-kilobase insertions, deletions and inversions. In the full-length cDNA data we detected differential allelic isoform complexity, novel exons as well as transcript isoforms. By combining the gDNA data with full-length isoform characterization allows to build a more comprehensive view of the underlying biological disease mechanisms in Alzheimer’s disease. Using the novel PCR-free CRISPR-Cas9 enrichment method we screened several genes including the hexanucleotide repeat expansion C9ORF72 that is associated with 40% of familiar ALS cases. This method excludes any PCR bias or errors from an otherwise hard to amplify region as well as preserves the basemodication in a single molecule fashion which allows you to capture mosaicism present in the sample.


June 1, 2021  |  

Improving the reference with a diversity panel of sequence-resolved structural variation

Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we sequenced a diversity panel of nine human genomes to high depth using long-read, single-molecule, real-time sequencing data. Systematically identifying and merging SVs =50 bp in length for these nine and one public genome yielded 83,909 sequence-resolved insertions, deletions, and inversions. Among these, 2,839 (2.0 Mbp) are shared among all discovery genomes with an additional 13,349 (6.9 Mbp) present in the majority of humans, indicating minor alleles or errors in the reference, which is partially explained by an enrichment for GC-content and repetitive DNA. Genotyping 83% of these in 290 additional genomes confirms that at least one allele of the most common SVs in unique euchromatin are now sequence-resolved. We observe a 9-fold increase within 5 Mbp of chromosome telomeric ends and correlation with de novo single-nucleotide variant mutations showing that such variation is nonrandomly distributed defining potential hotspots of mutation. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. To illustrate the utility of sequence-resolved SVs in resequencing experiments, we mapped 30 diverse high-coverage Illumina-sequenced samples to GRCh38 with and without contigs containing SV insertions as alternate sequences, and we found these additional sequences recover 6.4% of unmapped reads. For reads mapped within the SV insertion, 25.7% have a better mapping quality, and 18.7% improved by 1,000-fold or more. We reveal 72,964 occurrences of 15,814 unique variants that were not discoverable with the reference sequence alone, and we note that 7% of the insertions contain an SV in at least one sample indicating that there are additional alleles in the population that remain to be discovered. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. We present a summary of our findings and discuss ideas for revealing variation that was once difficult to ascertain.


June 1, 2021  |  

Comparative metagenome-assembled genome analysis of “Candidatus Lachnocurva vaginae”, formerly known as Bacterial Vaginosis Associated bacterium – 1 (BVAB1)

Bacterial Vaginosis Associated bacterium 1 (BVAB1) is an as-yet uncultured bacterial species found in the human vagina that belongs to the family Lachnospiraceae within the order Clostridiales. As its name suggests, this bacterium is often associated with bacterial vaginosis (BV), a common vaginal disorder that has been shown to increase a woman’s risk for HIV, Chlamydia trachomatis, and Neisseria gonorrhoeae infections as well as preterm birth. Further, BVAB1 is associated with the persistence of BV following metronidazole treatment, increased vaginal inflammation, and adverse obstetrics outcomes. There is no available complete genome sequence of BVAB1, which has made it di?cult to mechanistically understand its role in disease. We present here a circularized metagenome-assembled genome (cMAG) of B VAB1 as well as a comparative analysis including an additional six metagenome-assembled genomes (MAGs) of this species. These sequences were derived from cervicovaginal samples of seven separate women. The cMAG is 1.649 Mb in size and encodes 1,578 genes. We propose to rename BVAB1 to “Candidatus Lachnocurva vaginae” based on phylogenetic analyses, and provide genomic evidence that this candidate species may metabolize D-lactate, produce trimethylamine (one of the chemicals responsible for BV-associated odor), and be motile. The cMAG and the six MAGs are valuable resources that will further contribute to our understanding of the heterogeneous etiology of bacterial vaginosis.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.