Comparative analysis Archives

June 1, 2021 |

Evaluating the potential of new sequencing technologies for genotyping and variation discovery in human data.

A first look at Pacific Biosciences RS data. Pacific Biosciences technology provides a fundamentally new data type with the potential to overcome these limitations by providing significantly longer reads (now averaging >1 kb), enabling more unique seeds for reference alignment. In addition, the lack of amplification in the library construction step avoids a common source of base composition bias. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical resequencing projects by assessing the quality of the raw sequencing data, as well as its use for SNP discovery and genotyping using the Genome Analysis Toolkit (GATK).

June 1, 2021 |

Resolving the ‘dark matter’ in genomes.

Second-generation sequencing has brought about tremendous insights into the genetic underpinnings of biology. However, there are many functionally important and medically relevant regions of genomes that are currently difficult or impossible to sequence, resulting in incomplete and fragmented views of genomes. Two main causes are (i) limitations in reading DNA of extreme sequence content (GC-rich or AT-rich regions, low-complexity sequence contexts) and (ii) insufficient read lengths, which leave various forms of structural variation unresolved and result in mapping ambiguities.

June 1, 2021 |

Targeted sequencing of genes from soybean using NimbleGen SeqCap EZ and PacBio SMRT Sequencing

Full-length gene capture solutions offer opportunities to screen and characterize structural variations and genetic diversity to understand key traits in plants and animals. Through a combined Roche NimbleGen probe capture and SMRT Sequencing strategy, we demonstrate the capability to resolve complex gene structures often observed in plant defense and developmental genes spanning multiple kilobases. The custom panel includes members of the WRKY plant-defense-signaling family, members of the NB-LRR disease-resistance family, and developmental genes important for flowering. The presence of repetitive structures and low-complexity regions makes short-read sequencing of these genes difficult, yet this approach allows researchers to obtain complete sequences for unambiguous resolution of gene models. This strategy has been applied to genomic DNA samples from soybean coupled with barcoding for multiplexing.

June 1, 2021 |

A comprehensive lincRNA analysis: From conifers to trees

We have produced an updated annotation of the Norway spruce genome on the basis of an in silico normalised set of RNA-Seq data obtained from 1,529 samples and comprising 15.5 billion paired-end Illumina HiSeq reads complemented by 18 Mbp of PacBio cDNA data (3.2M sequences). In addition to augmenting and refining the previous protein coding gene annotation, here we focus on the addition of long intergenic non-coding RNA (lincRNA) and micro RNA (miRNA) genes. In addition to non-coding loci, our analyses also identified protein coding genes that had been missed by the initial genome annotation and enabled us to update the annotation of existing gene models. In particular, splice variant information, as supported by PacBio sequencing reads, has been added to the current annotation and previously fragmented gene models have been merged by scaffolding disjoint genomic scaffolds on the basis of transcript evidence. Using this refined annotation, a targeted analysis of the lincRNAs enabled their classification as i) deeply conserved, ii) conserved in seed plants, or iii) gymnosperm/conifer specific. Concurrently, complementary analyses were performed as part of the aspen genome project and the results of a comparative analysis of the lincRNAs conserved in both Norway spruce and Eurasian aspen enabled us to identify conserved and diverged expression profiles. At present, we are delving further into the expression results with the aim of functionally annotating the lincRNA genes by developing a GO annotation based on co-expression network analyses.

June 1, 2021 |

Comparative metagenome-assembled genome analysis of “Candidatus Lachnocurva vaginae”, formerly known as Bacterial Vaginosis Associated bacterium – 1 (BVAB1)

Bacterial Vaginosis Associated bacterium 1 (BVAB1) is an as-yet uncultured bacterial species found in the human vagina that belongs to the family Lachnospiraceae within the order Clostridiales. As its name suggests, this bacterium is often associated with bacterial vaginosis (BV), a common vaginal disorder that has been shown to increase a woman’s risk for HIV, Chlamydia trachomatis, and Neisseria gonorrhoeae infections as well as preterm birth. Further, BVAB1 is associated with the persistence of BV following metronidazole treatment, increased vaginal inflammation, and adverse obstetrics outcomes. There is no available complete genome sequence of BVAB1, which has made it difficult to mechanistically understand its role in disease. We present here a circularized metagenome-assembled genome (cMAG) of BVAB1 as well as a comparative analysis including an additional six metagenome-assembled genomes (MAGs) of this species. These sequences were derived from cervicovaginal samples of seven separate women. The cMAG is 1.649 Mb in size and encodes 1,578 genes. We propose to rename BVAB1 to “Candidatus Lachnocurva vaginae” based on phylogenetic analyses, and provide genomic evidence that this candidate species may metabolize D-lactate, produce trimethylamine (one of the chemicals responsible for BV-associated odor), and be motile. The cMAG and the six MAGs are valuable resources that will further contribute to our understanding of the heterogeneous etiology of bacterial vaginosis.

February 5, 2021 |

PAG Conference: Iso-Seq analysis for plant & animal genomes – annotation evaluation & phasing

In this presentation, Elizabeth Tseng explains how PacBio’s full-length RNA Sequencing using the Iso-Seq method can characterize full-length transcripts without the need for computational transcript assembly. The Iso-Seq method is…

February 5, 2021 |

PAG Conference: Reference-quality drosophila genome assemblies for evolutionary analysis of previously inaccessible genomic regions

In this presentation, Andrew Clark from Cornell University describes work from a collaboration with Manyuan Long of the University of Chicago and Rod Wing of the University of Arizona to…

April 21, 2020 |

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.

April 21, 2020 |

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.

April 21, 2020 |

The Genome of the Zebra Mussel, Dreissena polymorpha: A Resource for Invasive Species Research

The zebra mussel, Dreissena polymorpha, continues to spread from its native range in Eurasia to Europe and North America, causing billions of dollars in damage and dramatically altering invaded aquatic ecosystems. Despite these impacts, there are few genomic resources for Dreissena or related bivalves, with nearly 450 million years of divergence between the zebra mussel and its closest sequenced relative. Although the D. polymorpha genome is highly repetitive, we have used a combination of long-read sequencing and Hi-C-based scaffolding to generate the highest quality molluscan assembly to date. Through comparative analysis and transcriptomics experiments we have gained insights into processes that likely control the invasive success of zebra mussels, including shell formation, synthesis of byssal threads, and thermal tolerance. We identified multiple intact Steamer-Like Elements, a retrotransposon that has been linked to transmissible cancer in marine clams. We also found that D. polymorpha has an unusual 67 kb mitochondrial genome containing numerous tandem repeats, making it the largest observed in Eumetazoa. Together these findings create a rich resource for invasive species research and control efforts.

April 21, 2020 |

Comparative Genomic Analysis of Virulence, Antimicrobial Resistance, and Plasmid Profiles of Salmonella Dublin Isolated from Sick Cattle, Retail Beef, and Humans in the United States.

Salmonella enterica serovar Dublin is a host-adapted serotype associated with typhoidal disease in cattle. While rare in humans, it usually causes severe illness, including bacteremia. In the United States, Salmonella Dublin has become one of the most multidrug-resistant (MDR) serotypes. To understand the genetic elements that are associated with virulence and resistance, we sequenced 61 isolates of Salmonella Dublin (49 from sick cattle and 12 from retail beef) using the Illumina MiSeq and closed 5 genomes using the PacBio sequencing platform. Genomic data of eight human isolates were also downloaded from NCBI (National Center for Biotechnology Information) for comparative analysis. Fifteen Salmonella pathogenicity islands (SPIs) and a spv operon (spvRABCD), which encodes important virulence factors, were identified in all 69 (100%) isolates. The 15 SPIs were located on the chromosome of the 5 closed genomes, with each of these isolates also carrying 1 or 2 plasmids with sizes between 36 and 329 kb. Multiple antimicrobial resistance genes (ARGs), including blaCMY-2, blaTEM-1B, aadA12, aph(3′)-Ia, aph(3′)-Ic, strA, strB, floR, sul1, sul2, and tet(A), along with spv operons were identified on these plasmids. Comprehensive antimicrobial resistance genotypes were determined, including 17 genes encoding resistance to 5 different classes of antimicrobials, and mutations in the housekeeping gene (gyrA) associated with resistance or decreased susceptibility to fluoroquinolones. Together these data revealed that this panel of Salmonella Dublin commonly carried 15 SPIs, MDR/virulence plasmids, and ARGs against several classes of antimicrobials. Such genomic elements may make important contributions to the severity of disease and treatment failures in Salmonella Dublin infections in both humans and cattle.

April 21, 2020 |

RNA sequencing: the teenage years.

Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.

April 21, 2020 |

The Chinese chestnut genome: a reference for species restoration

Forest tree species are increasingly subject to severe mortalities from exotic pests, diseases, and invasive organisms, accelerated by climate change. Forest health issues are threatening multiple species and ecosystem sustainability globally. While sources of resistance may be available in related species, or among surviving trees, introgression of resistance genes into threatened tree species in reasonable time frames requires genome-wide breeding tools. Asian species of chestnut (Castanea spp.) are being employed as donors of disease resistance genes to restore native chestnut species in North America and Europe. To aid in the restoration of threatened chestnut species, we present the assembly of a reference genome with chromosome-scale sequences for Chinese chestnut (C. mollissima), the disease-resistance donor for American chestnut restoration. We also demonstrate the value of the genome as a platform for research and species restoration, including new insights into the evolution of blight resistance in Asian chestnut species, the locations in the genome of ecologically important signatures of selection differentiating American chestnut from Chinese chestnut, the identification of candidate genes for disease resistance, and preliminary comparisons of genome organization with related species.

April 21, 2020 |

Complete Genome of Bacillus velezensis CMT-6 and Comparative Genome Analysis Reveals Lipopeptide Diversity.

The complete genome sequence of Bacillus velezensis type strain CMT-6 is presented for the first time. A comparative analysis of the CMT-6 genome sequence with the genomes of Bacillus amyloliquefaciens DSM7T, B. velezensis FZB42, and Bacillus subtilis 168 revealed major differences in the lipopeptide synthesis genes. Of the above, only the CMT-6 strain possessed an integrated synthetase gene for synthesizing surfactin, iturin, and fengycin. However, CMT-6 shared 14, 12, and 10 other lipopeptide-producing genes with FZB42, DSM7T, and 168, respectively. The largest number of non-synonymous mutations was detected between CMT-6 and 168, in 205 gene sequences that produced these three lipopeptides. Comparing CMT-6 with DSM7T, 58 non-synonymous mutations were detected in gene sequences that contribute to lipopeptide production. In addition, InDels were identified in the yczE and glnR genes. CMT-6 and FZB42 had the lowest number of non-synonymous mutations, with 8 lipopeptide-related gene sequences affected, and InDels were identified only in yczE. The numbers of core genes, InDels, and non-synonymous mutations in genes were the main reasons for the differences in yield and variety of lipopeptides. These results will enrich the genomic resources available for B. velezensis and provide fundamental information to construct strains that can produce specific lipopeptides.

April 21, 2020 |

Genome data of Fusarium oxysporum f. sp. cubense race 1 and tropical race 4 isolates using long-read sequencing.

Fusarium wilt of banana is caused by the soil-borne fungal pathogen Fusarium oxysporum f. sp. cubense (Foc). We generated two chromosome-level assemblies of Foc race 1 and tropical race 4 strains using single-molecule real-time sequencing. The Foc1 and FocTR4 assemblies had 35 and 29 contigs with contig N50 lengths of 2.08 Mb and 4.28 Mb, respectively. These two new reference genomes represent a greater than 100-fold improvement over the contig N50 statistics of the previous short read-based Foc assemblies. The two high-quality assemblies reported here will be a valuable resource for the comparative analysis of Foc races at the pathogenic levels.
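
Several abstracts in this archive compare assemblies by contig N50 (or by NG50, its genome-size-normalised variant). As a reading aid, here is a minimal Python sketch of the commonly used definition: N50 is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly size. The function name `n50`, the optional `genome_size` argument, and the toy contig lengths are illustrative assumptions; they are not taken from any of the studies listed here.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of contig lengths.

    If genome_size is given, half of that value is used as the
    threshold instead of half of the assembly size, which yields
    NG50. Both names are hypothetical and used for illustration only.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")

    # N50 uses the total assembly size; NG50 uses an external genome size.
    total = genome_size if genome_size is not None else sum(lengths)
    half = total / 2

    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

    # Only reachable for NG50 when the assembly covers less than
    # half of the stated genome size.
    return 0


# Toy example with made-up contig lengths (arbitrary units).
toy_contigs = [10, 20, 30, 40]
print(n50(toy_contigs))                    # N50 of the toy assembly
print(n50(toy_contigs, genome_size=200))   # NG50 against a larger genome
```

Because N50 depends only on the assembly itself, it can rise simply by discarding short contigs; NG50 avoids this by fixing the threshold to an independent genome size estimate.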
