In this Labroots webinar, Meredith Ashby, Director of Microbial Genomics at PacBio, describes the utility of highly accurate long-read sequencing, known as HiFi sequencing, to understand the SARs-CoV-2 viral genome….
Motivation: Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
The antibody repertoire of Bos taurus is characterized by a subset of variable heavy (VH) chain regions with ultralong third complementarity determining regions (CDR3) which, compared to other species, can provide a potent response to challenging antigens like HIV env. These unusual CDR3 can range to over seventy highly diverse amino acids in length and form unique ß-ribbon ‘stalk’ and disulfide bonded ‘knob’ structures, far from the typical antigen binding site. The genetic components and processes for forming these unusual cattle antibody VH CDR3 are not well understood. Here we analyze sequences of Bos taurus antibody VH domains and find that the subset with ultralong CDR3 exclusively uses a single variable gene, IGHV1-7 (VHBUL) rearranged to the longest diversity gene, IGHD8-2. An eight nucleotide duplication at the 3′ end of IGHV1-7 encodes a longer V-region producing an extended F ß-strand that contributes to the stalk in a rearranged CDR3. A low amino acid variability was observed in CDR1 and CDR2, suggesting that antigen binding for this subset most likely only depends on the CDR3. Importantly a novel, potentially AID mediated, deletional diversification mechanism of the B. taurus VH ultralong CDR3 knob was discovered, in which interior codons of the IGHD8-2 region are removed while maintaining integral structural components of the knob and descending strand of the stalk in place. These deletions serve to further diversify cysteine positions, and thus disulfide bonded loops. Hence, both germline and somatic genetic factors and processes appear to be involved in diversification of this structurally unusual cattle VH ultralong CDR3 repertoire.
Extracellular RNA has been proposed to mediate communication between cells and organisms however relatively little is understood regarding how specific sequences are selected for export. Here, we describe a specific Argonaute protein (exWAGO) that is secreted in extracellular vesicles (EVs) released by the gastrointestinal nematode Heligmosomoides bakeri, at multiple copies per EV. Phylogenetic and gene expression analyses demonstrate exWAGO orthologues are highly conserved and abundantly expressed in related parasites but highly diverged in free-living genus Caenorhabditis. We show that the most abundant small RNAs released from the nematode parasite are not microRNAs as previously thought, but rather secondary small interfering RNAs (siRNAs) that are produced by RNA-dependent RNA Polymerases. The siRNAs that are released in EVs have distinct evolutionary properties compared to those resident in free-living or parasitic nematodes. Immunoprecipitation of exWAGO demonstrates that it specifically associates with siRNAs from transposons and newly evolved repetitive elements that are packaged in EVs and released into the host environment. Together this work demonstrates molecular and evolutionary selectivity in the small RNA sequences that are released in EVs into the host environment and identifies a novel Argonaute protein as the mediator of this. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.The source code of the benchmarking tool is available as Supplementary Materials. © The Author 2017. Published by Oxford University Press.
Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system.
Complete and contiguous genome assemblies greatly improve the quality of subsequent systems-wide functional profiling studies and the ability to gain novel biological insights. While a de novo genome assembly of an isolated bacterial strain is in most cases straightforward, more informative data about co-existing bacteria as well as synergistic and antagonistic effects can be obtained from a direct analysis of microbial communities. However, the complexity of metagenomic samples represents a major challenge. While third generation sequencing technologies have been suggested to enable finished metagenome-assembled genomes, to our knowledge, the complete genome assembly of all dominant strains in a microbiome sample has not been demonstrated. Natural whey starter cultures (NWCs) are used in cheese production and represent low-complexity microbiomes. Previous studies of Swiss Gruyère and selected Italian hard cheeses, mostly based on amplicon metagenomics, concurred that three species generally pre-dominate: Streptococcus thermophilus, Lactobacillus helveticus and Lactobacillus delbrueckii.Two NWCs from Swiss Gruyère producers were subjected to whole metagenome shotgun sequencing using the Pacific Biosciences Sequel and Illumina MiSeq platforms. In addition, longer Oxford Nanopore Technologies MinION reads had to be generated for one to resolve repeat regions. Thereby, we achieved the complete assembly of all dominant bacterial genomes from these low-complexity NWCs, which was corroborated by a 16S rRNA amplicon survey. Moreover, two distinct L. helveticus strains were successfully co-assembled from the same sample. Besides bacterial chromosomes, we could also assemble several bacterial plasmids and phages and a corresponding prophage. Biologically relevant insights were uncovered by linking the plasmids and phages to their respective host genomes using DNA methylation motifs on the plasmids and by matching prokaryotic CRISPR spacers with the corresponding protospacers on the phages. These results could only be achieved by employing long-read sequencing data able to span intragenomic as well as intergenomic repeats.Here, we demonstrate the feasibility of complete de novo genome assembly of all dominant strains from low-complexity NWCs based on whole metagenomics shotgun sequencing data. This allowed to gain novel biological insights and is a fundamental basis for subsequent systems-wide omics analyses, functional profiling and phenotype to genotype analysis of specific microbial communities.
In recent genome analyses, population-specific reference panels have indicated important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100?bps to ~10,000?bps in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short-reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.