Here we describe the ways in which the sequence and annotation of the Plasmodium falciparum reference genome have changed since its publication in 2002. P. falciparum is the malaria species responsible for the most deaths worldwide, so the richness of its annotation and the accuracy of its sequence are important resources for the P. falciparum research community, as well as the basis for interpreting the genomes of subsequently sequenced species. At the time of publication in 2002, over 60% of predicted genes had unknown functions. As of March 2019, this number has decreased to 33%. The reduction is due to the inclusion of genes that were subsequently characterised experimentally and genes with significant similarity to others with known functions. In addition, the structural annotation of genes has been significantly refined; 27% of gene structures have been changed since 2002, comprising changes in exon-intron boundaries, addition or deletion of exons and the addition or deletion of genes. The sequence has also undergone significant improvements. In addition to the correction of a large number of single-base and insertion or deletion errors, a major misassembly between the subtelomeres of chromosomes 7 and 8 has been corrected. As the number of sequenced isolates continues to grow rapidly, a single reference genome will not be an adequate basis for interpreting intra-species sequence diversity. We therefore describe in this publication a population reference genome of P. falciparum, called Pfref1. This reference will enable the community to map to regions that are not present in the current assembly. P. falciparum 3D7 will continue to be maintained, with ongoing curation ensuring continual improvements in annotation quality.
Comparative genomic and phylogenetic analyses of Populus section Leuce using complete chloroplast genome sequences
Species of Populus section Leuce are distributed throughout most parts of the Northern Hemisphere and have important economic and ecological significance. However, due to frequent hybridization within Leuce, the phylogenetic relationships between species have not been clarified. The chloroplast (cp) genome is characterized by maternal inheritance and relatively conservative mutation rates; thus, it is a powerful tool for building phylogenetic trees. In this study, we used the PacBio SEQUEL platform to determine that the cp genome of Populus tomentosa has a length of 156,558 bp, including a large single-copy region (84,717 bp), a small single-copy region (16,555 bp), and a pair of inverted repeat regions (27,643 bp each). The cp genome contains 131 unique genes, including 37 transfer RNAs, 8 ribosomal RNAs, and 86 protein-coding genes. We compared the cp genomes of seven species of section Leuce and identified five cp DNA markers with >1% variable sites. Phylogenetic analyses revealed two evolutionary branches for section Leuce. The species most closely related to P. tomentosa was P. adenopoda, followed by P. alba. These cp genome data will help to determine the cp evolution of section Leuce and further elucidate the origin of P. tomentosa.
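As a quick consistency check on the figures above, the four regions of the cp genome should sum to the reported total length, and the three gene classes should sum to the reported gene count. The short sketch below uses only the numbers stated in the abstract and assumes that the inverted repeat length is given per copy; the variable names are illustrative.

```python
# Region lengths in bp, taken from the abstract.
large_single_copy = 84_717
small_single_copy = 16_555
inverted_repeat = 27_643  # assumed to be the length of ONE of the two copies

# Total length: LSC + SSC + two inverted repeat copies.
total_length = large_single_copy + small_single_copy + 2 * inverted_repeat

# Gene classes reported in the abstract.
gene_counts = {"tRNA": 37, "rRNA": 8, "protein_coding": 86}
total_genes = sum(gene_counts.values())

print(total_length, total_genes)
```

Both sums agree with the reported totals (156,558 bp and 131 genes), which supports reading the inverted repeat length as a per-copy value.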
In recent genome analyses, population-specific reference panels have proven important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bp to ~10,000 bp in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.
As the genomes of more metazoan species are sequenced, reports of horizontal transposon transfers (HTT) have increased. Our understanding of the mechanisms of such events is at an early stage. The close physical relationship between a parasite and its host could facilitate horizontal transfer. To date, two studies have identified horizontal transfer of RTEs, a class of retrotransposable elements, involving parasites: ticks might act as a vector for BovB between ruminants and squamates, and AviRTE was transferred between birds and parasitic nematodes. We searched for RTEs shared between nematode and mammalian genomes. Given their physical proximity, it was necessary to detect and remove sequence contamination from the genome datasets, which would otherwise distort the signal of horizontal transfer. We developed an approach that is based on reads instead of genomic sequences to reliably detect contamination. From comparison of 43 RTEs across 197 genomes, we identified a single putative case of horizontal transfer: we detected RTE1_Sar from Sorex araneus, the common shrew, in parasitic nematodes. From the taxonomic distribution and evolutionary analysis, we show that RTE1_Sar was horizontally transferred. We identified a new horizontal RTE transfer in host-parasite interactions, which suggests that such transfers are not uncommon. Further, we present and provide the workflow of a read-based method to distinguish between contamination and horizontal transfer.
Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory, making the best possible use of the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
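To make the two data structures concrete, the sketch below computes the BWT and LCP array of a small collection by naively sorting all suffixes in RAM. It is not the external-memory merge algorithm described above, only an illustration of its output. The function name `bwt_lcp`, the use of `$` for every end marker, and the convention that the end marker of an earlier sequence sorts before that of a later one are assumptions made for this example.

```python
def bwt_lcp(seqs):
    """Naive in-memory BWT and LCP array for a collection of strings.

    Illustrative only: quadratic-time suffix sorting, not an
    external-memory construction.
    """
    # Encode symbols as tuples so that every end marker (0, i) sorts
    # before every real character (1, c), and markers are all distinct.
    texts = [[(1, c) for c in s] + [(0, i)] for i, s in enumerate(seqs)]

    # Pair each suffix with the symbol that precedes it in its own text.
    # For the full text (k == 0), t[k - 1] is t[-1], the end marker.
    rows = []
    for t in texts:
        for k in range(len(t)):
            rows.append((t[k:], t[k - 1]))
    rows.sort(key=lambda row: row[0])

    # BWT: preceding symbol of each suffix in sorted order.
    bwt = "".join(sym if kind == 1 else "$" for _, (kind, sym) in rows)

    # LCP: common prefix length of each suffix with its predecessor.
    lcp = [0] * len(rows)
    for j in range(1, len(rows)):
        a, b = rows[j - 1][0], rows[j][0]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[j] = n
    return bwt, lcp


bwt, lcp = bwt_lcp(["ACG", "AG"])
print(bwt)       # GG$$ACA
print(lcp)       # [0, 0, 0, 1, 0, 0, 1]
print(max(lcp))  # maxlcp for this collection: 1
```

For the collection {ACG, AG}, the sorted suffixes are $, $, ACG$, AG$, CG$, G$, G$, so maxlcp is 1. The naive approach holds every suffix in memory at once, which is exactly what an external-memory construction is designed to avoid.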
Vertebrate genomes contain a record of retroviruses that invaded the germlines of ancestral hosts and are passed to offspring as endogenous retroviruses (ERVs). ERVs can impact host function since they contain the necessary sequences for expression within the host. Dogs are an important system for the study of disease and evolution, yet no substantiated reports of infectious retroviruses in dogs exist. Here, we utilized Illumina whole genome sequence data to assess the origin and evolution of a recently active gammaretroviral lineage in domestic and wild canids. We identified numerous recently integrated loci of a canid-specific ERV-Fc sublineage within Canis, including 58 insertions that were absent from the reference assembly. Insertions were found throughout the dog genome including within and near gene models. By comparison of orthologous occupied sites, we characterized element prevalence across 332 genomes including all nine extant canid species, revealing evolutionary patterns of ERV-Fc segregation among species as well as subpopulations. Sequence analysis revealed common disruptive mutations, suggesting a predominant form of ERV-Fc spread by trans complementation of defective proviruses. ERV-Fc activity included multiple circulating variants that infected canid ancestors from the last 20 million to within 1.6 million years, with recent bursts of germline invasion in the sublineage leading to wolves and dogs.
Comprehensive analysis of full genome sequence and Bd-milRNA/target mRNAs to discover the mechanism of hypovirulence in Botryosphaeria dothidea strains on pear infection with BdCV1 and BdPV1
Pear ring rot disease, mainly caused by Botryosphaeria dothidea, is widespread in most pear and apple-growing regions. Mycoviruses are used for biocontrol, especially in fruit tree disease. BdCV1 (Botryosphaeria dothidea chrysovirus 1) and BdPV1 (Botryosphaeria dothidea partitivirus 1) influence the biological characteristics of B. dothidea strains. BdCV1 is a potential candidate for the control of fungal disease. Therefore, it is vital to explore interactions between B. dothidea and mycoviruses to clarify the pathogenic mechanisms of B. dothidea and the hypovirulence of B. dothidea in pear. A high-quality full-length genome sequence of the B. dothidea LW-Hubei isolate was obtained using Single Molecule Real-Time sequencing. The genome has a high repeat content (9.3%) and shows evidence of DNA methylation. The 46.34 Mb genome contained 14,091 predicted genes, of which 13,135 were annotated. B. dothidea was predicted to express 3833 secreted proteins. In bioinformatics analysis, 351 CAZy members, 552 transporters, 128 kinases, and 1096 proteins associated with plant-host interaction (PHI) were identified. RNA-silencing components, including two Dicer endoribonucleases, four Argonaute (Ago) proteins and three RNA-dependent RNA polymerases (RdRp), were identified and expressed in response to mycovirus infection. Horizontal transfer of the LW-C and LW-P strains indicated that BdCV1 induced host gene silencing in LW-C to suppress BdPV1 transmission. To investigate the role of RNA-silencing in B. dothidea defense, we constructed four small RNA libraries and sequenced B. dothidea micro-like RNAs (Bd-milRNAs) produced in response to BdCV1 and BdPV1 infection. Among these, 167 conserved and 68 candidate novel Bd-milRNAs were identified, of which 161 conserved and 20 novel Bd-milRNAs were differentially expressed.
WEGO analysis revealed involvement of the differentially expressed Bd-milRNA-targeted genes in metabolic process, catalytic activity, cellular process and response to stress or stimulus. Compared with virus-free strains, BdCV1 had a greater effect than BdPV1 on the phenotype, virulence, conidiomata, vertical and horizontal transmission ability, and mycelial cellular structure of B. dothidea strains. The results obtained in this study indicate that mycoviruses regulate biological processes in B. dothidea through the combined interaction of antiviral defense mediated by RNA-silencing and milRNA-mediated regulation of target gene mRNA expression.