Although Plasmodium vivax is a leading cause of malaria around the world, only a handful of vivax antigens are being studied for vaccine development. Here, we investigated genetic signatures of selection and geospatial genetic diversity of two leading vivax vaccine antigens–Plasmodium vivax merozoite surface protein 1 (pvmsp-1) and Plasmodium vivax circumsporozoite protein (pvcsp). Using scalable next-generation sequencing, we deep-sequenced amplicons of the 42 kDa region of pvmsp-1 (n?=?44) and the complete gene of pvcsp (n?=?47) from Cambodian isolates. These sequences were then compared with global parasite populations obtained from GenBank. Using a combination of statistical and phylogenetic methods to assess…
The genome of Helicobacter pylori is remarkable for its large number of restriction-modification (R-M) systems, and strain-specific diversity in R-M systems has been suggested to limit natural transformation, the major driving force of genetic diversification in H. pylori. We have determined the comprehensive methylomes of two H. pylori strains at single base resolution, using Single Molecule Real-Time (SMRT®) sequencing. For strains 26695 and J99-R3, 17 and 22 methylated sequence motifs were identified, respectively. For most motifs, almost all sites occurring in the genome were detected as methylated. Twelve novel methylation patterns corresponding to nine recognition sequences were detected (26695, 3;…
Although cholera has been present in Latin America since 1991, it had not been epidemic in Haiti for at least 100 years. Recently, however, there has been a severe outbreak of cholera in Haiti.We used third-generation single-molecule real-time DNA sequencing to determine the genome sequences of 2 clinical Vibrio cholerae isolates from the current outbreak in Haiti, 1 strain that caused cholera in Latin America in 1991, and 2 strains isolated in South Asia in 2002 and 2008. Using primary sequence data, we compared the genomes of these 5 strains and a set of previously obtained partial genomic sequences of…
Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects.We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs.…
Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR…
The cultivation of rice in Africa dates back more than 3,000 years. Interestingly, African rice is not of the same origin as Asian rice (Oryza sativa L.) but rather is an entirely different species (i.e., Oryza glaberrima Steud.). Here we present a high-quality assembly and annotation of the O. glaberrima genome and detailed analyses of its evolutionary history of domestication and selection. Population genomics analyses of 20 O. glaberrima and 94 Oryza barthii accessions support the hypothesis that O. glaberrima was domesticated in a single region along the Niger river as opposed to noncentric domestication events across Africa. We detected…
Characterizing large genomic variants is essential to expanding the research and clinical applications of genome sequencing. While multiple data types and methods are available to detect these structural variants (SVs), they remain less characterized than smaller variants because of SV diversity, complexity, and size. These challenges are exacerbated by the experimental and computational demands of SV analysis. Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods.We demonstrate Parliament’s efficacy via integrated analyses of data from whole-genome array comparative genomic hybridization, short-read next-generation…
Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The…
Structural rearrangements have long been recognized as an important source of genetic variation, with implications in phenotypic diversity and disease, yet their detailed evolutionary dynamics remain elusive. Here we use long-read sequencing to generate end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus. These population-level high-quality genomes with comprehensive annotation enable precise definition of chromosomal boundaries between cores and subtelomeres and a high-resolution view of evolutionary genome dynamics. In chromosomal cores, S. paradoxus shows faster accumulation of balanced rearrangements (inversions, reciprocal translocations and transpositions), whereas S. cerevisiae…
‘Candidatus Liberibacter solanacearum’ contains two solanaceous crop-infecting haplotypes, A and B. Two haplotype A draft genomes were assembled and compared with ZC1 (haplotype B), revealing inversion and relocation genomic rearrangements, numerous single-nucleotide polymorphisms, and differences in phage-related regions. Differences in prophage location and sequence were seen both within and between haplotype comparisons. OrthoMCL and BLAST analyses identified 46 putative coding sequences present in haplotype A that were not present in haplotype B. Thirty-eight of these loci were not found in sequences from other Liberibacter spp. Quantitative polymerase chain reaction (qPCR) assays designed to amplify sequences from 15 of these loci…
Chromosomal rearrangements, which shuffle DNA throughout the genome, are an important source of divergence across taxa. Using a paired-end read approach with Illumina sequence data for archaic humans, I identify changes in genome structure that occurred recently in human evolution. Hundreds of rearrangements indicate genomic trafficking between the sex chromosomes and autosomes, raising the possibility of sex-specific changes. Additionally, genes adjacent to genome structure changes in Neanderthals are associated with testis-specific expression, consistent with evolutionary theory that new genes commonly form with expression in the testes. I identify one case of new-gene creation through transposition from the Y chromosome to…
Comprehensive discovery of structural variation (SV) from whole genome sequencing data requires multiple detection signals including read-pair, split-read, read-depth and prior knowledge. Owing to technical challenges, extant SV discovery algorithms either use one signal in isolation, or at best use two sequentially. We present LUMPY, a novel SV discovery framework that naturally integrates multiple SV signals jointly across multiple samples. We show that LUMPY yields improved sensitivity, especially when SV signal is reduced owing to either low coverage data or low intra-sample variant allele frequency. We also report a set of 4,564 validated breakpoints from the NA12878 human genome. https://github.com/arq5x/lumpy-sv.
Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome…
Whole-genome sequences are now available for many microbial species and clades, however, existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The…
The DNA Data Bank of Japan Center (DDBJ Center; http://www.ddbj.nig.ac.jp) maintains and provides public archival, retrieval and analytical services for biological information. Since October 2013, DDBJ Center has operated the Japanese Genotype-phenotype Archive (JGA) in collaboration with our partner institute, the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency. DDBJ Center provides the JGA database system which securely stores genotype and phenotype data collected from individuals whose consent agreements authorize data release only for specific research use. NBDC has established guidelines and policies for sharing human-derived data and reviews data submission and usage requests from researchers.…