Background: Understanding the co-evolution of HIV populations and broadly neutralizing antibodies (bNAbs) may inform vaccine design. Novel long-read, next-generation sequencing methods allow, for the first time, full-length deep sequencing of HIV env populations. Methods: We longitudinally examined HIV-1 env populations (12 time points) in a subtype A infected individual from the IAVI primary infection cohort (Protocol C) who developed bNAbs (62% ID50>50 on a diverse panel of 105 viruses) targeting the V1/V2 loop region. We developed a PacBio single molecule, real-time sequencing protocol to deeply sequence full-length env from HIV RNA. Bioinformatics tools were developed to align env sequences, infer phylogenies, and interrogate escape dynamics of key residues and glycosylation sites. PacBio env sequences were compared to env sequences generated through amplification and cloning. Env dynamics and viral escape motif evolution were interpreted in the context of the development V1/V2-targeting broadly neutralizing antibodies. Results: We collected a median of 6799 (range: 1770-14727) high quality full-length HIV env circular consensus sequences (CCS) per SMRT Cell, per time point. Using only CCS reads comprised of 6 or more passes over the HIV env insert (= 16 kb read length) ensured that our median per-base accuracy was 99.7%. A phylogeny inferred with PacBio and 100 cloned env sequences (10 time points) found the cloned sequences evenly distributed among PacBio sequences. Viral escape from the V1/V2 targeted bNAbs was evident at V2 positions 160, 166, 167, 169 and 181 (HxB2 numbering), exhibiting several distinct escape pathways by 40 months post-infection. Conclusions: Our PacBio full-length env sequencing method allowed unprecedented view and ability to characterize HIV-1 env dynamics throughout the first four years of infection. Longitudinal full-length env deep sequencing allows accurate phylogenetic inference, provides a detailed picture of escape dynamics in epitope regions, and can identify minority variants, all of which will prove critical for increasing our understanding of how env evolution drives the development of antibody breadth.
ASHG 2015 GRC workshop talk by Tina Graves-Lindsay
Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
The human reference sequence has provided a foundation for studies of genome structure, human variation, evolutionary biology, and disease. At the time the reference was originally completed there were some loci recalcitrant to closure; however, the degree to which structural variation and diversity affected our ability to produce a representative genome sequence at these loci was still unknown. Many of these regions in the genome are associated with large, repetitive sequences and exhibit complex allelic diversity such producing a single, haploid representation is not possible. To overcome this challenge, we have sequenced DNA from two hydatidiform moles (CHM1 and CHM13), which are essentially haploid. CHM13 was sequenced with the latest PacBio technology (P6-C5) to 52X genome coverage and assembled using Daligner and Falcon v0.2 (GCA_000983455.1, CHM13_1.1). Compared to the first mole (CHM1) PacBio assembly (GCA_001007805.1, 54X) contig N50 of 4.5Mb, the contig N50 of CHM13_1.1 is almost 13Mb, and there is a 13-fold reduction in the number of contigs. This demonstrates the improved contiguity of sequence generated with the new chemistry. We annotated 50,188 RefSeq transcripts of which only 0.63% were split transcripts, and the repetitive and segmental duplication content was within the expected range. These data all indicate an extremely high quality assembly. Additionally, we sequenced CHM13 DNA using Illumina SBS technology to 60X coverage, aligned these reads to the GRCh37, GRCh38, and CHM13_1.1 assemblies and performed variant calling using the SpeedSeq pipeline. The number of single nucleotide variants (SNV) and indels was comparable between GRCh37 and GRCh38. Regions that showed increased SNV density in GRCh38 compared to GRCh37 could be attributed to the addition of centromeric alpha satellite sequence to the reference assembly. Alternatively, regions of decreased SNV density in GRCh38 were concentrated in regions that were improved from BAC based sequencing of CHM1 such as 1p12 and 1q21 containing the SRGAP2 gene family. The alignment of PacBio reads to GRCh37 and GRCh38 assemblies allowed us to resolve complex loci such as the MHC region where the best alignment was to the DBB (A2-B57-DR7) haplotype. Finally, we will discuss how combining the two high quality mole assemblies can be used for benchmarking and novel bioinformatics tool development.
With the increasing availability of whole-genome sequencing, haplotype reconstruction of individual genomes, or haplotype assembly, remains unsolved. Like the de novo genome assembly problem, haplotype assembly is greatly simplified by having more long-range information. The Targeted Locus Amplification (TLA) technology from Cergentis has the unique capability of targeting a specific region of the genome using a single primer pair and yielding ~2 kb DNA circles that are comprised of ~500 bp fragments. Fragments from the same circle come from the same haplotype and follow an exponential decay in distance from the target region, with a span that reaches the multi-megabase range. Here, we apply TLA to the BRCA1 gene on NA12878 and then sequence the resulting 2 kb circles on a PacBio RS II. The multiple fragments per circle were iteratively mapped to hg19 and then haplotype assembled using HAPCUT. We show that the 80 kb length of BRCA1 is represented by a single haplotype block, which was validated against GIAB data. We then explored chromosomal-scale haplotype assembly by combining these data with whole genome shotgun PacBio long reads, and demonstrate haplotype blocks approaching the length of chromosome 17 on which BRCA1 lies. Finally, by performing TLA without the amplification step and size selecting for reads >5 kb to maximize the number of fragments per read, we target whole genome haplotype assembly across all chromosomes.
The domestic pig (Sus scrofa) is important both as a food source and as a biomedical model with high anatomical and immunological similarity to humans. The draft reference genome (Sscrofa10.2) represented a purebred female pig from a commercial pork production breed (Duroc), and was established using older clone-based sequencing methods. The Sscrofa10.2 assembly was incomplete and unresolved redundancies, short range order and orientation errors and associated misassembled genes limited its utility. We present two highly contiguous chromosome-level genome assemblies created with more recent long read technologies and a whole genome shotgun strategy, one for the same Duroc female (Sscrofa11.1) and one for an outbred, composite breed male animal commonly used for commercial pork production (USMARCv1.0). Both assemblies are of substantially higher (>90-fold) continuity and accuracy compared to the earlier reference, and the availability of two independent assemblies provided an opportunity to identify large-scale variants and to error-check the accuracy of representation of the genome. We propose that the improved Duroc breed assembly (Sscrofa11.1) become the reference genome for genomic research in pigs.
Updated assembly resource of Phytophthora ramorum Pr102 isolate incorporating long reads from PacBio sequencing.
The NA1 clonal lineage of Phytophthora ramorum is responsible for Sudden Oak Death, an epidemic that has devastated California’s coastal forest ecosystems. An NA1 isolate Pr102 derived from coast live oak in California was previously sequenced and reported with 65 Mb assembly containing 12 Mb gaps in 2576 scaffolds. Here we report an improved 70 Mb genome in 1512 scaffolds with 6752 bp gaps after incorporating PacBio P5-C3 longreads. This assembly contains 19494 gene models (average gene length 2515 bp) compared to 16134 genes (average gene length of 1673 bp) in the previous version. We predicted 29 new RXLRs and 76 new paralogs of a total 392 RXLRs from this assembly. We predicted 35 CRNs compared to 19 in earlier version with six paralogs. Our lncRNAs prediction identified 255 candidates. This new resource will be invaluable for future evolution studies on the invasive plant pathogen.
We report an improved assembly and scaffolding of the European pear (Pyrus communis L.) genome (referred to as BartlettDHv2.0), obtained using a combination of Pacific Biosciences RSII Long read sequencing (PacBio), Bionano optical mapping, chromatin interaction capture (Hi-C), and genetic mapping. A total of 496.9 million bases (Mb) corresponding to 97% of the estimated genome size were assembled into 494 scaffolds. Hi-C data and a high-density genetic map allowed us to anchor and orient 87% of the sequence on the 17 chromosomes of the pear genome. About 50% (247 Mb) of the genome consists of repetitive sequences. Comparison with previous assemblies of Pyrus communis. and Pyrus x bretschneideri confirmed the presence of 37,445 protein-coding genes, which is 13% fewer than previously predicted.
Mce3R Stress-Resistance Pathway Is Vulnerable to Small-Molecule Targeting That Improves Tuberculosis Drug Activities.
One-third of the world’s population carries Mycobacterium tuberculosis ( Mtb), the infectious agent that causes tuberculosis (TB), and every 17 s someone dies of TB. After infection, Mtb can live dormant for decades in a granuloma structure arising from the host immune response, and cholesterol is important for this persistence of Mtb. Current treatments require long-duration drug regimens with many associated toxicities, which are compounded by the high doses required. We phenotypically screened 35 6-azasteroid analogues against Mtb and found that, at low micromolar concentrations, a subset of the analogues sensitized Mtb to multiple TB drugs. Two analogues were selected for further study to characterize the bactericidal activity of bedaquiline and isoniazid under normoxic and low-oxygen conditions. These two 6-azasteroids showed strong synergy with bedaquiline (fractional inhibitory concentration index = 0.21, bedaquiline minimal inhibitory concentration = 16 nM at 1 µM 6-azasteroid). The rate at which spontaneous resistance to one of the 6-azasteroids arose in the presence of bedaquiline was approximately 10-9, and the 6-azasteroid-resistant mutants retained their isoniazid and bedaquiline sensitivity. Genes in the cholesterol-regulated Mce3R regulon were required for 6-azasteroid activity, whereas genes in the cholesterol catabolism pathway were not. Expression of a subset of Mce3R genes was down-regulated upon 6-azasteroid treatment. The Mce3R regulon is implicated in stress resistance and is absent in saprophytic mycobacteria. This regulon encodes a cholesterol-regulated stress-resistance pathway that we conclude is important for pathogenesis and contributes to drug tolerance, and this pathway is vulnerable to small-molecule targeting in live mycobacteria.
Motivation: Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
Brassica napus (AACC, 2n = 38) is an important oilseed crop grown worldwide. However, little is known about the population evolution of this species, the genomic difference between its major genetic groups, such as European and Asian rapeseed, and the impacts of historical large-scale introgression events on this young tetraploid. In this study, we reported the de novo assembly of the genome sequences of an Asian rapeseed (B. napus), Ningyou 7, and its four progenitors and compared these genomes with other available genomic data from diverse European and Asian cultivars. Our results showed that Asian rapeseed originally derived from European rapeseed but subsequently significantly diverged, with rapid genome differentiation after hybridization and intensive local selective breeding. The first historical introgression of B. rapa dramatically broadened the allelic pool but decreased the deleterious variations of Asian rapeseed. The second historical introgression of the double-low traits of European rapeseed (canola) has reshaped Asian rapeseed into two groups (double-low and double-high), accompanied by an increase in genetic load in the double-low group. This study demonstrates distinctive genomic footprints and deleterious SNP (single nucleotide polymorphism) variants for local adaptation by recent intra- and interspecies introgression events and provides novel insights for understanding the rapid genome evolution of a young allopolyploid crop. © 2019 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Comparison of DNA Methylation in Vibrio vulnificus Cells Grown in Human Serum with Those Grown in Seawater.
The chromosomal methylation statuses of the highly virulent Vibrio vulnificus strain CMCP6 grown in human serum and in seawater are compared here. Growth in seawater resulted in ~4 times as much methylation as that in human serum, primarily N4-methylcytosines.Copyright © 2019 Conrad and Harwood.
Complete Genome Sequence of Enterococcus faecium QU50, a Thermophilic Lactic Acid Bacterium Capable of Metabolizing Xylose.
Herein, we report the complete genome sequence of Enterococcus faecium QU50, isolated from Egyptian soil and exhibiting intermediate susceptibility to vancomycin. The genome contains a 2,535,796-bp circular chromosome and two plasmids of 196,595?bp and 17,267?bp. IS1062-like sequences were not found.Copyright © 2019 Abe et al.
Anopheles funestus is one of the 3 most consequential and widespread vectors of human malaria in tropical Africa. However, the lack of a high-quality reference genome has hindered the association of phenotypic traits with their genetic basis in this important mosquito.Here we present a new high-quality A. funestus reference genome (AfunF3) assembled using 240× coverage of long-read single-molecule sequencing for contigging, combined with 100× coverage of short-read Hi-C data for chromosome scaffolding. The assembled contigs total 446 Mbp of sequence and contain substantial duplication due to alternative alleles present in the sequenced pool of mosquitos from the FUMOZ colony. Using alignment and depth-of-coverage information, these contigs were deduplicated to a 211 Mbp primary assembly, which is closer to the expected haploid genome size of 250 Mbp. This primary assembly consists of 1,053 contigs organized into 3 chromosome-scale scaffolds with an N50 contig size of 632 kbp and an N50 scaffold size of 93.811 Mbp, representing a 100-fold improvement in continuity versus the current reference assembly, AfunF1.This highly contiguous and complete A. funestus reference genome assembly will serve as an improved basis for future studies of genomic variation and organization in this important disease vector. © The Author(s) 2019. Published by Oxford University Press.
African cichlid fishes are well known for their rapid radiations and are a model system for studying evolutionary processes. Here we compare multiple, high-quality, chromosome-scale genome assemblies to elucidate the genetic mechanisms underlying cichlid diversification and study how genome structure evolves in rapidly radiating lineages.We re-anchored our recent assembly of the Nile tilapia (Oreochromis niloticus) genome using a new high-density genetic map. We also developed a new de novo genome assembly of the Lake Malawi cichlid, Metriaclima zebra, using high-coverage Pacific Biosciences sequencing, and anchored contigs to linkage groups (LGs) using 4 different genetic maps. These new anchored assemblies allow the first chromosome-scale comparisons of African cichlid genomes. Large intra-chromosomal structural differences (~2-28 megabase pairs) among species are common, while inter-chromosomal differences are rare (<10 megabase pairs total). Placement of the centromeres within the chromosome-scale assemblies identifies large structural differences that explain many of the karyotype differences among species. Structural differences are also associated with unique patterns of recombination on sex chromosomes. Structural differences on LG9, LG11, and LG20 are associated with reduced recombination, indicative of inversions between the rock- and sand-dwelling clades of Lake Malawi cichlids. M. zebra has a larger number of recent transposable element insertions compared with O. niloticus, suggesting that several transposable element families have a higher rate of insertion in the haplochromine cichlid lineage.This study identifies novel structural variation among East African cichlid genomes and provides a new set of genomic resources to support research on the mechanisms driving cichlid adaptation and speciation. © The Author(s) 2019. Published by Oxford University Press.
Salmonella Typhimurium sequence type (ST) 313 causes invasive nontyphoidal Salmonella (iNTS) disease in sub-Saharan Africa, targeting susceptible HIV+, malarial, or malnourished individuals. An in-depth genomic comparison between the ST313 isolate D23580 and the well-characterized ST19 isolate 4/74 that causes gastroenteritis across the globe revealed extensive synteny. To understand how the 856 nucleotide variations generated phenotypic differences, we devised a large-scale experimental approach that involved the global gene expression analysis of strains D23580 and 4/74 grown in 16 infection-relevant growth conditions. Comparison of transcriptional patterns identified virulence and metabolic genes that were differentially expressed between D23580 versus 4/74, many of which were validated by proteomics. We also uncovered the S. Typhimurium D23580 and 4/74 genes that showed expression differences during infection of murine macrophages. Our comparative transcriptomic data are presented in a new enhanced version of the Salmonella expression compendium, SalComD23580: http://bioinf.gen.tcd.ie/cgi-bin/salcom_v2.pl. We discovered that the ablation of melibiose utilization was caused by three independent SNP mutations in D23580 that are shared across ST313 lineage 2, suggesting that the ability to catabolize this carbon source has been negatively selected during ST313 evolution. The data revealed a novel, to our knowledge, plasmid maintenance system involving a plasmid-encoded CysS cysteinyl-tRNA synthetase, highlighting the power of large-scale comparative multicondition analyses to pinpoint key phenotypic differences between bacterial pathovariants.