June 1, 2021  |  

Complete HIV-1 genomes from single molecules: Diversity estimates in two linked transmission pairs using clustering and mutual information.

We sequenced complete HIV-1 genomes from single molecules using Single Molecule, Real- Time (SMRT) Sequencing and derive de novo full-length genome sequences. SMRT sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. We detail mathematical techniques used in viral variant subspecies identification including clustering distance metrics and mutual information. Sequencing was performed in order to better understand the relationships between the specific sequences of transmitted viruses in linked transmission pairs. Samples representing HIV transmission pairs were selected from the Zambia Emory HIV Research Project (Lusaka, Zambia) and sequenced. We examine Single Genome Amplification (SGA) prepped samples and samples containing complex mixtures of genomes. Whole genome consensus estimates for each of the samples were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. SMRT Sequencing data contained multiple full-length (greater than 9 kb) continuous reads for each sample. Simple whole genome consensus estimates easily identified transmission pairs. The clustering of the genome reads showed diversity differences between the samples, allowing us to characterize the diversity of the individual quasi-species comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. The results open up the potential for reference-agnostic and cost effective full genome sequencing of HIV-1.


June 1, 2021  |  

Rapid sequencing of HIV-1 genomes as single molecules from simple and complex samples.

Background: To better understand the relationships among HIV-1 viruses in linked transmission pairs, we sequenced several samples representing HIV transmission pairs from the Zambia Emory HIV Research Project (Lusaka, Zambia) using Single Molecule, Real-Time (SMRT) Sequencing. Methods: Single molecules were sequenced as full-length (9.6 kb) amplicons directly from PCR products without shearing. This resulted in multiple, fully-phased, complete HIV-1 genomes for each patient. We examined Single Genome Amplification (SGA) prepped samples, as well as samples containing complex mixtures of genomes. We detail mathematical techniques used in viral variant subspecies identification, including clustering distance metrics and mutual information, which were used to derive multiple de novo full-length genome sequences for each patient. Whole genome consensus estimates for each sample were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV-1 genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. Results: SMRT Sequencing data contained multiple full-length (>9 kb) continuous reads for each sample. Simple whole-genome consensus estimates easily identified transmission pairs. Clustering of genome reads showed diversity differences between samples, allowing characterization of the quasi-species diversity comprising the patient viral populations across the full genome. Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. The results open up the potential for reference-agnostic and cost effective full genome sequencing of HIV-1.


June 1, 2021  |  

Genome assembly strategies of the recent polyploid, Coffea arabica.

Arabica coffee, revered for its taste and aroma, has a complex genome. It is an allotetraploid (2n=4x=44) with a genome size of approximately 1.3 Gb, derived from the recent (< 0.6 Mya) hybridization of two diploid progenitors (2n=2x=22), C. canephora (710 Mb) and C. eugenioides (670 Mb). Both parental species diverged recently (< 4.2Mya) and their genomes are highly homologous. To facilitate assembly, a dihaploid plant was chosen for sequencing. Initial genome assembly attempts with short read data produced an assembly covering 1,031 Mb of the C. arabica genome with a contig L50 of 9kb. By implementation of long read PacBio at greater than 50x coverage and cutting-edge PacBio software, a de novo PacBio-only genome assembly was constructed that covers 1,042 Mb of the genome with an L50 of 267 kb. The two assemblies were assessed and compared to determine gene content, chimeric regions, and the ability to separate the parental genomes. A genetic map that contains 600 SSRs is being used for anchoring the contigs and improve the sub-genome differentiation together with the search of sub-genome specific SNPs. PacBio transcriptome sequencing is currently being added to finalize gene annotation of the polished assembly. The finished genome assembly will be used to guide re-sequencing assemblies of parental genomes (C. canephora and C. eugenioides) as well as a template for GBS analysis and whole genome re-sequencing of a set of C. arabica accessions representative of the species diversity. The obtained data will provide powerful genomic tools to enable more efficient coffee breeding strategies for this crop, which is highly susceptible to climate change and is the main source of income for millions of small farmers in producing countries.


June 1, 2021  |  

High-accuracy, single-base resolution of near-full-length HIV genomes.

Background: The HIV-1 proviral reservoir is incredibly stable, even while undergoing antiretroviral therapy, and is seen as the major barrier to HIV-1 eradication. Identifying and comprehensively characterizing this reservoir will be critical to achieving an HIV cure. Historically, this has been a tedious and labor intensive process, requiring high-replicate single-genome amplification reactions, or overlapping amplicons that are then reconstructed into full-length genomes by algorithmic imputation. Here, we present a deep sequencing and analysis method able to determine the exact identity and relative abundances of near-full-length HIV genomes from samples containing mixtures of genomes without shearing or complex bioinformatic reconstruction. Methods: We generated clonal near-full-length (~9 kb) amplicons derived from single genome amplification (SGA) of primary proviral isolates or PCR of well-documented control strains. These clonal products were mixed at various abundances and sequenced as near-full-length (~9 kb) amplicons without shearing. Each mixture yielded many near-full-length HIV-1 reads. Mathematical analysis techniques resolved the complex mixture of reads into estimates of distinct near-full-length viral genomes with their relative abundances. Results: Single Molecule, Real-Time (SMRT) Sequencing data contained near-full-length (~9 kb) continuous reads for each sample including some runs with greater than 10,000 near-full-length-genome reads in a three-hour sequencing run. Our methods correctly recapitulated exactly the originating genomes at a single-base resolution and their relative abundances in both mixtures of clonal controls and SGAs, and these results were validated using independent sequencing methods. Correct resolution was achieved even when genomes differed only by a single base. Minor abundances of 5% were reliably detected. Conclusions: SMRT Sequencing yields long-read sequencing results from individual DNA molecules, a rapid time-to-result. The single-molecule, full-length nature of this sequencing method allows us to estimate variant subspecies and relative abundances with single-nucleotide resolution. This method allows for reference-agnostic and cost-effective full-genome sequencing of HIV-1, which could both further our understanding of latent infection and develop novel and improved tools for quantifying HIV provirus, which will be critical to cure HIV.


April 21, 2020  |  

A robust benchmark for germline structural variant detection

New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls =50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90 % of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3 %, and genotype concordance with manual curation was >98.7 %. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.


April 21, 2020  |  

Dual Role of gnaA in Antibiotic Resistance and Virulence in Acinetobacter baumannii.

Acinetobacter baumannii is an important Gram-negative pathogen in hospital-related infections. However, treatment options for A. baumannii infections have become limited due to multidrug resistance. Bacterial virulence is often associated with capsule genes found in the K locus, many of which are essential for biosynthesis of the bacterial envelope. However, the roles of other genes in the K locus remain largely unknown. From an in vitro evolution experiment, we obtained an isolate of the virulent and multidrug-resistant A. baumannii strain MDR-ZJ06, called MDR-ZJ06M, which has an insertion by the ISAba16 transposon in gnaA (encoding UDP-N-acetylglucosamine C-6 dehydrogenase), a gene found in the K locus. The isolate showed an increased resistance toward tigecycline, whereas the MIC decreased in the case of carbapenems, cephalosporins, colistin, and minocycline. By using knockout and complementation experiments, we demonstrated that gnaA is important for the synthesis of lipooligosaccharide and capsular polysaccharide and that disruption of the gene affects the morphology, drug susceptibility, and virulence of the pathogen.Copyright © 2019 American Society for Microbiology.


April 21, 2020  |  

A critical comparison of technologies for a plant genome sequencing project.

A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates.Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA compute requirements and sequencing costs.The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers. © The Author(s) 2019. Published by Oxford University Press.


April 21, 2020  |  

Rapid and Focused Maturation of a VRC01-Class HIV Broadly Neutralizing Antibody Lineage Involves Both Binding and Accommodation of the N276-Glycan.

The VH1-2 restricted VRC01-class of antibodies targeting the HIV envelope CD4 binding site are a major focus of HIV vaccine strategies. However, a detailed analysis of VRC01-class antibody development has been limited by the rare nature of these responses during natural infection and the lack of longitudinal sampling of such responses. To inform vaccine strategies, we mapped the development of a VRC01-class antibody lineage (PCIN63) in the subtype C infected IAVI Protocol C neutralizer PC063. PCIN63 monoclonal antibodies had the hallmark VRC01-class features and demonstrated neutralization breadth similar to the prototype VRC01 antibody, but were 2- to 3-fold less mutated. Maturation occurred rapidly within ~24 months of emergence of the lineage and somatic hypermutations accumulated at key contact residues. This longitudinal study of broadly neutralizing VRC01-class antibody lineage reveals early binding to the N276-glycan during affinity maturation, which may have implications for vaccine design.Copyright © 2019 The Authors. Published by Elsevier Inc. All rights reserved.


April 21, 2020  |  

Genome of the Komodo dragon reveals adaptations in the cardiovascular and chemosensory systems of monitor lizards.

Monitor lizards are unique among ectothermic reptiles in that they have high aerobic capacity and distinctive cardiovascular physiology resembling that of endothermic mammals. Here, we sequence the genome of the Komodo dragon Varanus komodoensis, the largest extant monitor lizard, and generate a high-resolution de novo chromosome-assigned genome assembly for V. komodoensis using a hybrid approach of long-range sequencing and single-molecule optical mapping. Comparing the genome of V. komodoensis with those of related species, we find evidence of positive selection in pathways related to energy metabolism, cardiovascular homoeostasis, and haemostasis. We also show species-specific expansions of a chemoreceptor gene family related to pheromone and kairomone sensing in V. komodoensis and other lizard lineages. Together, these evolutionary signatures of adaptation reveal the genetic underpinnings of the unique Komodo dragon sensory and cardiovascular systems, and suggest that selective pressure altered haemostasis genes to help Komodo dragons evade the anticoagulant effects of their own saliva. The Komodo dragon genome is an important resource for understanding the biology of monitor lizards and reptiles worldwide.


September 22, 2019  |  

ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies.

Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences’ own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.


September 22, 2019  |  

Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.


September 22, 2019  |  

Accurate characterization of the IFITM locus using MiSeq and PacBio sequencing shows genetic variation in Galliformes.

Interferon inducible transmembrane (IFITM) proteins are effectors of the immune system widely characterized for their role in restricting infection by diverse enveloped and non-enveloped viruses. The chicken IFITM (chIFITM) genes are clustered on chromosome 5 and to date four genes have been annotated, namely chIFITM1, chIFITM3, chIFITM5 and chIFITM10. However, due to poor assembly of this locus in the Gallus Gallus v4 genome, accurate characterization has so far proven problematic. Recently, a new chicken reference genome assembly Gallus Gallus v5 was generated using Sanger, 454, Illumina and PacBio sequencing technologies identifying considerable differences in the chIFITM locus over the previous genome releases.We re-sequenced the locus using both Illumina MiSeq and PacBio RS II sequencing technologies and we mapped RNA-seq data from the European Nucleotide Archive (ENA) to this finalized chIFITM locus. Using SureSelect probes capture probes designed to the finalized chIFITM locus, we sequenced the locus of a different chicken breed, namely a White Leghorn, and a turkey.We confirmed the Gallus Gallus v5 consensus except for two insertions of 5 and 1 base pair within the chIFITM3 and B4GALNT4 genes, respectively, and a single base pair deletion within the B4GALNT4 gene. The pull down revealed a single amino acid substitution of A63V in the CIL domain of IFITM2 compared to Red Jungle fowl and 13, 13 and 11 differences between IFITM1, 2 and 3 of chickens and turkeys, respectively. RNA-seq shows chIFITM2 and chIFITM3 expression in numerous tissue types of different chicken breeds and avian cell lines, while the expression of the putative chIFITM1 is limited to the testis, caecum and ileum tissues.Locus resequencing using these capture probes and RNA-seq based expression analysis will allow the further characterization of genetic diversity within Galliformes.


September 22, 2019  |  

Loss of stomach, loss of appetite? Sequencing of the ballan wrasse (Labrus bergylta) genome and intestinal transcriptomic profiling illuminate the evolution of loss of stomach function in fish.

The ballan wrasse (Labrus bergylta) belongs to a large teleost family containing more than 600 species showing several unique evolutionary traits such as lack of stomach and hermaphroditism. Agastric fish are found throughout the teleost phylogeny, in quite diverse and unrelated lineages, indicating stomach loss has occurred independently multiple times in the course of evolution. By assembling the ballan wrasse genome and transcriptome we aimed to determine the genetic basis for its digestive system function and appetite regulation. Among other, this knowledge will aid the formulation of aquaculture diets that meet the nutritional needs of agastric species.Long and short read sequencing technologies were combined to generate a ballan wrasse genome of 805 Mbp. Analysis of the genome and transcriptome assemblies confirmed the absence of genes that code for proteins involved in gastric function. The gene coding for the appetite stimulating protein ghrelin was also absent in wrasse. Gene synteny mapping identified several appetite-controlling genes and their paralogs previously undescribed in fish. Transcriptome profiling along the length of the intestine found a declining expression gradient from the anterior to the posterior, and a distinct expression profile in the hind gut.We showed gene loss has occurred for all known genes related to stomach function in the ballan wrasse, while the remaining functions of the digestive tract appear intact. The results also show appetite control in ballan wrasse has undergone substantial changes. The loss of ghrelin suggests that other genes, such as motilin, may play a ghrelin like role. The wrasse genome offers novel insight in to the evolutionary traits of this large family. As the stomach plays a major role in protein digestion, the lack of genes related to stomach digestion in wrasse suggests it requires formulated diets with higher levels of readily digestible protein than those for gastric species.


September 22, 2019  |  

A manually annotated Actinidia chinensis var. chinensis (kiwifruit) genome highlights the challenges associated with draft genomes and gene prediction in plants.

Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, and predicted genes without quality confidence measures. The current Actinidia chinensis (kiwifruit) ‘Hongyang’ draft genome has 164 Mb of sequences unassigned to pseudo-chromosomes, and omissions have been identified in the gene models.A second genome of an A. chinensis (genotype Red5) was fully sequenced. This new sequence resulted in a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed a considerable number of translocation events have occurred following a whole genome duplication (WGD) event some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation, using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named and tagged for quality of evidential support. Of these 3114 (9.4%) were identical to a protein within ‘Hongyang’ The Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms. However, as most computationally predicted Red5 models required manual re-annotation this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned ‘Hort16A’ cDNAs and comparing with the predicted protein models for Red5 and both the original ‘Hongyang’ assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs had a match with 90% identity or better to the original and revised ‘Hongyang’ annotation, respectively, compared with 90.9% to the Red5 models.Our study highlights the need to take a cautious approach to draft genomes and computationally predicted genes. Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models enabling improvement of computational prediction. This utility was especially relevant for certain types of gene families such as the EXPANSIN like genes. Finally, this high quality gene set will supply the kiwifruit and general plant community with a new tool for genomics and other comparative analysis.


September 22, 2019  |  

SvABA: genome-wide detection of structural variants and indels by local assembly.

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50-300 bp) SVs.© 2018 Wala et al.; Published by Cold Spring Harbor Laboratory Press.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.