During the past decade, the search for pathogenic mutations in rare human genetic diseases has involved huge efforts to sequence coding regions, or the entire genome, using massively parallel short-read sequencers. However, the approximate current diagnostic rate is <50% using these approaches, and there remain many rare genetic diseases with unknown cause. There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome that are difficult to sequence using conventional technologies (e.g., tandem-repeat expansion or complex chromosomal structural aberrations). Despite the drawbacks of high cost and a shortage of standard analytical methods, several studies have analyzed pathogenic changes in the genome using long-read sequencers. The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases. Such approaches may also be applied to molecular diagnosis and therapeutic strategies for patients with genetic diseases in the future.
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
A Gram-stain-negative, rod-shaped and red-pigmented strain, HME7025T, was isolated from freshwater sampled in the Republic of Korea. Phylogenetic analysis based on its 16S rRNA gene sequence revealed that strain HME7025T formed a lineage within the family Cytophagaceae of the phylum Bacteroidetes. Strain HME7025T was closely related to the genera Pseudarcicella, Arcicella and Flectobacillus. The 16S rRNA gene sequence similarity values of strain HME7025T were under 94.5 % to its closest phylogenetic neighbours. The major fatty acids of strain HME7025T were iso-C15 : 0 (41.9 %), summed feature 3 (comprising C16 : 1ω7c and/or C16 : 1ω6c; 12.2 %) and anteiso-C15 : 0 (10.8 %). The major respiratory quinone was menaquinone-7. The major polar lipids were phosphatidylethanolamine, two unidentified aminophospholipids and one unidentified polar lipid. The DNA G+C content of strain HME7025T was 37.9 mol%. On the basis of the evidence presented in this study, strain HME7025T represents a novel species of a novel genus within the family Cytophagaceae, for which the name Allopseudarcicella aquatilis gen. nov., sp. nov. is proposed. The type strain is HME7025T (=KCTC 23617T=CECT 7957T).
Pacific Biosciences’ SMRT sequencing method was used to extend the sequence of HLA-A*02:13. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Transcriptional initiation of a small RNA, not R-loop stability, dictates the frequency of pilin antigenic variation in Neisseria gonorrhoeae.
Neisseria gonorrhoeae, the sole causative agent of gonorrhea, constitutively undergoes diversification of the Type IV pilus. Gene conversion occurs between one of the several donor silent copies located in distinct loci and the recipient pilE gene, encoding the major pilin subunit of the pilus. A guanine quadruplex (G4) DNA structure and a cis-acting sRNA (G4-sRNA) are located upstream of the pilE gene and both are required for pilin antigenic variation (Av). We show that reduced sRNA transcription lowers pilin Av frequencies. Extended transcriptional elongation is not required for Av, since limiting the transcript to 32 nt allows for normal Av frequencies. Using chromatin immunoprecipitation (ChIP) assays, we show that cellular G4s are less abundant when sRNA transcription is lower. In addition, using ChIP, we demonstrate that the G4-sRNA forms a stable RNA:DNA hybrid (R-loop) with its template strand. However, modulating R-loop levels by controlling RNase HI expression does not alter G4 abundance quantified through ChIP. Since pilin Av frequencies were not altered when modulating R-loop levels by controlling RNase HI expression, we conclude that transcription of the sRNA is necessary, but stable R-loops are not required to promote pilin Av. © 2019 John Wiley & Sons Ltd.
Insights into the bacterial species and communities of a full-scale anaerobic/anoxic/oxic wastewater treatment plant by using third-generation sequencing.
For the first time, a full-length 16S rRNA sequencing method was applied to characterize the bacterial species and communities of a full-scale wastewater treatment plant using an anaerobic/anoxic/oxic (A/A/O) process in Wuhan, China. The compositions of the bacteria at phylum and class levels in the activated sludge were similar to those revealed by Illumina MiSeq sequencing. At genus and species levels, third-generation sequencing showed great merits and accuracy. Typical functional taxa classified as ammonia-oxidizing bacteria (AOB), nitrite-oxidizing bacteria (NOB), denitrifying bacteria (DB), anaerobic ammonium oxidation bacteria (ANAMMOXB) and polyphosphate-accumulating organisms (PAOs) were present, namely Nitrosomonas (1.11%), Nitrospira (3.56%), Pseudomonas (3.88%), Planctomycetes (13.80%) and Comamonadaceae (1.83%), respectively. Pseudomonas (3.88%) and Nitrospira (3.56%) were the two most predominant genera, mainly comprising Pseudomonas extremaustralis (1.69%) and Nitrospira defluvii (3.13%), respectively. Bacteria related to nitrogen and phosphorus removal were identified at the species level. The predicted functions indicated that the A/A/O process was efficient regarding nitrogen and organics removal. Copyright © 2019 The Society for Biotechnology, Japan. Published by Elsevier B.V. All rights reserved.
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics. © 2019 The Royal Entomological Society.
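One widely used contiguity metric for draft assemblies is N50. The abstract above does not say which metrics the review covers, so the following is only a minimal Python sketch of the standard N50 definition. The function name `n50` and the example contig lengths are illustrative assumptions, not taken from the review.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units), for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 describes contiguity only. It says nothing about correctness or gene-space completeness, so it is usually reported alongside other measures.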
Relative Performance of MinION (Oxford Nanopore Technologies) versus Sequel (Pacific Biosciences) Third-Generation Sequencing Instruments in Identification of Agricultural and Forest Fungal Pathogens.
Culture-based molecular identification methods have revolutionized detection of pathogens, yet these methods are slow and may yield inconclusive results from environmental materials. The second-generation sequencing tools have much-improved precision and sensitivity of detection, but these analyses are costly and may take several days to months. Of the third-generation sequencing techniques, the portable MinION device (Oxford Nanopore Technologies) has received much attention because of its small size and possibility of rapid analysis at reasonable cost. Here, we compare the relative performances of two third-generation sequencing instruments, MinION and Sequel (Pacific Biosciences), in identification and diagnostics of fungal and oomycete pathogens from conifer (Pinaceae) needles and potato (Solanum tuberosum) leaves and tubers. We demonstrate that the Sequel instrument is efficient for metabarcoding of complex samples, whereas MinION is not suited for this purpose due to a high error rate and multiple biases. However, we find that MinION can be utilized for rapid and accurate identification of dominant pathogenic organisms and other associated organisms from plant tissues following both amplicon-based and PCR-free metagenomics approaches. Using the metagenomics approach with shortened DNA extraction and incubation times, we performed the entire MinION workflow, from sample preparation through DNA extraction, sequencing, bioinformatics, and interpretation, in 2.5 h. We advocate the use of MinION for rapid diagnostics of pathogens and potentially other organisms, but care needs to be taken to control or account for multiple potential technical biases. IMPORTANCE Microbial pathogens cause enormous losses to agriculture and forestry, but current combined culturing- and molecular identification-based detection methods are too slow for rapid identification and application of countermeasures. 
Here, we develop new and rapid protocols for Oxford Nanopore MinION-based third-generation diagnostics of plant pathogens that greatly improve the speed of diagnostics. However, due to high error rate and technical biases in MinION, the Pacific BioSciences Sequel platform is more useful for in-depth amplicon-based biodiversity monitoring (metabarcoding) from complex environmental samples. Copyright © 2019 American Society for Microbiology.
Whole Genome Sequencing and Analysis of Chlorimuron-Ethyl Degrading Bacteria Klebsiella pneumoniae 2N3.
Klebsiella pneumoniae 2N3 is a strain of gram-negative bacteria that can degrade chlorimuron-ethyl and grow with chlorimuron-ethyl as the sole nitrogen source. The complete genome of Klebsiella pneumoniae 2N3 was sequenced using third generation high-throughput DNA sequencing technology. The genomic size of strain 2N3 was 5.32 Mb with a GC content of 57.33% and a total of 5156 coding genes and 112 non-coding RNAs predicted. Two hydrolases expressed by open reading frames (ORFs) 0934 and 0492 were predicted and experimentally confirmed by gene knockout to be involved in the degradation of chlorimuron-ethyl. Strains of ΔORF 0934, ΔORF 0492, and wild type (WT) reached their highest growth rates after 8-10 hours in incubation. The degradation rates of chlorimuron-ethyl by both ΔORF 0934 and ΔORF 0492 decreased in comparison to the WT during the first 8 hours in culture by 25.60% and 24.74%, respectively, while strains ΔORF 0934, ΔORF 0492, and the WT reached the highest degradation rates of chlorimuron-ethyl in 36 hours of 74.56%, 90.53%, and 95.06%, respectively. This study provides scientific evidence to support the application of Klebsiella pneumoniae 2N3 in bioremediation to control environmental pollution.
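Summary figures such as genome size and GC content can be derived directly from the assembled sequences. The sketch below shows one way to compute them in Python. The function name `genome_size_and_gc` and the toy input are hypothetical, and the choice to exclude ambiguous bases (for example N) from the GC denominator is an assumption, not the authors' stated procedure.

```python
def genome_size_and_gc(sequences):
    """Return (total_length_bp, gc_percent) for an iterable of DNA strings.

    Each string is one replicon or contig. Ambiguous bases count towards
    the total length but are excluded from the GC percentage (assumption).
    """
    total = 0
    gc = 0
    acgt = 0
    for seq in sequences:
        s = seq.upper()
        total += len(s)
        g_c = s.count("G") + s.count("C")
        gc += g_c
        acgt += g_c + s.count("A") + s.count("T")
    if acgt == 0:
        raise ValueError("no unambiguous bases found")
    return total, 100.0 * gc / acgt


if __name__ == "__main__":
    # Toy sequences for illustration only, not real genome data.
    size, gc_percent = genome_size_and_gc(["GGCC", "AATTNN"])
    print(size, round(gc_percent, 2))
```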
A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates. Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA compute requirements and sequencing costs. The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers. © The Author(s) 2019. Published by Oxford University Press.
A new full-length circular DNA sequencing method for viral-sized genomes reveals that RNAi transgenic plants provoke a shift in geminivirus populations in the field.
We present a new method, CIDER-Seq (Circular DNA Enrichment sequencing) for the unbiased enrichment and long-read sequencing of viral-sized circular DNA molecules. We used CIDER-Seq to produce single-read full-length virus genomes for the first time. CIDER-Seq combines PCR-free virus enrichment with Single Molecule Real Time sequencing and a new sequence de-concatenation algorithm. We apply our technique to produce >1200 full-length, highly accurate geminivirus genomes from RNAi-transgenic and control plants in a field trial in Kenya. Using CIDER-Seq we can demonstrate for the first time that the expression of antiviral double-stranded RNA (dsRNA) in transgenic plants causes a consistent shift in virus populations towards species sharing low homology to the transgene-derived dsRNA. Our method and its application in an economically important crop plant open new possibilities in periodic virus sequence surveillance and accurate profiling of diverse circular DNA elements.
Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called ‘amplicon denoising’, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution.
Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate. In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed Escherichia coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains. There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Genome-Wide Screening for Enteric Colonization Factors in Carbapenem-Resistant ST258 Klebsiella pneumoniae.
A diverse, antibiotic-naive microbiota prevents highly antibiotic-resistant microbes, including carbapenem-resistant Klebsiella pneumoniae (CR-Kp), from achieving dense colonization of the intestinal lumen. Antibiotic-mediated destruction of the microbiota leads to expansion of CR-Kp in the gut, markedly increasing the risk of bacteremia in vulnerable patients. While preventing dense colonization represents a rational approach to reduce intra- and interpatient dissemination of CR-Kp, little is known about pathogen-associated factors that enable dense growth and persistence in the intestinal lumen. To identify genetic factors essential for dense colonization of the gut by CR-Kp, we constructed a highly saturated transposon mutant library with >150,000 unique mutations in an ST258 strain of CR-Kp and screened for in vitro growth and in vivo intestinal colonization in antibiotic-treated mice. Stochastic and partially reversible fluctuations in the representation of different mutations during dense colonization revealed the dynamic nature of intestinal microbial populations. We identified genes that are crucial for early and late stages of dense gut colonization and confirmed their role by testing isogenic mutants in in vivo competition assays with wild-type CR-Kp. Screening of the transposon library also identified mutations that enhanced in vivo CR-Kp growth. These newly identified colonization factors may provide novel therapeutic opportunities to reduce intestinal colonization by CR-Kp. IMPORTANCE Klebsiella pneumoniae is a common cause of bloodstream infections in immunocompromised and hospitalized patients, and over the last 2 decades, some strains have acquired resistance to nearly all available antibiotics, including broad-spectrum carbapenems. The U.S. Centers for Disease Control and Prevention has listed carbapenem-resistant K. pneumoniae (CR-Kp) as an urgent public health threat. 
Dense colonization of the intestine by CR-Kp and other antibiotic-resistant bacteria is associated with an increased risk of bacteremia. Reducing the density of gut colonization by CR-Kp is likely to reduce their transmission from patient to patient in health care facilities as well as systemic infections. How CR-Kp expands and persists in the gut lumen, however, is poorly understood. Herein, we generated a highly saturated mutant library in a multidrug-resistant K. pneumoniae strain and identified genetic factors that are associated with dense gut colonization by K. pneumoniae. This study sheds light on host colonization by K. pneumoniae and identifies potential colonization factors that contribute to high-density persistence of K. pneumoniae in the intestine. Copyright © 2019 Jung et al.
Amplification-free long-read sequencing of TCF4 expanded trinucleotide repeats in Fuchs Endothelial Corneal Dystrophy.
Amplification of a CAG trinucleotide motif (CTG18.1) within the TCF4 gene has been strongly associated with Fuchs Endothelial Corneal Dystrophy (FECD). Nevertheless, a small minority of clinically unaffected elderly patients who have expanded CTG18.1 sequences have been identified. To test the hypothesis that these patients are protected from FECD because their expanded CAG repeats contain interruptions, we utilized a combination of an amplification-free, long-read sequencing method and a new target-enrichment sequence analysis tool developed by Pacific Biosciences to interrogate the sequence structure of expanded repeats. The sequencing was successful in identifying a previously described interruption within an unexpanded allele and provided sequence data on expanded alleles greater than 2000 bases in length. The data revealed considerable heterogeneity in the size distribution of expanded repeats within each patient. Detailed analysis of the long sequence reads did not reveal any instances of interruptions to the expanded CAG repeats, but did reveal novel variants within the AGG repeats that flank the CAG repeats in two of the five samples from clinically unaffected patients with expansions. This first examination of the sequence structure of CAG repeats in CTG18.1 suggests that factors other than interruptions to the repeat structure account for the absence of disease in some elderly patients with repeat expansions in the TCF4 gene.
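To make the idea of an "interruption" concrete, the following Python sketch scans a read for a CAG tract and reports units that deviate from the motif. It is not the Pacific Biosciences analysis tool mentioned above, whose internals the abstract does not describe. The function name `scan_repeat`, its parameters, and the example read are hypothetical. The sketch assumes interruptions are single in-frame substitution units, so it will not handle insertions or deletions that shift the repeat frame.

```python
def scan_repeat(seq, motif="CAG", min_units=3):
    """Find a tandem-repeat tract and its single-unit interruptions.

    Returns (start, n_units, interruptions), or None if no run of at
    least `min_units` pure motif copies exists. `interruptions` is a list
    of (unit_index, unit_sequence) for non-motif units that are
    immediately followed by another motif copy.
    """
    seq = seq.upper()
    k = len(motif)
    start = seq.find(motif * min_units)
    if start == -1:
        return None
    # Split the read into motif-sized units, in frame with the tract start.
    units = [seq[p:p + k] for p in range(start, len(seq) - k + 1, k)]
    interruptions = []
    i = 0
    while i < len(units):
        if units[i] == motif:
            i += 1
        elif i + 1 < len(units) and units[i + 1] == motif:
            # A lone non-motif unit flanked by motif copies: an interruption.
            interruptions.append((i, units[i]))
            i += 1
        else:
            break  # the tract has ended
    return start, i, interruptions


if __name__ == "__main__":
    # Hypothetical read: CAG x4, one CAA interruption, CAG x3, AGG flank.
    read = "TT" + "CAG" * 4 + "CAA" + "CAG" * 3 + "AGGAGG"
    print(scan_repeat(read))
```

Raw long reads carry indel errors, so a practical analysis would more likely run on consensus sequences or use an alignment-based approach than exact in-frame matching.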