Pacbio reads Archives - Page 48 of 53

July 7, 2019

Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.
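The idea of matching reads to haplotypes while penalizing corrections can be illustrated with a small sketch. It assumes a minimum error correction (MEC) style score, which is one common way to formalize this penalty; the abstract does not say which objective each of the three tools uses. The names `Fragment`, `mismatches` and `mec_score`, and the toy data, are hypothetical.

```python
from typing import Dict, List

# A read fragment maps SNP index -> observed allele (0 = reference, 1 = alternative).
# This encoding is an assumption made for the illustration.
Fragment = Dict[int, int]


def mismatches(fragment: Fragment, haplotype: List[int]) -> int:
    """Count covered SNP positions where the fragment disagrees with the haplotype."""
    return sum(1 for pos, allele in fragment.items() if haplotype[pos] != allele)


def mec_score(fragments: List[Fragment], haplotypes: List[List[int]]) -> int:
    """Assign each fragment to its closest haplotype and sum the corrections needed."""
    return sum(min(mismatches(f, h) for h in haplotypes) for f in fragments)


# Toy tetraploid example: four candidate haplotypes over five SNP positions.
haplotypes = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
]
fragments = [
    {0: 0, 1: 0, 2: 1},  # agrees with the first haplotype
    {2: 0, 3: 1, 4: 1},  # agrees with the third haplotype
    {1: 1, 2: 1, 3: 1},  # needs one correction against its closest haplotype
]
score = mec_score(fragments, haplotypes)
print(score)
```

A lower score means the candidate haplotypes explain the reads with fewer allele corrections. With more haplotypes per individual, more candidate sets reach similar scores, which is one way to see why the problem gets harder as ploidy increases.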

July 7, 2019

ReMILO: reference assisted misassembly detection algorithm using short and long reads.

Contigs assembled from second-generation sequencing short reads may contain misassemblies, which complicate downstream analysis or even lead to incorrect results. Fortunately, as more and more sequenced species become available, it becomes possible to use the reference genome of a closely related species to detect misassemblies. In addition, long reads from third-generation sequencing technology are increasingly widely used, and can also help detect misassemblies. Here, we introduce ReMILO, a reference assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO aligns the contigs to long reads and finds their differences from the long reads to detect more misassemblies. In our performance test on short read assemblies of human chromosome 14 data, ReMILO can detect 41.8-77.9% of extensive misassemblies and 33.6-54.5% of local misassemblies. On hybrid short and long read assemblies of S. pastorianus data, ReMILO can also detect 60.6-70.9% of extensive misassemblies and 28.6-54.0% of local misassemblies. The ReMILO software can be downloaded for free under Artistic License 2.0 from this site: https://github.com/songc001/remilo. Contact: baoe@bjtu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

July 7, 2019

A high throughput screen for active human transposable elements.

Transposable elements (TEs) are mobile genetic sequences that randomly propagate within their host’s genome. This mobility has the potential to affect gene transcription and cause disease. However, TEs are technically challenging to identify, which complicates efforts to assess the impact of TE insertions on disease. Here we present a targeted sequencing protocol and computational pipeline to identify polymorphic and novel TE insertions using next-generation sequencing: TE-NGS. The method simultaneously targets the three subfamilies that are responsible for the majority of recent TE activity (L1HS, AluYa5/8, and AluYb8/9) thereby obviating the need for multiple experiments and reducing the amount of input material required. Here we describe the laboratory protocol and detection algorithm, and a benchmark experiment for the reference genome NA12878. We demonstrate a substantial enrichment for on-target fragments, and high sensitivity and precision to both reference and NA12878-specific insertions. We report 17 previously unreported loci for this individual which are supported by orthogonal long-read evidence, and we identify 1470 polymorphic and novel TEs in 12 additional samples that were previously undocumented in databases of insertion polymorphisms. We anticipate that future applications of TE-NGS alongside exome sequencing of patients with sporadic disease will reduce the number of unresolved cases, and improve estimates of the contribution of TEs to human genetic disease.

July 7, 2019

Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D.

Completion of eukaryal genomes can be a difficult task because of the highly repetitive sequences along the chromosomes and the short read lengths of second-generation sequencing. Saccharomyces cerevisiae strain CEN.PK113-7D, widely used as a model organism and a cell factory, was selected for this study to demonstrate the superior capability of very long sequence reads for de novo genome assembly. We generated long reads using two common third-generation sequencing technologies (Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio)) and used short reads obtained using Illumina sequencing for error correction. Assembly of the reads derived from all three technologies resulted in complete sequences for all 16 yeast chromosomes, as well as the mitochondrial chromosome, in one step. Further, we identified three types of DNA methylation (5mC, 4mC and 6mA). Comparison between the reference strain S288C and strain CEN.PK113-7D identified chromosomal rearrangements against a background of similar gene content between the two strains. We identified full-length transcripts through ONT direct RNA sequencing technology. This allows for the identification of transcriptional landscapes, including untranslated regions (UTRs) (5′ UTR and 3′ UTR) as well as differential gene expression quantification. About 91% of the predicted transcripts could be consistently detected across biological replicates grown either on glucose or ethanol. Direct RNA sequencing identified many polyadenylated non-coding RNAs, rRNAs, telomere-RNA, long non-coding RNA and antisense RNA. This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.

July 7, 2019

RepLong: de novo repeat identification using long read sequencing data.

The identification of repetitive elements is important in genome assembly and phylogenetic analyses. The existing de novo repeat identification methods that rely on short reads are ineffective at identifying long repeats. Since long reads are more likely to cover repeat regions completely, using long reads is more favorable for recognizing long repeats. In this study, we propose a novel de novo repeat element identification method named RepLong, based on PacBio long reads. Given that the reads mapped to the repeat regions overlap heavily with each other, the identification of repeat elements is equivalent to the discovery of consensus overlaps between reads, which can be further cast into a community detection problem in the network of read overlaps. In RepLong, we first construct a network of read overlaps based on pair-wise alignment of the reads, where each vertex indicates a read and an edge indicates a substantial overlap between the corresponding two reads. Secondly, the communities whose intra connectivity is greater than the inter connectivity are extracted based on network modularity optimization. Finally, representative reads in each community are extracted to form the repeat library. Comparison studies on Drosophila melanogaster and human long read sequencing data with genome-based and short-read-based methods demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower coverage data and serve as a complementary solution to the existing methods to promote the repeat identification performance on long-read sequencing data. The software of RepLong is freely available at https://github.com/ruiguo-bio/replong. Contact: ywsun@szu.edu.cn or zhuzx@szu.edu.cn. Supplementary data are available at Bioinformatics online.
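The overlap network and the modularity criterion described in the RepLong abstract can be sketched as follows. This is only an illustration under stated assumptions, not the RepLong implementation: the overlap tuples, the `min_overlap` threshold and the function names are hypothetical, the pairwise alignment step is taken as given, and the code only scores a candidate partition rather than optimizing over partitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_overlap_graph(
    overlaps: Iterable[Tuple[str, str, int]], min_overlap: int
) -> Dict[str, Set[str]]:
    """Vertices are reads; an edge joins two reads whose overlap is long enough."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read_a, read_b, length in overlaps:
        if read_a != read_b and length >= min_overlap:
            graph[read_a].add(read_b)
            graph[read_b].add(read_a)
    return dict(graph)


def modularity(graph: Dict[str, Set[str]], communities: List[Set[str]]) -> float:
    """Newman modularity Q = sum_c [ L_c / m - (d_c / 2m)^2 ] for a partition."""
    m = sum(len(neighbors) for neighbors in graph.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        # Each intra-community edge is seen from both endpoints, hence the / 2.
        intra = sum(1 for u in community for v in graph.get(u, ()) if v in community) / 2
        degree = sum(len(graph.get(u, ())) for u in community)
        q += intra / m - (degree / (2 * m)) ** 2
    return q


# Toy input: (read, read, overlap length in bp). Two dense groups joined by one edge.
overlaps = [
    ("r1", "r2", 3000), ("r2", "r3", 2500), ("r1", "r3", 2800),
    ("r4", "r5", 3100), ("r5", "r6", 2700), ("r4", "r6", 2900),
    ("r3", "r4", 1200),
    ("r1", "r6", 200),  # too short, dropped by the threshold
]
graph = build_overlap_graph(overlaps, min_overlap=1000)
q_split = modularity(graph, [{"r1", "r2", "r3"}, {"r4", "r5", "r6"}])
q_single = modularity(graph, [{"r1", "r2", "r3", "r4", "r5", "r6"}])
print(round(q_split, 4), round(q_single, 4))
```

The partition into two groups scores higher than lumping all reads together, which is the sense in which communities have greater intra connectivity than inter connectivity. A modularity optimizer would search over partitions to maximize this score.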

July 7, 2019

Current advances in genome sequencing of common wheat and its ancestral species

Common wheat is an important and widely cultivated food crop throughout the world. Much progress has been made in wheat genome sequencing in the last decade. Starting from the sequencing of single chromosomes and chromosome arms, whole genome sequences of common wheat and its diploid and tetraploid ancestors have been decoded as sequencing and assembly technologies developed. In this review, we give a brief summary of international progress in wheat genome sequencing, and mainly focus on reviewing the effort and contributions made by Chinese scientists.

July 7, 2019

Complete genome sequence of Escherichia coli 81009, a representative of the sequence type 131 C1-M27 clade with a multidrug-resistant phenotype.

The sequence type 131 (ST131)-H30 clone is responsible for a significant proportion of multidrug-resistant extraintestinal Escherichia coli infections. Recently, the C1-M27 clade of ST131-H30, associated with blaCTX-M-27, has emerged. The complete genome sequence of E. coli isolate 81009 belonging to this clone, previously used during the development of ST131-specific monoclonal antibodies, is reported here. Copyright © 2018 Mutti et al.

July 7, 2019

Complete genome sequence of a type strain of Mycobacterium abscessus subsp. bolletii, a member of the Mycobacterium abscessus complex.

Mycobacterium abscessus subsp. bolletii is a rapidly growing mycobacterial organism for which the taxonomy is unclear. Here, we report the complete genome sequence of a Mycobacterium abscessus subsp. bolletii type strain. This sequence will provide essential information for future taxonomic and comparative genome studies of these mycobacteria.

July 7, 2019

Draft genome sequence of an active heterotrophic nitrifier-denitrifier, Cupriavidus pauculus UM1.

Here, we present the draft genome sequence of Cupriavidus pauculus UM1, a metal-resistant heterotrophic nitrifier-denitrifier capable of synthesizing nitrite from pyruvic oxime. The size of the genome is 7,402,815 bp with a GC content of 64.8%. This draft assembly consists of 38 scaffolds. Copyright © 2018 Putonti et al.

July 7, 2019

Complete genome sequence of Pseudomonas sp. strain NC02, isolated from soil.

We report here the complete genome sequence of Pseudomonas sp. strain NC02, isolated from soil in eastern Massachusetts. We assembled PacBio reads into a single closed contig with 132× mean coverage and then polished this contig using Illumina MiSeq reads, yielding a 6,890,566-bp sequence with 61.1% GC content. Copyright © 2018 Cerra et al.
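The two summary statistics quoted in announcements like this one, GC content and mean coverage, are simple ratios. The sketch below shows one way to compute them; the function names and toy inputs are hypothetical, and defining mean coverage as total read bases divided by assembly length is an assumption, since the abstract does not state how coverage was calculated.

```python
from typing import Iterable


def gc_content(sequence: str) -> float:
    """Percentage of G and C among called bases (A, C, G, T); N and gaps are ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no called bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def mean_coverage(read_lengths: Iterable[int], genome_length: int) -> float:
    """Total sequenced bases divided by genome length (a simplifying assumption)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# Toy data, not taken from any of the genomes described here.
toy_contig = "ATGCGCGCAT"
toy_gc = gc_content(toy_contig)
toy_cov = mean_coverage([4000, 6000, 10000], genome_length=10000)
print(toy_gc, toy_cov)
```

On a real assembly the same functions would be applied to the contig sequence read from a FASTA file and to the lengths of the reads used for assembly.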

July 7, 2019

Complete genome sequence of Escherichia coli ML35.

We report here the complete genome sequence of Escherichia coli strain ML35. We assembled PacBio reads into a single closed contig with 169× mean coverage and then polished this contig using Illumina MiSeq reads, yielding a 4,918,774-bp sequence with 50.8% GC content. Copyright © 2018 Casale et al.

July 7, 2019

De novo genome assembly of a Plasmodium falciparum NF54 clone using Single-Molecule Real-Time Sequencing.

Plasmodium falciparum is the species of human malaria parasite that causes the most severe form of the disease. Here, we used single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) to sequence, assemble de novo, and annotate the genome of a P. falciparum NF54 clone. Copyright © 2018 Bryant et al.

July 7, 2019

Hercules: a profile HMM-based hybrid error correction algorithm for long reads.

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases researchers often combine both technologies and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform’s error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms; and among those, it achieves the highest accuracy.

July 7, 2019

Gapless genome assembly of the potato and tomato early blight pathogen Alternaria solani.

The Alternaria genus consists of saprophytic fungi as well as plant-pathogenic species that have significant economic impact. To date, the genomes of multiple Alternaria species have been sequenced. These studies have yielded valuable data for molecular studies on Alternaria fungi. However, most of the current Alternaria genome assemblies are highly fragmented, thereby hampering the identification of genes that are involved in causing disease. Here, we report a gapless genome assembly of A. solani, the causal agent of early blight in tomato and potato. The genome assembly is a significant step toward a better understanding of pathogenicity of A. solani.

July 7, 2019

Complete genome sequence of the marine Rhodococcus sp. H-CA8f isolated from Comau fjord in Northern Patagonia, Chile

Rhodococcus sp. H-CA8f was isolated from marine sediments obtained from the Comau fjord, located in Northern Chilean Patagonia. Whole-genome sequencing was achieved using the PacBio RS II platform, yielding one closed, complete chromosome of 6.19 Mbp with a 62.45% G + C content. The chromosome harbours several metabolic pathways providing a wide catabolic potential, where the upper biphenyl route is described. Also, Rhodococcus sp. H-CA8f bears one linear mega-plasmid of 301 Kbp with a 62.34% G + C content, where genomic analyses demonstrated that it is constituted mostly by putative ORFs with unknown functions, representing a novel genetic feature. These genetic characteristics provide relevant insights regarding Chilean marine actinobacterial strains.
