October 23, 2019  |  

ParLECH: Parallel Long-Read Error Correction with Hadoop

Long-read sequencing is emerging as a promising sequencing technology because it overcomes the short read-length limitation of second-generation sequencing, which has dominated the sequencing market in past years. However, it has substantially higher error rates than short-read sequencing (e.g., 13% vs. 0.1%), and its sequencing cost per base is typically higher. To address these limitations, we present a distributed hybrid error correction framework, called ParLECH, that is scalable and cost-efficient for PacBio long reads. To correct the errors in the long reads, ParLECH uses Illumina short reads, which offer a low error rate and high coverage at low cost. To efficiently analyze the high-throughput Illumina short reads, ParLECH is equipped with Hadoop and a distributed NoSQL system. To further improve the accuracy, ParLECH utilizes the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage in a path of the de Bruijn graph constructed from the Illumina short reads. We replace an error region in a long read with its corresponding widest path. Our experimental results show that ParLECH can handle large-scale real-world datasets in a scalable and accurate manner. Using ParLECH, we can process a 312 GB human genome PacBio dataset, with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours.
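The widest path idea described in this abstract can be illustrated on a single machine. The sketch below is not ParLECH's distributed implementation; it is a minimal Dijkstra-style variant, assuming coverage is stored per k-mer node. The function name `widest_path`, the toy k-mers, and the coverage values are hypothetical and chosen only for illustration.

```python
import heapq

def widest_path(graph, coverage, source, target):
    """Return (bottleneck, path) for the source->target path whose minimum
    k-mer coverage is as large as possible. Returns (0, []) if unreachable."""
    best = {source: coverage[source]}   # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-coverage[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt in graph.get(node, ()):
            # The width of a path is the smallest coverage seen along it.
            w = min(width, coverage[nxt])
            if w > best.get(nxt, 0):
                best[nxt] = w
                parent[nxt] = node
                heapq.heappush(heap, (-w, nxt))
    if target not in best:
        return 0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Toy de Bruijn graph (k = 3): two branches between ACG and TTA.
# The CGT/GTT branch has low coverage, as an erroneous branch might.
toy_graph = {
    "ACG": ["CGT", "CGA"],
    "CGT": ["GTT"], "GTT": ["TTA"],
    "CGA": ["GAT"], "GAT": ["ATT"], "ATT": ["TTA"],
}
toy_coverage = {"ACG": 30, "CGT": 4, "GTT": 5, "TTA": 28,
                "CGA": 25, "GAT": 22, "ATT": 27}

print(widest_path(toy_graph, toy_coverage, "ACG", "TTA"))
# (22, ['ACG', 'CGA', 'GAT', 'ATT', 'TTA'])
```

In this toy case the longer branch wins because its weakest k-mer (coverage 22) is still better supported than the weakest k-mer on the shorter branch (coverage 4).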


October 23, 2019  |  

A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions.

Tilapias are the second most farmed fishes in the world and a sustainable source of food. Like many other fish, tilapias are sexually dimorphic and sex is a commercially important trait in these fish. In this study, we developed a significantly improved assembly of the tilapia genome using the latest genome sequencing methods and show how it improves the characterization of two sex determination regions in two tilapia species. A homozygous clonal XX female Nile tilapia (Oreochromis niloticus) was sequenced to 44X coverage using Pacific Biosciences (PacBio) SMRT sequencing. Dozens of candidate de novo assemblies were generated and an optimal assembly (contig NG50 of 3.3Mbp) was selected using principal component analysis of likelihood scores calculated from several paired-end sequencing libraries. Comparison of the new assembly to the previous O. niloticus genome assembly reveals that recently duplicated portions of the genome are now well represented. The overall number of genes in the new assembly increased by 27.3%, including a 67% increase in pseudogenes. The new tilapia genome assembly correctly represents two recent vasa gene duplication events that have been verified with BAC sequencing. A total of 146Mbp of additional transposable element sequence is now assembled, a large proportion of which consists of recent insertions. Large centromeric satellite repeats are assembled and annotated in cichlid fish for the first time. Finally, the new assembly identifies the long-range structure of both a ~9Mbp XY sex determination region on LG1 in O. niloticus, and a ~50Mbp WZ sex determination region on LG3 in the related species O. aureus. This study highlights the use of long read sequencing to correctly assemble recent duplications and to characterize repeat-filled regions of the genome. The study serves as an example of the need for high quality genome assemblies and provides a framework for identifying sex determining genes in tilapia and related fish species.


October 23, 2019  |  

Chromosomal-level assembly of yellow catfish genome using third-generation DNA sequencing and Hi-C analysis.

The yellow catfish, Pelteobagrus fulvidraco, belonging to the Siluriformes order, is an economically important freshwater aquaculture fish species in Asia, especially in Southern China. The aquaculture industry has recently been facing tremendous challenges in germplasm degeneration and poor disease resistance. As the yellow catfish exhibits notable sex dimorphism in growth, with adult males about two- to three-fold bigger than females, the way in which the aquaculture industry takes advantage of such sex dimorphism is another challenge. To address these issues, a high-quality reference genome of the yellow catfish would be a very useful resource. To construct a high-quality reference genome for the yellow catfish, we generated 51.2 Gb short reads and 38.9 Gb long reads using Illumina and Pacific Biosciences (PacBio) sequencing platforms, respectively. The sequencing data were assembled into a 732.8 Mb genome assembly with a contig N50 length of 1.1 Mb. Additionally, we applied Hi-C technology to identify contacts among contigs, which were then used to assemble contigs into scaffolds, resulting in a genome assembly with 26 chromosomes and a scaffold N50 length of 25.8 Mb. Using 24,552 protein-coding genes annotated in the yellow catfish genome, the phylogenetic relationships of the yellow catfish with other teleosts showed that yellow catfish separated from the common ancestor of channel catfish ~81.9 million years ago. We identified 1,717 gene families to be expanded in the yellow catfish, and those gene families are mainly enriched in the immune system, signal transduction, glycosphingolipid biosynthesis, and fatty acid biosynthesis. Taking advantage of Illumina, PacBio, and Hi-C technologies, we constructed the first high-quality chromosome-level genome assembly for the yellow catfish P. fulvidraco. The genomic resources generated in this work not only offer a valuable reference genome for functional genomics studies of yellow catfish to decipher the economic traits and sex determination but also provide important chromosome information for genome comparisons in the wider evolutionary research community.
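The contig and scaffold N50 values quoted in this abstract follow the standard definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. The sketch below is a minimal illustration of that definition, not the authors' tooling. The function name `n50`, the optional `total` argument (which gives an NG50-style value when set to an estimated genome size), and the toy lengths are all assumptions made for the example.

```python
def n50(lengths, total=None):
    """Length L such that sequences of length >= L cover at least half of `total`.

    With total=None the assembly size is used (N50); passing an estimated
    genome size instead gives an NG50-style value. Returns 0 if the
    half-way point is never reached.
    """
    ordered = sorted(lengths, reverse=True)
    half = (total if total is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0

# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]   # assembly size = 290

print(n50(toy_contigs))             # 70  (80 + 70 = 150 >= 145)
print(n50(toy_contigs, total=400))  # 50  (80 + 70 + 50 = 200 >= 200)
```

Because N50 depends on the assembly's own size while NG50 depends on a fixed genome size estimate, NG50 is the safer number when comparing assemblies of different total length.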


September 22, 2019  |  

Transcriptome profiling using single-molecule direct RNA sequencing approach for in-depth understanding of genes in secondary metabolism pathways of Camellia sinensis.

Characteristic secondary metabolites, including flavonoids, theanine and caffeine, are important components of Camellia sinensis, and their biosynthesis has attracted widespread interest. Previous studies on the biosynthesis of these major secondary metabolites relied on next-generation sequencing technologies, which limited the accurate prediction of full-length (FL) splice isoforms. Herein, we applied single-molecule sequencing to pooled tea plant tissues, to provide a more complete transcriptome of C. sinensis. Moreover, we identified 94 FL transcripts and four alternative splicing events for enzyme-coding genes involved in the biosynthesis of flavonoids, theanine and caffeine. By comparing long-read isoforms with assembled transcripts, we improved the quality and accuracy of genes sequenced by short-read next-generation sequencing technology. The resulting FL transcripts, together with the improved assembled transcripts and identified alternative splicing events, enhance our understanding of genes involved in the biosynthesis of characteristic secondary metabolites in C. sinensis.


September 22, 2019  |  

Single Molecule Sequencing: new outlooks for solving genome assembly and transcripts identification challenges

In this review, we introduce a novel sequencing technology named Single Molecule Real Time sequencing. Also called Single Molecule Sequencing, as it does not require any amplification, this new technology is able to produce much longer reads than previous NGS technologies such as Illumina. This improvement in read size, which can reach 150-fold, will solve many challenges caused by current NGS technologies. Short NGS reads, reaching a maximum size of 300 bp, make it hard to reconstitute a whole genome and always lead to fragmented genome assemblies. It is also difficult to correctly infer transcript quantification and identification when isoform diversity is high. Despite their higher error rate, long reads have shown very promising results on these current issues. We show that longer reads can produce less fragmented assemblies of better quality, and can also sequence mRNA from start to end, making it much easier to infer correct transcript quantification and even allowing the discovery of new intron structures and thus new isoforms.


September 22, 2019  |  

Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics.

Short read massive parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio’s single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we capsulize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.


September 22, 2019  |  

Long-read based assembly and annotation of a Drosophila simulans genome

Long-read sequencing technologies enable high-quality, contiguous genome assemblies. Here we used SMRT sequencing to assemble the genome of a Drosophila simulans strain originating from Madagascar, the ancestral range of the species. We generated 8 Gb of raw data (~50x coverage) with a mean read length of 6,410 bp, an NR50 of 9,125 bp and the longest subread at 49 kb. We benchmarked six different assemblers and merged the best two assemblies from Canu and Falcon. Our final assembly was 127.41 Mb with an N50 of 5.38 Mb and 305 contigs. We anchored more than 4 Mb of novel sequence to the major chromosome arms, and significantly improved the assembly of peri-centromeric and telomeric regions. Finally, we performed full-length transcript sequencing and used these data in conjunction with short-read RNAseq data to annotate 13,422 genes in the genome, improving the annotation in regions with complex, nested gene structures.


September 22, 2019  |  

Transcriptional diversity during lineage commitment of human blood progenitors.

Blood cells derive from hematopoietic stem cells through stepwise fating events. To characterize gene expression programs driving lineage choice, we sequenced RNA from eight primary human hematopoietic progenitor populations representing the major myeloid commitment stages and the main lymphoid stage. We identified extensive cell type-specific expression changes: 6711 genes and 10,724 transcripts, enriched in non-protein-coding elements at early stages of differentiation. In addition, we found 7881 novel splice junctions and 2301 differentially used alternative splicing events, enriched in genes involved in regulatory processes. We demonstrated cell-specific isoform usage experimentally, identifying nuclear factor I/B (NFIB) as a regulator of maturation of the megakaryocyte, the platelet precursor. Our data highlight the complexity of fating events in closely related progenitor populations, the understanding of which is essential for the advancement of transplantation and regenerative medicine. Copyright © 2014, American Association for the Advancement of Science.


September 22, 2019  |  

PacBio sequencing and its applications.

Single-molecule, real-time sequencing developed by Pacific Biosciences offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well-suited for unsolved problems in genome, transcriptome, and epigenetics research. The highly contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its ability to sequence full-length transcripts or fragments with significant lengths. Additionally, PacBio’s sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. In addition to using PacBio sequencing alone, many hybrid sequencing strategies have been developed to make use of more accurate short reads in conjunction with PacBio long reads. In general, hybrid sequencing strategies are more affordable and scalable, especially for small laboratories, than using PacBio sequencing alone. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd. All rights reserved.


September 22, 2019  |  

Draft genome assembly of the poultry red mite, Dermanyssus gallinae.

The poultry red mite, Dermanyssus gallinae, is a major worldwide concern in the egg-laying industry. Here, we report the first draft genome assembly and gene prediction of Dermanyssus gallinae, based on combined PacBio and MinION long-read de novo sequencing. The ~959-Mb genome is predicted to encode 14,608 protein-coding genes.


September 22, 2019  |  

De novo assembly of a Chinese soybean genome.

Soybean was domesticated in China and has become one of the most important oilseed crops. Due to bottlenecks in their introduction and dissemination, soybeans from different geographic areas exhibit extensive genetic diversity. Asia is the largest soybean market; therefore, a high-quality soybean reference genome from this area is critical for soybean research and breeding. Here, we report the de novo assembly and sequence analysis of a Chinese soybean genome for “Zhonghuang 13” by a combination of SMRT, Hi-C and optical mapping data. The assembled genome size is 1.025 Gb with a contig N50 of 3.46 Mb and a scaffold N50 of 51.87 Mb. Comparisons between this genome and the previously reported reference genome (cv. Williams 82) uncovered more than 250,000 structural variations. A total of 52,051 protein coding genes and 36,429 transposable elements were annotated for this genome, and a gene co-expression network including 39,967 genes was also established. This high quality Chinese soybean genome and its sequence analysis will provide valuable information for soybean improvement in the future.


September 22, 2019  |  

Genome and secretome analysis of Pochonia chlamydosporia provide new insight into egg-parasitic mechanisms.

Pochonia chlamydosporia infects eggs and females of economically important plant-parasitic nematodes. The fungal isolates parasitizing different nematodes are genetically distinct. To understand their intraspecific genetic differentiation, parasitic mechanisms, and adaptive evolution, we assembled seven putative chromosomes of P. chlamydosporia strain 170 isolated from root-knot nematode eggs (~44 Mb, including 7.19% of transposable elements) and compared them with the genome of the strain 123 (~41 Mb) isolated from cereal cyst nematode. We focus on secretomes of the fungus, which play important roles in pathogenicity and fungus-host/environment interactions, and identified 1,750 secreted proteins, with a high proportion of carboxypeptidases, subtilisins, and chitinases. We analyzed the phylogenies of these genes and predicted new pathogenic molecules. By comparative transcriptome analysis, we found that secreted proteins involved in responses to nutrient stress mainly comprise proteases and glycoside hydrolases. Moreover, 32 secreted proteins undergoing positive selection and 71 duplicated gene pairs encoding secreted proteins were identified. Two duplicated pairs encoding secreted glycosyl hydrolases (GH30), which may be related to the fungal endophytic process and are lost in many insect-pathogenic fungi but present in nematophagous fungi, were putatively acquired from bacteria by horizontal gene transfer. The results help in understanding the genetic origins and evolution of parasitism-related genes.


September 22, 2019  |  

Pacbio sequencing of copper-tolerant Xanthomonas citri reveals presence of a chimeric plasmid structure and provides insights into reassortment and shuffling of transcription activator-like effectors among X. citri strains.

Xanthomonas citri, a causal agent of citrus canker, has been a well-studied model system due to recent availability of whole genome sequences of multiple strains from different geographical regions. Major limitations in our understanding of the evolution of pathogenicity factors in X. citri strains sequenced by short-read sequencing methods have been tracking plasmid reshuffling among strains due to inability to accurately assign reads to plasmids, and analyzing repeat regions among strains. X. citri harbors major pathogenicity determinants, including variable DNA-binding repeat region containing Transcription Activator-like Effectors (TALEs) on plasmids. The long-read sequencing method, PacBio, has made it possible to obtain complete and accurate sequences of TALEs in xanthomonads. We recently sequenced Xanthomonas citri str. Xc-03-1638-1-1, a copper tolerant A group strain isolated from grapefruit in 2003 from Argentina using PacBio RS II chemistry. We analyzed plasmid profiles, copy number and location of TALEs in complete genome sequences of X. citri strains. We utilized the power of long reads obtained by PacBio sequencing to enable assembly of a complete genome sequence of strain Xc-03-1638-1-1, including sequences of two plasmids, 249 kb (plasmid harboring copper resistance genes) and 99 kb (pathogenicity plasmid containing TALEs). The pathogenicity plasmid in this strain is a hybrid plasmid containing four TALEs. Due to the intriguing nature of this pathogenicity plasmid with Tn3-like transposon association, repetitive elements and multiple putative sites for origins of replication, we might expect alternative structures of this plasmid in nature, illustrating the strong adaptive potential of X. citri strains. Analysis of the pathogenicity plasmid among completely sequenced X. citri strains, coupled with Southern hybridization of the pathogenicity plasmids, revealed clues to rearrangements of plasmids and resulting reshuffling of TALEs among strains. We demonstrate in this study the importance of long-read sequencing for obtaining intact sequences of TALEs and plasmids, as well as for identifying rearrangement events including plasmid reshuffling. Rearrangement events, such as the hybrid plasmid in this case, could be a frequent phenomenon in the evolution of X. citri strains, although so far it is undetected due to the inability to obtain complete plasmid sequences with short-read sequencing methods.


September 22, 2019  |  

Extreme haplotype variation in the desiccation-tolerant clubmoss Selaginella lepidophylla.

Plant genome size varies by four orders of magnitude, and most of this variation stems from dynamic changes in repetitive DNA content. Here we report the small 109 Mb genome of Selaginella lepidophylla, a clubmoss with extreme desiccation tolerance. Single-molecule sequencing enables accurate haplotype assembly of a single heterozygous S. lepidophylla plant, revealing extensive structural variation. We observe numerous haplotype-specific deletions consisting of largely repetitive and heavily methylated sequences, with enrichment in young Gypsy LTR retrotransposons. Such elements are active but rapidly deleted, suggesting “bloat and purge” to maintain a small genome size. Unlike all other land plant lineages, Selaginella has no evidence of a whole-genome duplication event in its evolutionary history, but instead shows unique tandem gene duplication patterns reflecting adaptation to extreme drying. Gene expression changes during desiccation in S. lepidophylla mirror patterns observed across angiosperm resurrection plants.


September 22, 2019  |  

Egg case silk gene sequences from Argiope spiders: Evidence for multiple loci and a loss of function between paralogs.

Spiders swathe their eggs with silk to protect developing embryos and hatchlings. Egg case silks, like other fibrous spider silks, are primarily composed of proteins called spidroins (spidroin = spider-fibroin). Silks, and thus spidroins, are important throughout the lives of spiders, yet the evolution of spidroin genes has been relatively understudied. Spidroin genes are notoriously difficult to sequence because they are typically very long (≥ 10 kb of coding sequence) and highly repetitive. Here, we investigate the evolution of spider silk genes through long-read sequencing of Bacterial Artificial Chromosome (BAC) clones. We demonstrate that the silver garden spider Argiope argentata has multiple egg case spidroin loci with a loss of function at one locus. We also use degenerate PCR primers to search the genomic DNA of congeneric species and find evidence for multiple egg case spidroin loci in other Argiope spiders. Comparative analyses show that these multiple loci are more similar at the nucleotide level within a species than between species. This pattern is consistent with concerted evolution homogenizing gene copies within a genome. More complicated explanations include convergent evolution or recent independent gene duplications within each species. Copyright © 2018 Chaw et al.

