Menu
July 7, 2019

The draft genome of whitefly Bemisia tabaci MEAM1, a global crop pest, provides novel insights into virus transmission, host adaptation, and insecticide resistance.

The whitefly Bemisia tabaci (Hemiptera: Aleyrodidae) is among the 100 worst invasive species in the world. As one of the most important crop pests and virus vectors, B. tabaci causes substantial crop losses and poses a serious threat to global food security. We report the 615-Mb high-quality genome sequence of B. tabaci Middle East-Asia Minor 1 (MEAM1), the first genome sequence in the Aleyrodidae family, which contains 15,664 protein-coding genes. The B. tabaci genome is highly divergent from other sequenced hemipteran genomes, sharing no detectable synteny. A number of known detoxification gene families, including cytochrome P450s and UDP-glucuronosyltransferases, are significantly expanded in B. tabaci. Other expanded gene families, including cathepsins, large clusters of tandemly duplicated B. tabaci-specific genes, and phosphatidylethanolamine-binding proteins (PEBPs), were found to be associated with virus acquisition and transmission and/or insecticide resistance, likely contributing to the global invasiveness and efficient virus transmission capacity of B. tabaci. The presence of 142 horizontally transferred genes from bacteria or fungi in the B. tabaci genome, including genes encoding hopanoid/sterol synthesis and xenobiotic detoxification enzymes that are not present in other insects, offers novel insights into the unique biological adaptations of this insect such as polyphagy and insecticide resistance. Interestingly, two adjacent bacterial pantothenate biosynthesis genes, panB and panC, have been co-transferred into B. tabaci and fused into a single gene that has acquired introns during its evolution.The B. tabaci genome contains numerous genetic novelties, including expansions in gene families associated with insecticide resistance, detoxification and virus transmission, as well as numerous horizontally transferred genes from bacteria and fungi. We believe these novelties likely have shaped B. tabaci as a highly invasive polyphagous crop pest and efficient vector of plant viruses. The genome serves as a reference for resolving the B. tabaci cryptic species complex, understanding fundamental biological novelties, and providing valuable genetic information to assist the development of novel strategies for controlling whiteflies and the viruses they transmit.


July 7, 2019

Improved assembly of noisy long reads by k-mer validation.

Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (~15% per base) makes assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging on the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.© 2016 Carvalho et al.; Published by Cold Spring Harbor Laboratory Press.


July 7, 2019

Jabba: hybrid error correction for long sequencing reads.

Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned.In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented.Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.


July 7, 2019

CoLoRMap: Correcting Long Reads by Mapping short reads.

Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormapehaghshe@sfu.ca or cedric.chauve@sfu.caSupplementary data are available at Bioinformatics online.© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.


July 7, 2019

Susan Celniker: Foundational resources to study a dynamic genome.

The Genetics Society of America’s George W. Beadle Award honors individuals who have made outstanding contributions to the community of genetics researchers and who exemplify the qualities of its namesake. The 2016 recipient, Susan E. Celniker, played a key role in the sequencing, annotation, and characterization of the Drosophila genome. She participated in early sequencing efforts at the Lawrence Berkeley National Laboratory and led the modENCODE Fly Transcriptome Consortium. Her efforts were critical to ensuring that the Drosophila genome was well-annotated, making it one of the best curated animal genomes available. As the Principal Investigator for the BDGP, Celniker has enabled the study of proteomes by creating a collection of over 13,000 clones that match annotated genes for protein expression in cells or transgenic flies, and she has established the most comprehensive spatial gene expression atlas in any organism, with in situ imaging of more than 80% of the Drosophila protein-coding transcriptome through embryogenesis. In addition to providing the research community with these invaluable resources and reagents, she continues to develop new tools and datasets for genetics researchers to explore the spatial and temporal control of gene expression.


July 7, 2019

Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections made. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy followed on the accuracy of these methods have hitherto not been thoroughly evaluated.We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1?kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.


July 7, 2019

Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D.

Completion of eukaryal genomes can be difficult task with the highly repetitive sequences along the chromosomes and short read lengths of second-generation sequencing. Saccharomyces cerevisiae strain CEN.PK113-7D, widely used as a model organism and a cell factory, was selected for this study to demonstrate the superior capability of very long sequence reads for de novo genome assembly. We generated long reads using two common third-generation sequencing technologies (Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio)) and used short reads obtained using Illumina sequencing for error correction. Assembly of the reads derived from all three technologies resulted in complete sequences for all 16 yeast chromosomes, as well as the mitochondrial chromosome, in one step. Further, we identified three types of DNA methylation (5mC, 4mC and 6mA). Comparison between the reference strain S288C and strain CEN.PK113-7D identified chromosomal rearrangements against a background of similar gene content between the two strains. We identified full-length transcripts through ONT direct RNA sequencing technology. This allows for the identification of transcriptional landscapes, including untranslated regions (UTRs) (5′ UTR and 3′ UTR) as well as differential gene expression quantification. About 91% of the predicted transcripts could be consistently detected across biological replicates grown either on glucose or ethanol. Direct RNA sequencing identified many polyadenylated non-coding RNAs, rRNAs, telomere-RNA, long non-coding RNA and antisense RNA. This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.


July 7, 2019

RepLong: de novo repeat identification using long read sequencing data.

The identification of repetitive elements is important in genome assembly and phylogenetic analyses. The existing de novo repeat identification methods exploiting the use of short reads are impotent in identifying long repeats. Since long reads are more likely to cover repeat regions completely, using long reads is more favorable for recognizing long repeats.In this study, we propose a novel de novo repeat elements identification method namely RepLong based on PacBio long reads. Given that the reads mapped to the repeat regions are highly overlapped with each other, the identification of repeat elements is equivalent to the discovery of consensus overlaps between reads, which can be further cast into a community detection problem in the network of read overlaps. In RepLong, we first construct a network of read overlaps based on pair-wise alignment of the reads, where each vertex indicates a read and an edge indicates a substantial overlap between the corresponding two reads. Secondly, the communities whose intra connectivity is greater than the inter connectivity are extracted based on network modularity optimization. Finally, representative reads in each community are extracted to form the repeat library. Comparison studies on Drosophila melanogaster and human long read sequencing data with genome-based and short-read-based methods demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower coverage data and serve as a complementary solution to the existing methods to promote the repeat identification performance on long-read sequencing data.The software of RepLong is freely available at https://github.com/ruiguo-bio/replong.ywsun@szu.edu.cn or zhuzx@szu.edu.cn.Supplementary data are available at Bioinformatics online.


July 7, 2019

GenomeLandscaper: Landscape analysis of genome-fingerprints maps assessing chromosome architecture.

Assessing correctness of an assembled chromosome architecture is a central challenge. We create a geometric analysis method (called GenomeLandscaper) to conduct landscape analysis of genome-fingerprints maps (GFM), trace large-scale repetitive regions, and assess their impacts on the global architectures of assembled chromosomes. We develop an alignment-free method for phylogenetics analysis. The human Y chromosomes (GRCh.chrY, HuRef.chrY and YH.chrY) are analysed as a proof-of-concept study. We construct a galaxy of genome-fingerprints maps (GGFM) for them, and a landscape compatibility among relatives is observed. But a long sharp straight line on the GGFM breaks such a landscape compatibility, distinguishing GRCh38p1.chrY (and throughout GRCh38p7.chrY) from GRCh37p13.chrY, HuRef.chrY and YH.chrY. We delete a 1.30-Mbp target segment to rescue the landscape compatibility, matching the antecedent GRCh37p13.chrY. We re-locate it into the modelled centromeric and pericentromeric region of GRCh38p10.chrY, matching a gap placeholder of GRCh37p13.chrY. We decompose it into sub-constituents (such as BACs, interspersed repeats, and tandem repeats) and trace their homologues by phylogenetics analysis. We elucidate that most examined tandem repeats are of reasonable quality, but the BAC-sized repeats, 173U1020C (176.46 Kbp) and 5U41068C (205.34 Kbp), are likely over-repeated. These results offer unique insights into the centromeric and pericentromeric regions of the human Y chromosomes.


July 7, 2019

Supergene evolution triggered by the introgression of a chromosomal inversion.

Supergenes are groups of tightly linked loci whose variation is inherited as a single Mendelian locus and are a common genetic architecture for complex traits under balancing selection [1-8]. Supergene alleles are long-range haplotypes with numerous mutations underlying distinct adaptive strategies, often maintained in linkage disequilibrium through the suppression of recombination by chromosomal rearrangements [1, 5, 7-9]. However, the mechanism governing the formation of supergenes is not well understood and poses the paradox of establishing divergent functional haplotypes in the face of recombination. Here, we show that the formation of the supergene alleles encoding mimicry polymorphism in the butterfly Heliconius numata is associated with the introgression of a divergent, inverted chromosomal segment. Haplotype divergence and linkage disequilibrium indicate that supergene alleles, each allowing precise wing-pattern resemblance to distinct butterfly models, originate from over a million years of independent chromosomal evolution in separate lineages. These “superalleles” have evolved from a chromosomal inversion captured by introgression and maintained in balanced polymorphism, triggering supergene inheritance. This mode of evolution involving the introgression of a chromosomal rearrangement is likely to be a common feature of complex structural polymorphisms associated with the coexistence of distinct adaptive syndromes. This shows that the reticulation of genealogies may have a powerful influence on the evolution of genetic architectures in nature. Copyright © 2018 Elsevier Ltd. All rights reserved.


July 7, 2019

Satellite DNA evolution: old ideas, new approaches.

A substantial portion of the genomes of most multicellular eukaryotes consists of large arrays of tandemly repeated sequence, collectively called satellite DNA. The processes generating and maintaining different satellite DNA abundances across lineages are important to understand as satellites have been linked to chromosome mis-segregation, disease phenotypes, and reproductive isolation between species. While much theory has been developed to describe satellite evolution, empirical tests of these models have fallen short because of the challenges in assessing satellite repeat regions of the genome. Advances in computational tools and sequencing technologies now enable identification and quantification of satellite sequences genome-wide. Here, we describe some of these tools and how their applications are furthering our knowledge of satellite evolution and function. Copyright © 2018 Elsevier Ltd. All rights reserved.


July 7, 2019

To B or not to B: a tale of unorthodox chromosomes.

Highlights • B chromosomes are dispensable parts of the karyotype of many eukaryotes. • Deemed genome parasites in plants and animals, provide advantage to pathogenic fungi. • Often enriched in repeats and in fast evolving pathogenicity-related genes. • B chromosomes are not a uniform class, share certain features with core chromosomes.


July 7, 2019

A draft genome sequence for the Ixodes scapularis cell line, ISE6

Background: The tick cell line ISE6, derived from Ixodes scapularis, is commonly used for amplification and detection of arboviruses in environmental or clinical samples. Methods: To assist with sequence-based assays, we sequenced the ISE6 genome with single-molecule, long-read technology. Results: The draft assembly appears near complete based on gene content analysis, though it appears to lack some instances of repeats in this highly repetitive genome. The assembly appears to have separated the haplotypes at many loci. DNA short read pairs, used for validation only, mapped to the cell line assembly at a higher rate than they mapped to the Ixodes scapularis reference genome sequence. Conclusions: The assembly could be useful for filtering host genome sequence from sequence data obtained from cells infected with pathogens.


July 7, 2019

Whole genome sequence and phenotypic characterization of a Cbm+ serotype e strain of Streptococcus mutans.

We report the whole genome sequence of the serotype e Cbm+ strain LAR01 of Streptococcus mutans, a dental pathogen frequently associated with extra-oral infections. The LAR01 genome is a single circular chromosome of 2.1 Mb with a GC content of 36.96%. The genome contains 15 phosphotransferase system gene clusters, seven cell wall-anchored (LPxTG) proteins, all genes required for the development of natural competence and genes coding for mutacins VI and K8. Interestingly, the cbm gene is genetically linked to a putative type VII secretion system that has been found in Mycobacteria and few other Gram-positive bacteria. When compared with the UA159 type strain, phenotypic characterization of LAR01 revealed increased biofilm formation in the presence of either glucose or sucrose but similar abilities to withstand acid and oxidative stresses. LAR01 was unable to inhibit the growth of Strpetococcus gordonii, which is consistent with the genomic data that indicate absence of mutacins that can kill mitis streptococci. On the other hand, LAR01 effectively inhibited growth of other S. mutans strains, suggesting that it may be specialized to outcompete strains from its own species. In vitro and in vivo studies using mutational and heterologous expression approaches revealed that Cbm is a virulence factor of S. mutans by mediating binding to extracellular matrix proteins and intracellular invasion. Collectively, the whole genome sequence analysis and phenotypic characterization of LAR01 provides new insights on the virulence properties of S. mutans and grants further opportunities to understand the genomic fluidity of this important human pathogen.© 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.


July 7, 2019

The case for not masking away repetitive DNA

In the course of analyzing whole-genome data, it is common practice to mask or filter out repetitive regions of a genome, such as transposable elements and endogenous retroviruses, in order to focus only on genes and thus simplify the results. This Commentary is a plea from one member of the Mobile DNA community to all gene-centric researchers: please do not ignore the repetitive fraction of the genome. Please stop narrowing your findings by only analyzing a minority of the genome, and instead broaden your analyses to include the rich biology of repetitive and mobile DNA. In this article, I present four arguments supporting a case for retaining repetitive DNA in your genome-wide analysis.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.