Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. elegans available today. To provide a more accurate C. elegans genome, we performed long-read assembly of VC2010, a modern strain derived from N2. Our VC2010 assembly has 99.98% identity to N2 but with an additional 1.8 Mb including tandem repeat expansions and genome duplications. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2. Over 98% of N2 genes encoded unchanged products in VC2010; moreover, we predicted =53 new genes in VC2010. The recompleted genome of C. elegans should be a valuable resource for genetics, genomics, and systems biology. © 2019 Yoshimura et al.; Published by Cold Spring Harbor Laboratory Press.
The robust detection of structural variants in mammalian genomes remains a challenge. It is particularly difficult in the case of genetically unstable Chinese hamster ovary (CHO) cell lines with only draft genome assemblies available. We explore the potential of the CRISPR/Cas9 system for the targeted capture of genomic loci containing integrated vectors in CHO-K1-based cell lines followed by next generation sequencing (NGS), and compare it to popular target-enrichment sequencing methods and to whole genome sequencing (WGS). Three different CRISPR/Cas9-based techniques were evaluated; all of them allow for amplification-free enrichment of target genomic regions in the range from 5 to 60 fold, and for recovery of ~15 kb-long sequences with no sequencing artifacts introduced. The utility of these protocols has been proven by the identification of transgene integration sites and flanking sequences in three CHO cell lines. The long enriched fragments helped to identify Escherichia coli genome sequences co-integrated with vectors, and were further characterized by Whole Genome Sequencing (WGS). Other advantages of CRISPR/Cas9-based methods are the ease of bioinformatics analysis, potential for multiplexing, and the production of long target templates for real-time sequencing.
Normalization of cDNA is widely used to improve the coverage of rare transcripts in analysis of transcriptomes employing next-generation sequencing. Recently, long-read technology has been emerging as a powerful tool for sequencing and construction of transcriptomes, especially for complex genomes containing highly similar transcripts and transcript-spliced isoforms. Here, we analyzed the transcriptome of sugarcane, with a highly polyploidy plant genome, by PacBio isoform sequencing (Iso-Seq) of two different cDNA library preparations, with and without a normalization step. The results demonstrated that, while the two libraries included many of the same transcripts, many longer transcripts were removed and many new generally shorter transcripts were detected by normalization. For the same input cDNA and the same data yield, the normalized library recovered more total transcript isoforms, number of predicted gene families and orthologous groups, resulting in a higher representation for the sugarcane transcriptome, compared to the non-normalized library. The non-normalized library, on the other hand, included a wider transcript length range with more longer transcripts above ~1.25 kb, more transcript isoforms per gene family and gene ontology terms per transcript. A large proportion of the unique transcripts comprising ~52% of the normalized library were expressed at a lower level than the unique transcripts from the non-normalized library, across three tissue types tested including leaf, stalk and root. About 83% of the total 5,348 predicted long noncoding transcripts was derived from the normalized library, of which ~80% was derived from the lowly expressed fraction. Functional annotation of the unique transcripts suggested that each library enriched different functional transcript fractions. This demonstrated the complementation of the two approaches in obtaining a complete transcriptome of a complex genome at the sequencing depth used in this study.
Chromosome-level assembly of the water buffalo genome surpasses human and goat genomes in sequence contiguity.
Rapid innovation in sequencing technologies and improvement in assembly algorithms have enabled the creation of highly contiguous mammalian genomes. Here we report a chromosome-level assembly of the water buffalo (Bubalus bubalis) genome using single-molecule sequencing and chromatin conformation capture data. PacBio Sequel reads, with a mean length of 11.5?kb, helped to resolve repetitive elements and generate sequence contiguity. All five B. bubalis sub-metacentric chromosomes were correctly scaffolded with centromeres spanned. Although the index animal was partly inbred, 58% of the genome was haplotype-phased by FALCON-Unzip. This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps. It surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard to assemble gene clusters such as the major histocompatibility complex (MHC).
Sweetpotato (Ipomoea batatas (L.) Lam.) is the seventh most important crop in the world and is mainly cultivated for its underground storage root (SR). The genetic studies of this species have been hindered by a lack of high-quality reference sequence due to its complex genome structure. Diploid Ipomoea trifida is the closest relative and putative progenitor of sweetpotato, which is considered a model species for sweetpotato, including genetic, cytological, and physiological analyses.Here, we generated the chromosome-scale genome sequence of SR-forming diploid I. trifida var. Y22 with high heterozygosity (2.20%). Although the chromosome-based synteny analysis revealed that the I. trifida shared conserved karyotype with Ipomoea nil after the separation, I. trifida had a much smaller genome than I. nil due to more efficient eliminations of LTR-retrotransposons and lack of species-specific amplification bursts of LTR-RTs. A comparison with four non-SR-forming species showed that the evolution of the beta-amylase gene family may be related to SR formation. We further investigated the relationship of the key gene BMY11 (with identity 47.12% to beta-amylase 1) with this important agronomic trait by both gene expression profiling and quantitative trait locus (QTL) mapping. And combining SR morphology and structure, gene expression profiling and qPCR results, we deduced that the products of the activity of BMY11 in splitting starch granules and be recycled to synthesize larger granules, contributing to starch accumulation and SR swelling. Moreover, we found the expression pattern of BMY11, sporamin proteins and the key genes involved in carbohydrate metabolism and stele lignification were similar to that of sweetpotato during the SR development.We constructed the high-quality genome reference of the highly heterozygous I. trifida through a combined approach and this genome enables a better resolution of the genomics feature and genome evolutions of this species. Sweetpotato SR development genes can be identified in I. trifida and these genes perform similar functions and patterns, showed that the diploid I. trifida var. Y22 with typical SR could be considered an ideal model for the studies of sweetpotato SR development.
Shigella flexneri has an extremely complex genome with a significant number of virulence traits acquired by mobile genetic elements including bacteriophages and plasmids. S. flexneri serotype 1c is an emerging etiological agent of bacillary dysentery in developing countries. In this study, the complete nucleotide sequence of two plasmids of S. flexneri serotype 1c strain Y394 was determined and analysed.The plasmid pINV-Y394 is an invasive or virulence plasmid of size 221,293?bp composed of a large number of insertion sequences (IS), virulence genes, regulatory and maintenance genes. Three hundred and twenty-eight open reading frames (ORFs) were identified in pINV-Y394, of which about a half (159 ORFs) were identified as IS elements. Ninety-seven ORFs were related to characterized genes (majority of which are associated with virulence and their regulons), and 72 ORFs were uncharacterized or hypothetical genes. The second plasmid pNV-Y394 is of size 10,866?bp and encodes genes conferring resistance against multiple antibiotics of clinical importance. The multidrug resistance gene cassette consists of tetracycline resistance gene tetA, streptomycin resistance gene strA-strB and sulfonamide-resistant dihydropteroate synthase gene sul2.These two plasmids together play a key role in the fitness of Y394 in the host environment. The findings from this study indicate that the pathogenic S. flexneri is a highly niche adaptive pathogen which is able to co-evolve with its host and respond to the selection pressure in its environment.
The recent release of genomic sequences for 3000 rice varieties provides access to the genetic diversity at species level for this crop. We take advantage of this resource to unravel some features of the retrotranspositional landscape of rice. We develop software TRACKPOSON specifically for the detection of transposable elements insertion polymorphisms (TIPs) from large datasets. We apply this tool to 32 families of retrotransposons and identify more than 50,000 TIPs in the 3000 rice genomes. Most polymorphisms are found at very low frequency, suggesting that they may have occurred recently in agro. A genome-wide association study shows that these activations in rice may be triggered by external stimuli, rather than by the alteration of genetic factors involved in transposable element silencing pathways. Finally, the TIPs dataset is used to trace the origin of rice domestication. Our results suggest that rice originated from three distinct domestication events.