September 22, 2019

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Due to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often because short contigs cannot be anchored or are mispositioned on chromosomes. Here we report a novel method, Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats at high efficiency with single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in the maize B73 assembly and from 8.3 Mb to 54.4 Mb in the human HX1 assembly with HERA. We provided a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high-quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.
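Contig N50, the contiguity metric quoted in this abstract, is conventionally defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal Python sketch of that definition, assuming a plain list of contig lengths; the function name `n50` and the toy lengths are illustrative and are not taken from the HERA paper or its data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


# Hypothetical contig lengths in Mb (not real assembly data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the metric depends only on the length distribution, merging a few contigs across repeats can raise N50 sharply without changing the total assembly size, which is why it is usually reported alongside completeness and gap counts.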


September 22, 2019

Three substrains of the cyanobacterium Anabaena sp. PCC 7120 display divergence in genomic sequences and hetC function.

Anabaena sp. strain PCC 7120 is a model strain for molecular studies of cell differentiation and patterning in heterocyst-forming cyanobacteria. Subtle differences in heterocyst development have been noticed in different laboratories working on the same organism. In this study, 360 mutations, including single nucleotide polymorphisms (SNPs), small insertion/deletions (indels; 1 to 3 bp), fragment deletions, and transpositions, were identified in the genomes of three substrains. Heterogeneous/heterozygous bases were also identified due to the polyploid nature of the genome and the multicellular morphology but could be completely segregated when plated after filament fragmentation by sonication. hetC is a gene upregulated in developing cells during heterocyst formation in Anabaena sp. strain PCC 7120 and found in approximately half of other heterocyst-forming cyanobacteria. Inactivation of hetC in three substrains of Anabaena sp. PCC 7120 led to different phenotypes: the formation of heterocysts, differentiating cells that keep dividing, or the presence of both heterocysts and dividing differentiating cells. The expression of PhetZ-gfp in these hetC mutants also showed different patterns of green fluorescent protein (GFP) fluorescence. Thus, the function of hetC is influenced by the genomic background and epistasis and constitutes an example of evolution under way. IMPORTANCE: Our knowledge about the molecular genetics of heterocyst formation, an important cell differentiation process for global N2 fixation, is mostly based on studies with Anabaena sp. strain PCC 7120. Here, we show that rapid microevolution is under way in this strain, leading to phenotypic variations for certain genes related to heterocyst development, such as hetC. This study provides an example of ongoing microevolution, marked by multiple heterogeneous/heterozygous single nucleotide polymorphisms (SNPs), in a multicellular multicopy-genome microorganism.
Copyright © 2018 American Society for Microbiology.


September 22, 2019

High-quality assembly of the reference genome for scarlet sage, Salvia splendens, an economically important ornamental plant.

Salvia splendens Ker-Gawler, scarlet or tropical sage, is a tender herbaceous perennial widely introduced and seen in public gardens all over the world. With few molecular resources, breeding is still restricted to traditional phenotypic selection, and the genetic mechanisms underlying phenotypic variation remain unknown. Hence, a high-quality reference genome will be very valuable for marker-assisted breeding, genome editing, and molecular genetics. We generated 66 Gb and 37 Gb of raw DNA sequences, respectively, from whole-genome sequencing of a largely homozygous scarlet sage inbred line using Pacific Biosciences (PacBio) single-molecule real-time and Illumina HiSeq sequencing platforms. The PacBio de novo assembly yielded a final genome with a scaffold N50 size of 3.12 Mb and a total length of 808 Mb. The repetitive sequences identified accounted for 57.52% of the genome sequence, and ~54,008 protein-coding genes were predicted collectively with ab initio and homology-based gene prediction from the masked genome. The divergence time between S. splendens and Salvia miltiorrhiza was estimated at 28.21 million years ago (Mya). Moreover, 3,797 species-specific genes and 1,187 expanded gene families were identified for the scarlet sage genome. We provide the first genome sequence and gene annotation for the scarlet sage. The availability of these resources will be of great importance for further breeding strategies, genome editing, and comparative genomics among related species.


September 22, 2019

Improved Brassica rapa reference genome by single-molecule sequencing and chromosome conformation capture technologies.

Brassica rapa comprises several important cultivated vegetables and oil crops. Current reference genome assemblies of Brassica rapa are quite fragmented and not highly contiguous, thereby limiting extensive genetic and genomic analyses. Here, we report an improved assembly of the B. rapa genome (v3.0) using single-molecule sequencing, optical mapping, and chromosome conformation capture technologies (Hi-C). Relative to the previous reference genomes, our assembly features a contig N50 size of 1.45 Mb, representing a ~30-fold improvement. We also identified a new event that occurred in the B. rapa genome ~1.2 million years ago, when a long terminal repeat retrotransposon (LTR-RT) expanded. Further analysis refined the relationship of genome blocks and accurately located the centromeres in the B. rapa genome. The B. rapa genome v3.0 will serve as an important community resource for future genetic and genomic studies in B. rapa. This resource will facilitate breeding efforts in B. rapa, as well as comparative genomic analysis with other Brassica species.


September 22, 2019

The chromosome-level genome assemblies of two rattans (Calamus simplicifolius and Daemonorops jenkinsiana).

Calamus simplicifolius and Daemonorops jenkinsiana are two representative rattans, the most significant material sources for the rattan industry. However, the lack of reference genome sequences is a major obstacle for basic and applied biology on rattan. We produced two chromosome-level genome assemblies of C. simplicifolius and D. jenkinsiana using Illumina, Pacific Biosciences, and Hi-C sequencing data. A total of ~730 Gb and ~682 Gb of raw data covered the predicted genome lengths (~1.98 Gb of C. simplicifolius and ~1.61 Gb of D. jenkinsiana) to ~372× and ~426× read depths, respectively. The two de novo genome assemblies, ~1.94 Gb and ~1.58 Gb, were generated with scaffold N50s of ~160 Mb and ~119 Mb in C. simplicifolius and D. jenkinsiana, respectively. The C. simplicifolius and D. jenkinsiana genomes were predicted to harbor ~51,235 and ~53,342 intact protein-coding gene models, respectively. Benchmarking Universal Single-Copy Orthologs evaluation demonstrated that genome completeness reached 96.4% and 91.3% in the C. simplicifolius and D. jenkinsiana genomes, respectively. Genome evolution showed that four Arecaceae plants clustered together, and the divergence time between the two rattans was ~19.3 million years ago. Additionally, we identified 193 and 172 genes involved in the lignin biosynthesis pathway in the C. simplicifolius and D. jenkinsiana genomes, respectively. We present the first de novo assemblies of two rattan genomes (C. simplicifolius and D. jenkinsiana). These data will not only provide a fundamental resource for functional genomics, particularly in promoting germplasm utilization for breeding, but also serve as reference genomes for comparative studies between and among different species.


September 22, 2019

Draft genome of Glyptosternon maculatum, an endemic fish from Tibet Plateau.

Mechanisms for high-altitude adaption have attracted widespread interest among evolutionary biologists. Several genome-wide studies have been carried out for endemic vertebrates in Tibet, including mammals, birds, and amphibians. However, little information is available about the adaptive evolution of highland fishes. Glyptosternon maculatum (Regan 1905), also known as Regan or barkley and endemic to the Tibetan Plateau, belongs to the Sisoridae family, order Siluriformes (catfishes). This species lives at an elevation ranging from roughly 2,800 m to 4,200 m. Hence, a high-quality reference genome of G. maculatum provides an opportunity to investigate high-altitude adaption mechanisms of fishes. To obtain a high-quality reference genome sequence of G. maculatum, we combined Pacific Biosciences single-molecule real-time sequencing, Illumina paired-end sequencing, 10X Genomics linked-reads, and BioNano optical map techniques. In total, 603.99 Gb of sequencing data were generated. The assembled genome was about 662.34 Mb with scaffold and contig N50 sizes of 20.90 Mb and 993.67 kb, respectively, which captured 83% complete and 3.9% partial vertebrate Benchmarking Universal Single-Copy Orthologs. Repetitive elements account for 35.88% of the genome, and ~22,066 protein-coding genes were predicted from the genome, of which 91.7% have been functionally annotated. We present the first comprehensive de novo genome of G. maculatum. This genetic resource is fundamental for investigating the origin of G. maculatum and will improve our understanding of high-altitude adaption of fishes. The assembled genome can also be used as a reference for future population genetic studies of G. maculatum.


September 22, 2019

A draft genome assembly of the Chinese sillago (Sillago sinica), the first reference genome for Sillaginidae fishes.

Sillaginidae, also known as smelt-whitings, is a family of benthic coastal marine fishes in the Indo-West Pacific that have high ecological and economic importance. Many Sillaginidae species, including the Chinese sillago (Sillago sinica), have been recently described in China, providing valuable material to analyze genetic diversification of the family Sillaginidae. Here, we constructed a reference genome for the Chinese sillago, with the aim to set up a platform for comparative analysis of all species in this family. Using the single-molecule real-time DNA sequencing platform Pacific Biosciences (PacBio) Sequel, we generated ~27.3 Gb genomic DNA sequences for the Chinese sillago. We reconstructed a genome assembly of 534 Mb using a strategy that takes advantage of complementary strengths of two genome assembly programs, Canu and FALCON. The genome size was consistent with the estimated genome size based on k-mer analysis. The assembled genome consisted of 802 contigs with a contig N50 length of 2.6 Mb. We annotated 22,122 protein-coding genes in the Chinese sillago genome using a de novo method as well as RNA sequencing data and homologies to other teleosts. According to the phylogenetic analysis using protein-coding genes, the Chinese sillago is closely related to Larimichthys crocea and Dicentrarchus labrax and diverged from their ancestor around 69.5-82.6 million years ago. Using long reads generated with PacBio sequencing technology, we have built a draft genome assembly for the Chinese sillago, which is the first reference genome for Sillaginidae species. This genome assembly sets a stage for comparative analysis of the diversification and adaptation of fishes in Sillaginidae.


September 22, 2019

Complete genome sequence of a blaKPC-2-positive Klebsiella pneumoniae strain isolated from the effluent of an urban sewage treatment plant in Japan.

Antimicrobial resistance genes (ARGs) and the bacteria that harbor them are widely distributed in the environment, especially in surface water, sewage treatment plant effluent, soil, and animal waste. In this study, we isolated a KPC-2-producing Klebsiella pneumoniae strain (GSU10-3) from a sampling site in Tokyo Bay, Japan, near a wastewater treatment plant (WWTP) and determined its complete genome sequence. Strain GSU10-3 is resistant to most β-lactam antibiotics and other antimicrobial agents (quinolones and aminoglycosides). This strain is classified as sequence type 11 (ST11), and a core genome phylogenetic analysis indicated that strain GSU10-3 is closely related to KPC-2-positive Chinese clinical isolates from 2011 to 2017 and is clearly distinct from strains isolated from the European Union (EU), United States, and other Asian countries. Strain GSU10-3 harbors four plasmids, including a blaKPC-2-positive plasmid, pGSU10-3-3 (66.2 kb), which is smaller than other blaKPC-2-positive plasmids and notably carries dual replicons (IncFII [pHN7A8] and IncN). Such downsizing and the presence of dual replicons may promote its maintenance and stable replication, contributing to its broad host range with low fitness costs. A second plasmid, pGSU10-3-1 (159.0 kb), an IncA/C2 replicon, carries a class 1 integron (containing intI1, dfrA12, aadA2, qacEΔ1, and sul1) with a high degree of similarity to a broad-host-range plasmid present in the family Enterobacteriaceae. The plasmid pGSU10-3-2 (134.8 kb), an IncFII(K) replicon, carries the IS26-mediated ARGs [aac(6′)Ib-cr, blaOXA-1, catB4 (truncated), and aac(3)-IId], tet(A), and a copper/arsenate resistance locus. GSU10-3 is the first nonclinical KPC-2-producing environmental Enterobacteriaceae isolate from Japan for which the whole genome has been sequenced. IMPORTANCE: We isolated and determined the complete genome sequence of a KPC-2-producing K. pneumoniae strain from a sampling site in Tokyo Bay, Japan, near a wastewater treatment plant (WWTP). In Japan, the KPC type has been very rarely detected, while IMP is the most predominant type of carbapenemase in clinical carbapenemase-producing Enterobacteriaceae (CPE) isolates. Although laboratory testing thus far suggested that Japan may be virtually free of KPC-producing Enterobacteriaceae, we have detected it in effluent from a WWTP. Antimicrobial resistance (AMR) monitoring of WWTP effluent may contribute to the early detection of future AMR bacterial dissemination in clinical settings and communities; indeed, it will help illuminate the whole picture in which environmental contamination through WWTP effluent plays a part. Copyright © 2018 Sekizuka et al.


September 22, 2019

The genome of tapeworm Taenia multiceps sheds light on understanding parasitic mechanism and control of coenurosis disease.

Coenurosis, caused by the larval coenurus of the tapeworm Taenia multiceps, is a fatal central nervous system disease in both sheep and humans. Though treatment and prevention options are available, the control of coenurosis still presents great challenges. Here, we present a high-quality genome sequence of T. multiceps in which 240 Mb (96%) of the genome has been successfully assembled using PacBio single-molecule real-time (SMRT) and Hi-C data with an N50 length of 44.8 Mb. In total, 49.5 Mb (20.6%) of repeat sequences and 13,013 gene models were identified. We found that Taenia spp. have undergone an expansion of transposable elements and recent small-scale gene duplications following the divergence of Taenia from Echinococcus, which are not seen in Echinococcus genomes, and that the genes underlying environmental adaptability and dosage effect tend to be over-retained in the T. multiceps genome. Moreover, we identified several genes encoding proteins involved in proglottid formation and interactions with the host central nervous system, which may contribute to the adaption of T. multiceps to its parasitic lifestyle. Our study not only provides insights into the biology and evolution of T. multiceps, but also identifies a set of species-specific gene targets for developing novel treatment and control tools for coenurosis.


September 22, 2019

Recovery of novel association loci in Arabidopsis thaliana and Drosophila melanogaster through leveraging INDELs association and integrated burden test.

Short insertions and deletions (INDELs) and larger structural variants have been increasingly employed in genetic association studies, but few improvements over SNP-based association have been reported. In order to understand why this might be the case, we analysed two publicly available datasets and observed that 63% of INDELs called in A. thaliana and 64% in D. melanogaster populations are misrepresented as multiple alleles with different functional annotations, i.e. the same underlying variant is represented by inconsistent alignments, leading to different variant calls. To address this issue, we have developed the software Irisas to reclassify and re-annotate these variants, which we then used for single-locus tests of association. We also integrated them to predict the functional impact of SNPs, INDELs, and structural variants for burden testing. Using both approaches, we re-analysed the genetic architecture of complex traits in A. thaliana and D. melanogaster. Heritability analysis using SNPs alone explained on average 27% and 19% of phenotypic variance for A. thaliana and D. melanogaster, respectively. Our method explained an additional 11% and 3%, respectively. We also identified novel trait loci that previous SNP-based association studies failed to map, and which contain established candidate genes. Our study shows the value of the association test with INDELs and of integrating multiple types of variants in association studies in plants and animals.
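The representation problem described in this abstract can be illustrated with a toy case: inside a repeat, a single deletion can be placed at several coordinates that all yield the same haplotype, so different alignments may report it as different variants. The sketch below is not the Irisas algorithm; it is a minimal Python illustration, assuming a made-up reference string and hypothetical helper names (`apply_deletion`, `equivalent_deletions`), of how equivalent placements can be enumerated and collapsed to one convention (here, the leftmost position).

```python
def apply_deletion(ref, start, length):
    """Return ref with `length` bases removed starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]


def equivalent_deletions(ref, start, length):
    """List every 0-based start at which deleting `length` bases
    produces the same sequence as the deletion at `start`."""
    target = apply_deletion(ref, start, length)
    return [
        pos
        for pos in range(len(ref) - length + 1)
        if apply_deletion(ref, pos, length) == target
    ]


# Hypothetical reference containing a short CA repeat.
ref = "GCACACAT"

# A 2-bp deletion reported at position 3 by one alignment...
placements = equivalent_deletions(ref, 3, 2)
print(placements)       # all positions describing the same haplotype
print(min(placements))  # leftmost placement, one common convention
```

Grouping calls by the haplotype they produce, rather than by their reported coordinates, is one simple way to avoid counting the same variant as several alleles before association testing.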


September 22, 2019

Complete genome sequence and characterization of linezolid-resistant Enterococcus faecalis clinical isolate KUB3006 carrying a cfr(B)-transposon on its chromosome and optrA-plasmid.

Linezolid (LZD) has become one of the most important antimicrobial agents for infections caused by gram-positive bacteria, including those caused by Enterococcus species. LZD-resistant (LR) genetic features include mutations in 23S rRNA/ribosomal proteins, a plasmid-borne 23S rRNA methyltransferase gene cfr, and ribosomal protection genes (optrA and poxtA). Recently, a cfr gene variant, cfr(B), was identified in a Tn6218-like transposon (Tn) in a Clostridioides difficile isolate. Here, we isolated an LR Enterococcus faecalis clinical isolate, KUB3006, from a urine specimen of a patient with urinary tract infection during hospitalization in 2017. Comparative and whole-genome analyses were performed to characterize the genetic features and overall antimicrobial resistance genes in E. faecalis isolate KUB3006. Complete genome sequencing of KUB3006 revealed that it carried cfr(B) on a chromosomal Tn6218-like element. Surprisingly, this Tn6218-like element was almost (99%) identical to that of C. difficile Ox3196, which was isolated from a human in the UK in 2012, and to that of Enterococcus faecium 5_Efcm_HA-NL, which was isolated from a human in the Netherlands in 2012. An additional oxazolidinone and phenicol resistance gene, optrA, was also identified on a plasmid. KUB3006 is sequence type (ST) 729, suggesting that it is a minor ST that has not been reported previously and is unlikely to be a high-risk E. faecalis lineage. In summary, LR E. faecalis KUB3006 possesses a notable Tn6218-like-borne cfr(B) and a plasmid-borne optrA. This finding raises further concerns regarding the potential declining effectiveness of LZD treatment in the future.


September 22, 2019

A complete Cannabis chromosome assembly and adaptive admixture for elevated cannabidiol (CBD) content

Cannabis has been cultivated for millennia with distinct cultivars providing either fiber and grain or tetrahydrocannabinol. Recent demand for cannabidiol rather than tetrahydrocannabinol has favored the breeding of admixed cultivars with extremely high cannabidiol content. Despite several draft Cannabis genomes, the genomic structure of cannabinoid synthase loci has remained elusive. A genetic map derived from a tetrahydrocannabinol/cannabidiol segregating population and a complete chromosome assembly from a high-cannabidiol cultivar together resolve the linkage of cannabidiolic and tetrahydrocannabinolic acid synthase gene clusters which are associated with transposable elements. High-cannabidiol cultivars appear to have been generated by integrating hemp-type cannabidiolic acid synthase gene clusters into a background of marijuana-type cannabis. Quantitative trait locus mapping suggests that overall drug potency, however, is associated with other genomic regions needing additional study.


September 22, 2019

FRI-4 carbapenemase-producing Enterobacter cloacae complex isolated in Tokyo, Japan.

A carbapenem-resistant Enterobacter cloacae complex isolated in Tokyo, Japan, produced a carbapenemase that was detected by a Carba NP test and a modified carbapenem inactivation method, but none of the ‘Big Five’ carbapenemase genes was detected by PCR. This study aimed to identify the carbapenemase. Carbapenemase genes were screened by WGS. Next, we generated a recombinant plasmid in which the carbapenemase gene was inserted. We also extracted the carbapenemase gene-carrying plasmid from the E. cloacae complex. The effects of both plasmids on the antibiotic susceptibility of Escherichia coli were then tested. The carbapenemase gene-carrying plasmid in the E. cloacae complex was completely sequenced. A novel carbapenemase gene, blaFRI-4, encoded an amino acid sequence that was 93.2% identical to French imipenemase (FRI-1). E. coli transformed with blaFRI-4 showed reduced carbapenem susceptibility. A complete sequence of the blaFRI-4-carrying 98 508 bp IncFII/IncR plasmid (pTMTA61661) showed that blaFRI-4 and the surrounding region (18.7 kb) were duplicated. The FRI-4-producing E. cloacae complex was isolated in Japan, whereas all other FRI variants have been found in Europe, suggesting that the spread of FRI carbapenemases is global.


September 22, 2019

Full gene HLA class I sequences of 79 novel and 519 mostly uncommon alleles from a large United States registry population.

HLA class I assignments were obtained at single genotype, G-level resolution from 98 855 volunteers for an unrelated donor registry in the United States. In spite of the diverse ancestry of the volunteers, over 99% of the assignments at each locus are common. Within this population, 52 novel alleles differing in exons 2 and 3 are identified and characterized. Previously reported alleles with incomplete sequences in the IPD-IMGT/HLA database (n = 519) were selected for full gene sequencing and, from this sampling, another 27 novel alleles are described. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.


September 22, 2019

Streptococcus suis contains multiple phase-variable methyltransferases that show a discrete lineage distribution.

Streptococcus suis is a major pathogen of swine, responsible for a number of chronic and acute infections, and is also emerging as a major zoonotic pathogen, particularly in South-East Asia. Our study of a diverse population of S. suis shows that this organism contains both Type I and Type III phase-variable methyltransferases. In all previous examples, phase-variation of methyltransferases results in genome-wide methylation differences and in differential regulation of multiple genes, a system known as the phasevarion (phase-variable regulon). We hypothesized that each variant in the Type I and Type III systems encoded a methyltransferase with a unique specificity, and could therefore control a distinct phasevarion, either by recombination-driven shuffling between different specificities (Type I) or by biphasic on-off switching via simple sequence repeats (Type III). Here, we present the identification of the target specificities for each Type III allelic variant from S. suis using single-molecule, real-time methylome analysis. We demonstrate that phase-variation is occurring in both Type I and Type III methyltransferases, and show a distinct association between the type and presence of methyltransferases and population clades. In addition, we show that the phase-variable Type I methyltransferase was likely acquired at the origin of a highly virulent zoonotic sub-population.

