As the cost of genome sequencing has decreased, the number of “genome” sequences has increased at a rapid pace. Unfortunately, the quality and completeness of these so-called “genome” sequences have suffered enormously. We prefer to call such genome assemblies “gene assembly space” (GAS). We believe it is important to distinguish GAS assemblies from reference genome assemblies (RGAs), as all subsequent research that depends on accurate genome assemblies can be highly compromised if the only assembly available is a GAS assembly.
Recent improvements in sequencing chemistry and instrument performance combine to create a new PacBio data type, Single Molecule High-Fidelity reads (HiFi reads). Increased read length and improvements in library construction enable average read lengths of 10-20 kb with average sequence identity greater than 99% from raw single molecule reads. The resulting reads have accuracy comparable to short-read NGS but with 50-100 times longer read length. Here we benchmark the performance of this data type by sequencing and genotyping the Genome in a Bottle (GIAB) HG002 human reference sample from the National Institute of Standards and Technology (NIST). We further demonstrate the general utility of HiFi reads by analyzing multiple clones of Cabernet Sauvignon. Three different clones were sequenced and de novo assembled with the CANU assembly algorithm, generating draft assemblies of very high contiguity equal to or better than earlier assembly efforts using PacBio long reads. Using the Cabernet Sauvignon Clone 8 assembly as a reference, we mapped the HiFi reads generated from Clone 6 and Clone 47 to identify single nucleotide polymorphisms (SNPs) and structural variants (SVs) that are specific to each of the three samples.
Metagenomic analysis of type II diabetes gut microbiota using PacBio HiFi reads reveals taxonomic and functional differences
In the past decade, the human microbiome has been increasingly shown to play a major role in health. For example, imbalances in gut microbiota appear to be associated with Type II diabetes mellitus (T2DM) and cardiovascular disease. Coronary artery disease (CAD) is a major determinant of the long-term prognosis among T2DM patients, with a 2- to 4-fold increased mortality risk when present. However, the exact microbial strains or functions implicated in disease need further investigation. From a large study with 523 participants (185 healthy controls, 186 T2DM patients without CAD, and 106 T2DM patients with CAD), 3 samples from each patient group were selected for long-read sequencing. Each sample was prepared and sequenced on one Sequel II System SMRT Cell to assess whether long, accurate PacBio HiFi reads could yield insights beyond those obtained with short reads. Each of the 9 samples was subjected to metagenomic assembly and binning, taxonomic classification and functional profiling. Results from metagenomic assembly and binning show that it is possible to generate a significant number of complete MAGs (Metagenome Assembled Genomes) from each sample, with over half of the high-quality MAGs being represented by a single circular contig. We show that differences found in taxonomic and functional profiles of healthy versus diabetic patients in the small 9-sample study align with the results of the larger study, as well as with results reported in the literature. For example, the abundances of beneficial short-chain fatty acid (SCFA) producers such as Phascolarctobacterium faecium and Faecalibacterium prausnitzii were decreased in T2DM gut microbiota in both studies, while the abundances of quinol and quinone biosynthesis pathways were increased as compared to healthy controls. In conclusion, metagenomic analysis of long, accurate HiFi reads revealed important taxonomic and functional differences in T2DM versus healthy gut microbiota.
Furthermore, metagenome assembly of long HiFi reads led to the recovery of many complete MAGs and a significant number of complete circular bacterial chromosome sequences.
Grant Cramer from the University of Nevada, Reno, and Dario Cantu from the University of California, Davis, discuss past challenges with sequencing Clone 8 of Cabernet Sauvignon (Vitis vinifera). An…
In this presentation, Sonja Vernes of the Max Planck Institute shares her work with the Bat1K project, which aims to catalog the genetic diversity of all living bat species. She…
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Domestication of clonally propagated crops such as pineapple from South America was hypothesized to be a ‘one-step operation’. We sequenced the genome of Ananas comosus var. bracteatus CB5 and assembled 513 Mb into 25 chromosomes with 29,412 genes. Comparison of the genomes of CB5, F153 and MD2 elucidated the genomic basis of fiber production, color formation, sugar accumulation and fruit maturation. We also resequenced 89 Ananas genomes. Cultivars ‘Smooth Cayenne’ and ‘Queen’ exhibited ancient and recent admixture, while ‘Singapore Spanish’ supported a one-step operation of domestication. We identified 25 selective sweeps, including a strong sweep containing a pair of tandemly duplicated bromelain inhibitors. Four candidate genes for self-incompatibility were linked in F153, but were not functional in self-compatible CB5. Our findings support the coexistence of sexual recombination and a one-step operation in the domestication of clonally propagated crops. This work guides the exploration of sexual and asexual domestication trajectories in other clonally propagated crops.
Identification and characterization of chicken circovirus from commercial broiler chickens in China.
Circoviruses are found in many species, including mammals, birds, lower vertebrates and invertebrates. To date, there are no reports of circovirus-induced diseases in chickens. In this study, we identified a new strain of chicken circovirus (CCV) by PacBio third-generation sequencing of samples from chickens with acute gastroenteritis in a Shandong commercial broiler farm in China. The complete genome of CCV was verified by inverse PCR. Genomic analysis revealed that CCV encodes two inverse open reading frames (ORFs), and a potential stem-loop structure was present at the 5′ end with a structure typical of a circular virus. Phylogenetic tree analysis showed that CCV formed an independent branch between mammalian and avian circoviruses, and homology analysis indicated that the homology of CCV with 21 other known circoviruses was less than 40%. Thus, this CCV strain represents a new species in the genus Circovirus. The infection rate of CCV in 12 chickens with diarrhoea was 100%, but no CCV was found in healthy chickens, thereby indicating that the novel CCV strain is highly associated with acute infectious gastroenteritis in chickens. The emergence of a novel CCV in commercial broiler chickens is highly concerning for the broiler industry. © 2019 Blackwell Verlag GmbH.
China is the origin and evolutionary centre of Oriental pears. Pyrus betuleafolia is a wild species native to China and distributed in the northern region, and it is widely used as rootstock. Here, we report the de novo assembly of the genome of P. betuleafolia-Shanxi Duli using an integrated strategy that combines PacBio sequencing, BioNano mapping and chromosome conformation capture (Hi-C) sequencing. The genome assembly size was 532.7 Mb, with a contig N50 of 1.57 Mb. A total of 59 552 protein-coding genes and 247.4 Mb of repetitive sequences were annotated for this genome. The expansion genes in P. betuleafolia were significantly enriched in secondary metabolism, which may account for the organism’s considerable environmental adaptability. An alignment analysis of orthologous genes showed that fruit size, sugar metabolism and transport, and photosynthetic efficiency were positively selected in Oriental pear during domestication. A total of 573 nucleotide-binding site (NBS)-type resistance gene analogues (RGAs) were identified in the P. betuleafolia genome, 150 of which are TIR-NBS-LRR (TNL)-type genes, which represented the greatest number of TNL-type genes among the published Rosaceae genomes and explained the strong disease resistance of this wild species. The study of flavour metabolism-related genes showed that the anthocyanidin reductase (ANR) metabolic pathway affected the astringency of pear fruit and that sorbitol transporter (SOT) transmembrane transport may be the main factor affecting the accumulation of soluble organic matter. This high-quality P. betuleafolia genome provides a valuable resource for the utilization of wild pear in fundamental pear studies and breeding. © 2019 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Genome-Wide Association Study of Growth and Body-Shape-Related Traits in Large Yellow Croaker (Larimichthys crocea) Using ddRAD Sequencing.
Large yellow croaker (Larimichthys crocea) is an economically important marine fish species of China. Due to overfishing and marine pollution, the wild stocks of this croaker have collapsed in the past decades. Meanwhile, the cultured croaker is facing the difficulties of reduced genetic diversity and low growth rate. To explore molecular markers related to the growth traits of croaker and to provide SNPs for marker-assisted selection, we used double-digest restriction-site associated DNA (ddRAD) sequencing to dissect the genetic bases of growth traits in a cultured population and to identify SNPs associated with important growth traits by GWAS. A total of 220 individuals were genotyped by ddRAD sequencing. After quality control, 27,227 SNPs were identified in 220 samples and used for GWAS analysis. We identified 13 genome-wide significant SNPs associated with growth traits on 8 chromosomes, and the beta P of these SNPs ranged from 0.01 to 0.86. Through the definition of candidate regions and gene annotation, candidate genes related to growth were identified, including important regulators such as fgf18, fgf1, nr3c1, cyp8b1, fabp2, cyp2r1, ppara, and ccm2l. We also identified SNPs and candidate genes that were significantly associated with body shape, including bmp7, col1a1, col11a2, and col18a1, which are also economically important traits for large yellow croaker aquaculture. The results provided insights into the genetic basis of growth and body shape in the large yellow croaker population and would provide reliable genetic markers for molecular marker-assisted selection in the future. Meanwhile, the results established a basis for our subsequent fine mapping and related gene study.
Chromosome-level reference genome of X12, a highly virulent race of the soybean cyst nematode Heterodera glycines.
Soybean cyst nematode (SCN, Heterodera glycines) is a major pest of soybean that is spreading across major soybean production regions worldwide. Increased SCN virulence has recently been observed in both the United States and China. However, no study has reported a genome assembly for H. glycines at the chromosome scale. Herein, the first chromosome-level reference genome of X12, an unusual SCN race with high infection ability, is presented. Using whole-genome shotgun (WGS) sequencing, PacBio sequencing, Illumina paired-end sequencing, 10X Genomics linked reads and high-throughput chromatin conformation capture (Hi-C) genome scaffolding techniques, a 141.01-Mb assembled genome was obtained with scaffold and contig N50 sizes of 16.27 Mb and 330.54 kb, respectively. The assembly showed high integrity and quality, with over 90% of Illumina reads mapped to the genome. The assembly quality was evaluated using Core Eukaryotic Genes Mapping Approach (CEGMA) and Benchmarking Universal Single-Copy Orthologs (BUSCO). A total of 11,882 genes were predicted using De novo, Homolog and RNAseq data generated from eggs, second-stage juveniles (J2), third-stage juveniles (J3) and fourth-stage juveniles (J4) of X12, and 79.0% of homologous sequences were annotated in the genome. These high-quality X12 genome data will provide valuable resources for research in a broad range of areas, including fundamental nematode biology, SCN-plant interactions and coevolution, and also contribute to the development of technology for overall SCN management. This article is protected by copyright. All rights reserved.
Genome assembly provides insights into the genome evolution and flowering regulation of orchardgrass.
Orchardgrass (Dactylis glomerata L.) is an important forage grass for cultivating livestock worldwide. Here, we report an ~1.84-Gb chromosome-scale diploid genome assembly of orchardgrass, with a contig N50 of 0.93 Mb, a scaffold N50 of 6.08 Mb and a super-scaffold N50 of 252.52 Mb, which is the first chromosome-scale assembled genome of a cool-season forage grass. The genome includes 40 088 protein-coding genes, and 69% of the assembled sequences are transposable elements, with long terminal repeats (LTRs) being the most abundant. The LTR retrotransposons may have been activated and expanded in the grass genome in response to environmental changes during the Pleistocene between 0 and 1 million years ago. Phylogenetic analysis reveals that orchardgrass diverged after rice but before three Triticeae species, and evolutionarily conserved chromosomes were detected by analysing ancient chromosome rearrangements in these grass species. We also resequenced the whole genome of 76 orchardgrass accessions and found that germplasm from Northern Europe and East Asia clustered together, likely due to the exchange of plants along the ‘Silk Road’ or other ancient trade routes connecting the East and West. Last, a combined transcriptome, quantitative genetic and bulk segregant analysis provided insights into the genetic network regulating flowering time in orchardgrass and revealed four main candidate genes controlling this trait. This chromosome-scale genome and the online database of orchardgrass developed here will facilitate the discovery of genes controlling agronomically important traits, stimulate genetic improvement of and functional genetic research on orchardgrass and provide comparative genetic resources for other forage grasses. © 2019 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
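For readers unfamiliar with the statistic, the contig, scaffold and super-scaffold N50 values quoted in the abstract above are the same measure applied to different sets of sequences: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal illustrative sketch, not code from any of the cited studies; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length; the
    length of the sequence that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when no lengths were supplied.
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical contig lengths, arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, scaffolding with Hi-C or optical maps can raise it by orders of magnitude without adding any new sequence, which is why the three values reported for one assembly can differ so widely.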
Insights into transcriptional characteristics and homoeolog expression bias of embryo and de-embryonated kernels in developing grain through RNA-Seq and Iso-Seq.
Bread wheat (Triticum aestivum L.) is an allohexaploid, and the transcriptional characteristics of the wheat embryo and endosperm during grain development remain unclear. To analyze the transcriptome, we performed isoform sequencing (Iso-Seq) for wheat grain and RNA sequencing (RNA-Seq) for the embryo and de-embryonated kernels. The differential regulation between the embryo and de-embryonated kernels was found to be greater than the difference between the two time points for each tissue. Exactly 2264 and 4790 tissue-specific genes were found at 14 days post-anthesis (DPA), while 5166 and 3784 genes were found at 25 DPA in the embryo and de-embryonated kernels, respectively. Genes expressed in the embryo were more likely to be related to nucleic acid and enzyme regulation. In de-embryonated kernels, genes were rich in substance metabolism and enzyme activity functions. Moreover, 4351, 4641, 4516, and 4453 genes with the A, B, and D homoeoloci were detected for each of the four tissues. Expression characteristics suggested that the D genome may be the largest contributor to the transcriptome in developing grain. Among these, 48, 66, and 38 silenced genes emerged in the A, B, and D genomes, respectively. Gene ontology analysis showed that silenced genes could be inclined to different functions in different genomes. Our study provided specific gene pools of the embryo and de-embryonated kernels and a homoeolog expression bias model on a large scale. This is helpful for providing new insights into the molecular physiology of wheat.
Complete genome sequence of Paracoccus sp. Arc7-R13, a silver nanoparticles synthesizing bacterium isolated from Arctic Ocean sediments
Paracoccus sp. Arc7-R13, a silver nanoparticles (AgNPs) synthesizing bacterium, was isolated from Arctic Ocean sediment. Here we describe the complete genome of Paracoccus sp. Arc7-R13. The complete genome contains 4,040,012 bp with 66.66 mol% G + C content, including one circular chromosome of 3,231,929 bp (67.45 mol% G + C content), and eight plasmids with length ranging from 24,536 bp to 199,685 bp. The genome contains 3835 protein-coding genes (CDSs), 49 tRNA genes, as well as 3 rRNA operons as 16S-23S-5S rRNA. Based on the gene annotation and Swiss-Prot analysis, a total of 15 genes belonging to 11 kinds, including silver exporting P-type ATPase (SilP), alkaline phosphatase, nitroreductase, thioredoxin reductase, NADPH dehydrogenase and glutathione peroxidase, might be related to the synthesis of AgNPs. Meanwhile, many additional genes associated with synthesis of AgNPs such as protein-disulfide isomerase, c-type cytochrome, glutathione synthase and dehydrogenase reductase were also identified.
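As an aside on the G + C figures in the abstract above: a whole-genome G + C percentage is the length-weighted mean of the per-replicon values, which is why the genome-wide figure can differ from the chromosome figure once plasmids are included. The sketch below is illustrative only; the function names `gc_percent` and `combined_gc_percent` and the toy inputs are assumptions for the example, not taken from the study.

```python
def gc_percent(sequence):
    """Return G + C content as a percentage of unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / acgt


def combined_gc_percent(replicons):
    """Length-weighted G + C percentage over (length_bp, gc_percent) pairs."""
    total_length = sum(length for length, _ in replicons)
    if total_length == 0:
        raise ValueError("total length must be positive")
    return sum(length * gc for length, gc in replicons) / total_length


# Toy example with hypothetical replicons: (length in bp, G + C percent).
print(gc_percent("GGCCAATT"))
print(combined_gc_percent([(100, 60.0), (300, 40.0)]))
```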
Rapid evolution of α-gliadin gene family revealed by analyzing Gli-2 locus regions of wild emmer wheat.
α-Gliadins are a major group of gluten proteins in wheat flour that contribute to the end-use properties for food processing and contain major immunogenic epitopes that can cause serious health-related issues including celiac disease (CD). α-Gliadins are also the youngest group of gluten proteins and are encoded by a large gene family. The majority of the gene family members evolved independently in the A, B, and D genomes of different wheat species after their separation from a common ancestral species. To gain insights into the origin and evolution of these complex genes, the genomic regions of the Gli-2 loci encoding α-gliadins were characterized from the tetraploid wild emmer, a progenitor of hexaploid bread wheat that contributed the AABB genomes. Genomic sequences of Gli-2 locus regions for the wild emmer A and B genomes were first reconstructed using the genome sequence scaffolds along with optical genome maps. A total of 24 and 16 α-gliadin genes were identified for the A and B genome regions, respectively. α-Gliadin pseudogene frequencies of 86% for the A genome and 69% for the B genome were primarily caused by C to T substitutions in the highly abundant glutamine codons, resulting in the generation of premature stop codons. Comparison with the homologous regions from the hexaploid wheat cv. Chinese Spring indicated considerable sequence divergence of the two A genomes at the genomic level. In comparison, conserved regions between the two B genomes were identified that included α-gliadin pseudogenes containing shared nested TE insertions. Analyses of the genomic organization and phylogenetic tree reconstruction indicate that although orthologous gene pairs derived from speciation were present, large portions of α-gliadin genes were likely derived from differential gene duplications or deletions after the separation of the homologous wheat genomes ~0.5 MYA.
The higher number of full-length intact α-gliadin genes in hexaploid wheat than that in wild emmer suggests that human selection through domestication might have an impact on α-gliadin evolution. Our study provides insights into the rapid and dynamic evolution of genomic regions harboring the α-gliadin genes in wheat.
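The mechanism described in the abstract above, C to T substitutions in glutamine codons creating premature stop codons, follows directly from the standard genetic code: both glutamine codons (CAA and CAG) begin with C, and replacing that C with T gives TAA or TAG. The short sketch below illustrates this under the assumption of the standard code and coding-strand sequences; the function names and the toy coding sequences are hypothetical and are not from the study.

```python
# Stop and glutamine codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}
GLUTAMINE_CODONS = ("CAA", "CAG")


def c_to_t_variants(codon):
    """Return every codon reachable by changing a single C to T."""
    return [
        codon[:i] + "T" + codon[i + 1:]
        for i, base in enumerate(codon)
        if base == "C"
    ]


def first_premature_stop(cds):
    """Return the 0-based index of the first in-frame stop codon that
    occurs before the final codon, or None if there is none."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            return index
    return None


# Each single C-to-T change in a glutamine codon yields a stop codon.
for codon in GLUTAMINE_CODONS:
    print(codon, "->", c_to_t_variants(codon))

# Hypothetical toy coding sequences: intact, and with one C-to-T change.
print(first_premature_stop("ATGCAACAGTAA"))
print(first_premature_stop("ATGTAACAGTAA"))
```

Since a single transition is enough to truncate the protein, a glutamine-rich gene family offers many sites at which one substitution produces a pseudogene, which is consistent with the high pseudogene frequencies the abstract reports.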