Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93?Gb (contig N50: 8.3?Mb, scaffold N50: 22.0?Mb, including 39.3?Mb N-bases), together with 206?Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8?Mb of HX1-specific sequences, including 4.1?Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
Genome mining for fungal polyketide-diterpenoid hybrids: discovery of key terpene cyclases and multifunctional P450s for structural diversification
A biosynthetic gene cluster for chevalone E (1) and its oxidized derivatives have been identified within the genome of the endophytic fungus Aspergillus versicolor 0312, by a mining strategy targeting a polyke- tide-diterpenoid hybrid molecule. The biosynthetic pathway has been successfully reconstituted in the heterologous fungus Aspergillus oryzae. Interestingly, two P450 monooxygenases, Cle2 and Cle4, were found to transform 1 into seven new analogues including 7 and 8 that possess a unique five-membered lactone ring. Furthermore, the replacement of the terpene cyclase gene with that from another fungus led to the production of sartorypyrone D (11), which has a monocyclic terpenoid moiety. Finally, some of the compounds obtained in this study synergistically enhanced the cytotoxicity of doxorubicin (DOX) in breast cancer cells.
Pineapple (Ananas comosus (L.) Merr.) is the most economically valuable crop possessing crassulacean acid metabolism (CAM), a photosynthetic carbon assimilation pathway with high water-use efficiency, and the second most important tropical fruit. We sequenced the genomes of pineapple varieties F153 and MD2 and a wild pineapple relative, Ananas bracteatus accession CB5. The pineapple genome has one fewer ancient whole-genome duplication event than sequenced grass genomes and a conserved karyotype with seven chromosomes from before the ? duplication event. The pineapple lineage has transitioned from C3 photosynthesis to CAM, with CAM-related genes exhibiting a diel expression pattern in photosynthetic tissues. CAM pathway genes were enriched with cis-regulatory elements associated with the regulation of circadian clock genes, providing the first cis-regulatory link between CAM and circadian clock regulation. Pineapple CAM photosynthesis evolved by the reconfiguration of pathways in C3 plants, through the regulatory neofunctionalization of preexisting genes and not through the acquisition of neofunctionalized genes via whole-genome or tandem gene duplication.
Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The approach provides a path forward for the routine assembly of mammalian genomes at a level approaching that of the current quality of the human genome. Copyright © 2016, American Association for the Advancement of Science.
Deciphering the genetic basis of human disease requires a comprehensive knowledge of genetic variants irrespective of their class or frequency. Although an impressive number of human genetic variants have been catalogued, a large fraction of the genetic difference that distinguishes two human genomes is still not understood at the base-pair level. This is because the emphasis has been on single-nucleotide variation as opposed to less tractable and more complex genetic variants, including indels and structural variants. The latter, we propose, will have a large impact on human phenotypes but require a more systematic assessment of genomes at deeper coverage and alternate sequencing and mapping technologies. Copyright © 2016 by the Genetics Society of America.
Large deletions at the SHOX locus in the pseudoautosomal region are associated with skeletal atavism in Shetland ponies.
Skeletal atavism in Shetland ponies is a heritable disorder characterized by abnormal growth of the ulna and fibula that extend the carpal and tarsal joints, respectively. This causes abnormal skeletal structure, impaired movements, and affected foals are usually euthanized. In order to identify the causal mutation we subjected six confirmed Swedish cases and a DNA pool consisting of 21 control individuals to whole genome resequencing. We screened for polymorphisms where the cases and the control pool were fixed for opposite alleles and observed this signature for only 25 SNPs, most of which were scattered on genome assembly unassigned scaffolds. Read depth analysis at these loci revealed homozygosity or compound heterozygosity for two partially overlapping large deletions in the pseudoautosomal region (PAR) of chromosome X/Y in cases but not in the control pool. One of these deletions removes the entire coding region of the SHOX gene and both deletions remove parts of the CRLF2 gene located downstream of SHOX. The horse reference assembly of the PAR is highly fragmented, and in order to characterize this region we sequenced bacterial artificial chromosome (BAC) clones by single-molecule real-time (SMRT) sequencing technology. This considerably improved the assembly and enabled size estimations of the two deletions to 160-180 kb and 60-80 kb, respectively. Complete association between the presence of these deletions and disease status was verified in eight other affected horses. The result of the present study is consistent with previous studies in humans showing crucial importance of SHOX for normal skeletal development. Copyright © 2016 Author et al.
De novo assembly of human genomes is now a tractable effort due in part to advances in sequencing and mapping technologies. We use PacBio single-molecule, real-time (SMRT) sequencing and BioNano genomic maps to construct the first de novo assembly of NA19240, a Yoruban individual from Africa. This chromosome-scaffolded assembly of 3.08 Gb with a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb represents one of the most contiguous high-quality human genomes. We utilize a BAC library derived from NA19240 DNA and novel haplotype-resolving sequencing technologies and algorithms to characterize regions of complex genomic architecture that are normally lost due to compression to a linear haploid assembly. Our results demonstrate that multiple technologies are still necessary for complete genomic representation, particularly in regions of highly identical segmental duplications. Additionally, we show that diploid assembly has utility in improving the quality of de novo human genome assemblies.
The genome of Dendrobium officinale illuminates the biology of the important traditional Chinese orchid herb.
Dendrobium officinale Kimura et Migo is a traditional Chinese orchid herb that has both ornamental value and a broad range of therapeutic effects. Here, we report the first de novo assembled 1.35 Gb genome sequences for D. officinale by combining the second-generation Illumina Hiseq 2000 and third-generation PacBio sequencing technologies. We found that orchids have a complete inflorescence gene set and have some specific inflorescence genes. We observed gene expansion in gene families related to fungus symbiosis and drought resistance. We analyzed biosynthesis pathways of medicinal components of D. officinale and found extensive duplication of SPS and SuSy genes, which are related to polysaccharide generation, and that the pathway of D. officinale alkaloid synthesis could be extended to generate 16-epivellosimine. The D. officinale genome assembly demonstrates a new approach to deciphering large complex genomes and, as an important orchid species and a traditional Chinese medicine, the D. officinale genome will facilitate future research on the evolution of orchid plants, as well as the study of medicinal components and potential genetic breeding of the dendrobe. Copyright © 2015 The Author. Published by Elsevier Inc. All rights reserved.
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Danshen (Salvia miltiorrhiza Bunge), also known as Chinese red sage, is a member of Lamiaceae family. It is valued in traditional Chinese medicine, primarily for the treatment of cardiovascular and cerebrovascular diseases. Because of its pharmacological potential, ongoing research aims to identify novel bioactive compounds in danshen, and their biosynthetic pathways. To date, only expressed sequence tag (EST) and RNA-seq data for this herbal plant are available to the public. We therefore propose that the construction of a reference genome for danshen will help elucidate the biosynthetic pathways of important secondary metabolites, thereby advancing the investigation of novel drugs from this plant.
Segmental duplications contribute to human evolution, adaptation and genomic instability but are often poorly characterized. We investigate the evolution, genetic variation and coding potential of human-specific segmental duplications (HSDs). We identify 218 HSDs based on analysis of 322 deeply sequenced archaic and contemporary hominid genomes. We sequence 550 human and nonhuman primate genomic clones to reconstruct the evolution of the largest, most complex regions with protein-coding potential (N?=?80 genes from 33 gene families). We show that HSDs are non-randomly organized, associate preferentially with ancestral ape duplications termed ‘core duplicons’ and evolved primarily in an interspersed inverted orientation. In addition to Homo sapiens-specific gene expansions (such as TCAF1/TCAF2), we highlight ten gene families (for example, ARHGAP11B and SRGAP2C) where copy number never returns to the ancestral state, there is evidence of mRNA splicing and no common gene-disruptive mutations are observed in the general population. Such duplicates are candidates for the evolution of human-specific adaptive traits.
Most evolutionary new centromeres (ENC) are composed of large arrays of satellite DNA and surrounded by segmental duplications. However, the hypothesis is that ENCs are seeded in an anonymous sequence and only over time have acquired the complexity of “normal” centromeres. Up to now evidence to test this hypothesis was lacking. We recently discovered that the well-known polymorphism of orangutan chromosome 12 was due to the presence of an ENC. We sequenced the genome of an orangutan homozygous for the ENC, and we focused our analysis on the comparison of the ENC domain with respect to its wild type counterpart. No significant variations were found. This finding is the first clear evidence that ENC seedings are epigenetic in nature. The compaction of the ENC domain was found significantly higher than the corresponding WT region and, interestingly, the expression of the only gene embedded in the region was significantly repressed.
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.
The human reference genome assembly plays a central role in nearly all aspects of today’s basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health. © 2017 Schneider et al.; Published by Cold Spring Harbor Laboratory Press.
The plants in the Erigeron genus of the Compositae (Asteraceae) family are commonly called fleabanes, possibly due to the belief that certain chemicals in these plants repel fleas. In the traditional Chinese medicine, Erigeron breviscapus , which is native to China, was widely used in the treatment of cerebrovascular disease. A handful of bioactive compounds, including scutellarin, 3,5-dicaffeoylquinic acid, and 3,4-dicaffeoylquinic acid, have been isolated from the plant. With the purpose of finding novel medicinal compounds and understanding their biosynthetic pathways, we propose to sequence the genome of E. breviscapus . We assembled the highly heterozygous E. breviscapus genome using a combination of PacBio single-molecular real-time sequencing and next-generation sequencing methods on the Illumina HiSeq platform. The final draft genome is approximately 1.2 Gb, with contig and scaffold N50 sizes of 18.8 kb and 31.5 kb, respectively. Further analyses predicted 37 504 protein-coding genes in the E. breviscapus genome and 8172 shared gene families among Compositae species. The E. breviscapus genome provides a valuable resource for the investigation of novel bioactive compounds in this Chinese herb.
The complete genome sequence of Streptomyces autolyticus CGMCC 0516, the producer of geldanamycin, autolytimycin, reblastatin and elaiophylin.
Streptomyces autolyticus CGMCC 0516 produces the anti-tumor benzoquinone ansamycins geldanamycin, autolytimycin, and reblastatin and the 16-membered macrodiolide elaiophylin. Here, we report the complete genome sequence of S. autolyticus CGMCC 0516, which consists of a 10,029,028bp linear chromosome and seven circular plasmids. Fifty-seven putative biosynthetic gene clusters for secondary metabolites were found. The geldanamycin, autolytimycin, and reblastatin biosynthetic gene clusters were located on the left arm (2.06-2.15Mb) of the chromosome, and the elaiophylin gene cluster was located on the right arm (9.45-9.53Mb). Twenty-one putative gene clusters with high or moderate similarity to important antibiotic biosynthetic gene clusters were found, including the antitumor agents echoside, bafilomycin, hygrocin, and toxoflavin; the antibacterial/antifungal agents nigericin, skyllamycin, kanamycin, naphthomycin, eco-02301, and bottromycin A2; the immunosuppressants meridamycin and brasilicardin A; the anti-inflammatory agent cyclooctatin; and the acute iron poisoning medication desferrioxamine B. The genome sequence reported here will enable us to study the biosynthetic mechanism of these important antibiotics and will facilitate the discovery of novel secondary metabolites with potential applications to human health. Copyright © 2017 Elsevier B.V. All rights reserved.