AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technolgy.
Rapid full-length Iso-Seq cDNA sequencing of rice mRNA to facilitate annotation and identify splice-site variation.
PacBio’s new Iso-Seq technology allows for rapid generation of full-length cDNA sequences without the need for assembly steps. The technology was tested on leaf mRNA from two model O. sativa ssp. indica cultivars – Minghui 63 and Zhenshan 97. Even though each transcriptome was not exhaustively sequenced, several thousand isoforms described genes over a wide size range, most of which are not present in any currently available FL cDNA collection. In addition, the lack of an assembly requirement provides direct and immediate access to complete mRNA sequences and rapid unraveling of biological novelties.
As the costs for genome sequencing have decreased the number of “genome” sequences have increased at a rapid pace. Unfortunately, the quality and completeness of these so–called “genome” sequences have suffered enormously. We prefer to call such genome assemblies as “gene assembly space” (GAS). We believe it is important to distinguish GAS assemblies from reference genome assemblies (RGAs) as all subsequent research that depends on accurate genome assemblies can be highly compromised if the only assembly available is a GAS assembly.
Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly have dropped precipitately, which has spurred interest in genome sequencing overall. Unfortunately the contiguity of the NGS assembled sequences, as well as the accuracy of these assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. When compared to the reference quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove to be difficult to properly annotate. In some cases the contiguous portions are smaller than the average gene size making the sequence not nearly as useful for biologists as the earlier reference quality genomes including of Human, Mouse, C. elegans, or Drosophila. Recently, new 3rd generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.
Maize is an amazingly diverse crop. A study in 20051 demonstrated that half of the genome sequence and one-third of the gene content between two inbred lines of maize were not shared. This diversity, which is more than two orders of magnitude larger than the diversity found between humans and chimpanzees, highlights the inability of a single reference genome to represent the full pan-genome of maize and all its variants. Here we present and review several efforts to characterize the complete diversity within maize using the highly accurate long reads of PacBio Single Molecule, Real-Time (SMRT) Sequencing. These methods provide a framework for a pan-genomic approach that can be applied to studies of a wide variety of important crop species.
Genomics studies have shown that the insertions, deletions, duplications, translocations, inversions, and tandem repeat expansions in the structural variant (SV) size range (>50 bp) contribute to the evolution of traits and often have significant associations with agronomically important phenotypes. However, most SVs are too small to detect with array comparative genomic hybridization and too large to reliably discover with short-read DNA sequencing. While de novo assembly is the most comprehensive way to identify variants in a genome, recent studies in human genomes show that PacBio SMRT Sequencing sensitively detects structural variants at low coverage. Here we present SV characterization in the major crop species Oryza sativa subsp. indica (rice) with low-fold coverage of long reads. In addition, we provide recommendations for sequencing and analysis for the application of this workflow to other important agricultural species.
HiFi reads (>99% accurate, 15-20 kb) from the PacBio Sequel II System consistently provide complete and contiguous genome assemblies. In addition to completeness and contiguity, accuracy is of critical importance, as assembly errors complicate downstream analysis, particularly by disrupting gene frames. Metrics used to assess assembly accuracy include: 1) in-frame gene count, 2) kmer consistency, and 3) concordance to a benchmark, where discordances are interpreted as assembly errors. Genome in a Bottle (GIAB) provides a benchmark for the human genome with estimated accuracy of 99.9999% (Q60). Concordance for human HiFi assemblies exceeds Q50, which provides excellent genomes for downstream analysis, but presents a challenge that any new benchmark must significantly exceed Q50 or the discordance will represent the error rate of the benchmark. To establish benchmarks for Oryza sativa and Drosophila melanogaster, we collected draft references, Illumina short reads, and PacBio HiFi reads. By species, the benchmark was defined as regions of normal coverage that are not within 5 bp of a small variant or 50 bp of a structural variant. For both species, the benchmark regions span around 60% of the genome and HiFi assemblies achieve Q50 accuracy, which is notably more accurate than assemblies with other technologies and meets typical standards for a finished, reference-grade assembly. Here we present a protocol to generate benchmarks for any sample that rival the GIAB benchmark in accuracy. These benchmarks allow the comparison and improvement of genome assemblies and highlight the superior accuracy of assemblies generated with PacBio HiFi reads.
PAG PacBio Workshop: Introducing 5 new high-quality PacBio genome assemblies for rice to help solve the 10-billion people question
At PAG 2017, Rod Wing presented five new, high-quality rice genome assemblies developed with SMRT Sequencing, including one that has eight complete chromosomes including centromeres. He also offered an early…
In this presentation, Justin Blethrow provides an overview of recent and upcoming developments across PacBio’s SMRT Sequencing product portfolio, and their implications for PacBio’s major applications. In presenting the product…
The release of the PacBio Sequel II System in 2019 brought dramatic throughput improvements and protocols for producing a new data type, highly accurate long reads or HiFi reads. PacBio…
Domestication of clonally propagated crops such as pineapple from South America was hypothesized to be a ‘one-step operation’. We sequenced the genome of Ananas comosus var. bracteatus CB5 and assembled 513?Mb into 25 chromosomes with 29,412 genes. Comparison of the genomes of CB5, F153 and MD2 elucidated the genomic basis of fiber production, color formation, sugar accumulation and fruit maturation. We also resequenced 89 Ananas genomes. Cultivars ‘Smooth Cayenne’ and ‘Queen’ exhibited ancient and recent admixture, while ‘Singapore Spanish’ supported a one-step operation of domestication. We identified 25 selective sweeps, including a strong sweep containing a pair of tandemly duplicated bromelain inhibitors. Four candidate genes for self-incompatibility were linked in F153, but were not functional in self-compatible CB5. Our findings support the coexistence of sexual recombination and a one-step operation in the domestication of clonally propagated crops. This work guides the exploration of sexual and asexual domestication trajectories in other clonally propagated crops.
Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline
Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.List of abbreviationsTETransposable ElementsLTRLong Terminal RepeatLINELong Interspersed Nuclear ElementSINEShort Interspersed Nuclear ElementMITEMiniature Inverted Transposable ElementTIRTerminal Inverted RepeatTSDTarget Site DuplicationTPTrue PositivesFPFalse PositivesTNTrue NegativeFNFalse NegativesGRFGeneric Repeat FinderEDTAExtensive de-novo TE Annotator
Full-length mRNA sequencing and gene expression profiling reveal broad involvement of natural antisense transcript gene pairs in pepper development and response to stresses.
Pepper is an important vegetable with great economic value and unique biological features. In the past few years, significant development has been made towards understanding the huge complex pepper genome; however, pepper functional genomics has not been well studied. To better understand the pepper gene structure and pepper gene regulation, we conducted full-length mRNA sequencing by PacBio sequencing and obtained 57862 high-quality full-length mRNA sequences derived from 18362 previously annotated and 5769 newly detected genes. New gene models were built that combined the full-length mRNA sequences and corrected approximately 500 fragmented gene models from previous annotations. Based on the full-length mRNA, we identified 4114 and 5880 pepper genes forming natural antisense transcript (NAT) genes in-cis and in-trans, respectively. Most of these genes accumulate small RNAs in their overlapping regions. By analyzing these NAT gene expression patterns in our transcriptome data, we identified many NAT pairs responsive to a variety of biological processes in pepper. Pepper formate dehydrogenase 1 (FDH1), which is required for R-gene-mediated disease resistance, may be regulated by nat-siRNAs and participate in a positive feedback loop in salicylic acid biosynthesis during resistance responses. Several cis-NAT pairs and subgroups of trans-NAT genes were responsive to pepper pericarp and placenta development, which may play roles in capsanthin and capsaicin biosynthesis. Using a comparative genomics approach, the evolutionary mechanisms of cis-NATs were investigated, and we found that an increase in intergenic sequences accounted for the loss of most cis-NATs, while transposon insertion contributed to the formation of most new cis-NATs. This article is protected by copyright. All rights reserved.This article is protected by copyright. All rights reserved.
Plantibacter flavus, Curtobacterium herbarum, Paenibacillus taichungensis, and Rhizobium selenitireducens Endophytes Provide Host-Specific Growth Promotion of Arabidopsis thaliana, Basil, Lettuce, and Bok Choy Plants.
A collection of bacterial endophytes isolated from stem tissues of plants growing in soils highly contaminated with petroleum hydrocarbons were screened for plant growth-promoting capabilities. Twenty-seven endophytic isolates significantly improved the growth of Arabidopsis thaliana plants in comparison to that of uninoculated control plants. The five most beneficial isolates, one strain each of Curtobacterium herbarum, Paenibacillus taichungensis, and Rhizobium selenitireducens and two strains of Plantibacter flavus were further examined for growth promotion in Arabidopsis, lettuce, basil, and bok choy plants. Host-specific plant growth promotion was observed when plants were inoculated with the five bacterial strains. P. flavus strain M251 increased the total biomass and total root length of Arabidopsis plants by 4.7 and 5.8 times, respectively, over that of control plants and improved lettuce and basil root growth, while P. flavus strain M259 promoted Arabidopsis shoot and root growth, lettuce and basil root growth, and bok choy shoot growth. A genome comparison between P. flavus strains M251 and M259 showed that both genomes contain up to 70 actinobacterial putative plant-associated genes and genes involved in known plant-beneficial pathways, such as those for auxin and cytokinin biosynthesis and 1-aminocyclopropane-1-carboxylate deaminase production. This study provides evidence of direct plant growth promotion by Plantibacter flavusIMPORTANCE The discovery of new plant growth-promoting bacteria is necessary for the continued development of biofertilizers, which are environmentally friendly and cost-efficient alternatives to conventional chemical fertilizers. Biofertilizer effects on plant growth can be inconsistent due to the complexity of plant-microbe interactions, as the same bacteria can be beneficial to the growth of some plant species and neutral or detrimental to others. We examined a set of bacterial endophytes isolated from plants growing in a unique petroleum-contaminated environment to discover plant growth-promoting bacteria. We show that strains of Plantibacter flavus exhibit strain-specific plant growth-promoting effects on four different plant species.Copyright © 2019 American Society for Microbiology.
De novo genome assembly of the endangered Acer yangbiense, a plant species with extremely small populations endemic to Yunnan Province, China.
Acer yangbiense is a newly described critically endangered endemic maple tree confined to Yangbi County in Yunnan Province in Southwest China. It was included in a programme for rescuing the most threatened species in China, focusing on “plant species with extremely small populations (PSESP)”.We generated 64, 94, and 110 Gb of raw DNA sequences and obtained a chromosome-level genome assembly of A. yangbiense through a combination of Pacific Biosciences Single-molecule Real-time, Illumina HiSeq X, and Hi-C mapping, respectively. The final genome assembly is ~666 Mb, with 13 chromosomes covering ~97% of the genome and scaffold N50 sizes of 45 Mb. Further, BUSCO analysis recovered 95.5% complete BUSCO genes. The total number of repetitive elements account for 68.0% of the A. yangbiense genome. Genome annotation generated 28,320 protein-coding genes, assisted by a combination of prediction and transcriptome sequencing. In addition, a nearly 1:1 orthology ratio of dot plots of longer syntenic blocks revealed a similar evolutionary history between A. yangbiense and grape, indicating that the genome has not undergone a whole-genome duplication event after the core eudicot common hexaploidization.Here, we report a high-quality de novo genome assembly of A. yangbiense, the first genome for the genus Acer and the family Aceraceae. This will provide fundamental conservation genomics resources, as well as representing a new high-quality reference genome for the economically important Acer lineage and the wider order of Sapindales. © The Author(s) 2019. Published by Oxford University Press.