Third-generation single-molecule sequencing technologies from Pacific Biosciences, Moleculo, Oxford Nanopore, and other companies are revolutionizing genomics by enabling the sequencing of long, individual molecules of DNA and RNA. One major advantage of these technologies over current short-read sequencing is the ability to sequence much longer molecules: thousands or tens of thousands of nucleotides instead of mere hundreds. This capacity gives researchers substantially greater power to probe microbial, plant, and animal genomes, but it remains unclear how best to use these data. To address this, we systematically evaluated the human genome and 25 other important genomes across the tree of life, ranging in size from 1 Mbp to 3 Gbp, to determine how long the reads need to be and how much coverage is necessary to completely assemble their chromosomes with single-molecule sequencing. We also present a novel error correction and assembly algorithm using a combination of PacBio and pre-assembled Illumina sequencing. This new algorithm greatly outperforms other published hybrid algorithms.
Rapid full-length Iso-Seq cDNA sequencing of rice mRNA to facilitate annotation and identify splice-site variation.
PacBio’s new Iso-Seq technology allows for rapid generation of full-length cDNA sequences without the need for assembly steps. The technology was tested on leaf mRNA from two model O. sativa ssp. indica cultivars – Minghui 63 and Zhenshan 97. Even though each transcriptome was not exhaustively sequenced, several thousand isoforms described genes over a wide size range, most of which are not present in any currently available FL cDNA collection. In addition, the lack of an assembly requirement provides direct and immediate access to complete mRNA sequences and rapid unraveling of biological novelties.
Since the advent of Next-Generation Sequencing (NGS), the cost of de novo genome sequencing and assembly has dropped precipitously, which has spurred interest in genome sequencing overall. Unfortunately, the contiguity and the accuracy of NGS assemblies have suffered. Additionally, most NGS de novo assemblies leave large portions of genomes unresolved, and repetitive regions are often collapsed. Compared with the reference-quality genome sequences produced before the NGS era, the new sequences are highly fragmented and often prove difficult to annotate properly. In some cases the contiguous portions are smaller than the average gene size, making the sequence far less useful to biologists than the earlier reference-quality genomes, including those of human, mouse, C. elegans, and Drosophila. Recently, new third-generation sequencing technologies, long-range molecular techniques, and new informatics tools have facilitated a return to high-quality assembly. We will discuss the capabilities of the technologies and assess their impact on assembly projects across the tree of life, from small microbial and fungal genomes through large plant and animal genomes. Beyond improvements to contiguity, we will focus on the additional biological insights that can be made with better assemblies, including more complete analysis of genes in their flanking regulatory context, in-depth studies of transposable elements and other complex gene families, and long-range synteny analysis of entire chromosomes. We will also discuss the need for new algorithms for representing and analyzing collections of many complete genomes at once.
Genomics studies have shown that the insertions, deletions, duplications, translocations, inversions, and tandem repeat expansions in the structural variant (SV) size range (>50 bp) contribute to the evolution of traits and often have significant associations with agronomically important phenotypes. However, most SVs are too small to detect with array comparative genomic hybridization and too large to reliably discover with short-read DNA sequencing. While de novo assembly is the most comprehensive way to identify variants in a genome, recent studies in human genomes show that PacBio SMRT Sequencing sensitively detects structural variants at low coverage. Here we present SV characterization in the major crop species Oryza sativa subsp. indica (rice) with low-fold coverage of long reads. In addition, we provide recommendations for sequencing and analysis for the application of this workflow to other important agricultural species.
HiFi reads (>99% accurate, 15-20 kb) from the PacBio Sequel II System consistently provide complete and contiguous genome assemblies. In addition to completeness and contiguity, accuracy is of critical importance, as assembly errors complicate downstream analysis, particularly by disrupting gene frames. Metrics used to assess assembly accuracy include: 1) in-frame gene count, 2) kmer consistency, and 3) concordance to a benchmark, where discordances are interpreted as assembly errors. Genome in a Bottle (GIAB) provides a benchmark for the human genome with estimated accuracy of 99.9999% (Q60). Concordance for human HiFi assemblies exceeds Q50, which provides excellent genomes for downstream analysis, but presents a challenge that any new benchmark must significantly exceed Q50 or the discordance will represent the error rate of the benchmark. To establish benchmarks for Oryza sativa and Drosophila melanogaster, we collected draft references, Illumina short reads, and PacBio HiFi reads. By species, the benchmark was defined as regions of normal coverage that are not within 5 bp of a small variant or 50 bp of a structural variant. For both species, the benchmark regions span around 60% of the genome and HiFi assemblies achieve Q50 accuracy, which is notably more accurate than assemblies with other technologies and meets typical standards for a finished, reference-grade assembly. Here we present a protocol to generate benchmarks for any sample that rival the GIAB benchmark in accuracy. These benchmarks allow the comparison and improvement of genome assemblies and highlight the superior accuracy of assemblies generated with PacBio HiFi reads.
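The accuracy figures in this abstract use the Phred scale, Q = -10·log10(error rate), so Q60 corresponds to 99.9999% accuracy. The sketch below converts between the two and illustrates why a benchmark must be well above Q50 to measure a Q50 assembly. The function names are hypothetical, and the combined-error calculation assumes that assembly and benchmark errors are independent and simply add, which is a simplification and not the authors' protocol.

```python
import math


def error_rate_to_phred(error_rate: float) -> float:
    """Convert a per-base error rate (0 < rate <= 1) to a Phred-scaled Q value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred-scaled Q value to per-base accuracy (1 - error rate)."""
    return 1 - 10 ** (-q / 10)


def concordance_qv(discordant_bases: int, benchmark_bases: int) -> float:
    """Phred-scaled concordance of an assembly over the benchmark regions."""
    return error_rate_to_phred(discordant_bases / benchmark_bases)


def observed_qv(assembly_q: float, benchmark_q: float) -> float:
    """Apparent Q when true assembly errors and benchmark errors both count
    as discordances (assumption: independent, additive error rates)."""
    assembly_err = 10 ** (-assembly_q / 10)
    benchmark_err = 10 ** (-benchmark_q / 10)
    return error_rate_to_phred(assembly_err + benchmark_err)


if __name__ == "__main__":
    print(f"Q60 accuracy: {phred_to_accuracy(60):.6%}")
    print(f"Q50 assembly vs Q60 benchmark: Q{observed_qv(50, 60):.1f}")
    print(f"Q50 assembly vs Q50 benchmark: Q{observed_qv(50, 50):.1f}")
```

Under these assumptions, a benchmark that is only as accurate as the assembly makes the assembly look roughly 3 Q units worse than it is, while a Q60 benchmark distorts a Q50 measurement by less than half a unit.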
By 2050, there will be 9 billion people on the planet. What will they eat? This is the question that led Rod Wing, Director of the Arizona Genomics Institute, into…
PAG PacBio Workshop: Introducing 5 new high-quality PacBio genome assemblies for rice to help solve the 10-billion people question
At PAG 2017, Rod Wing presented five new, high-quality rice genome assemblies developed with SMRT Sequencing, including one that has eight complete chromosomes including centromeres. He also offered an early…
In a poster presented at AGBT 2017, Fritz Sedlazeck from Johns Hopkins University describes the comparison of genome assemblies produced using long-read PacBio sequencing and short-read sequencing with 10x Genomics…
We have isolated several Osiaa23 rice mutants with different knockout genotypes, resulting in different phenotypes, which suggests that genetic background or mutation type influences gene function. The Auxin/Indole-3-Acetic Acid (Aux/IAA) gene family performs critical roles in auxin signal transduction in plants. In rice, the gene OsIAA23 (Os06t0597000) is known to affect development of roots and shoots, but previous OsIAA23 knockouts have been sterile and therefore difficult to study further. Here, we isolate new Osiaa23 mutants using the CRISPR/Cas9 system in japonica (Wuyunjing24) and indica (Kasalath) rice, with extensive genome re-sequencing to confirm the absence of off-target effects. In Kasalath, mutants with a 13-amino acid deletion showed profoundly greater dwarfing, lateral root developmental disorder, and fertility deficiency, relative to mutants with a single amino acid deletion, demonstrating that those 13 amino acids in Kasalath are essential to gene function. In Wuyunjing24, we predicted that mutants with a single base-pair frameshift insertion would experience premature termination and strong phenotypic defects, but instead these lines exhibited negligible phenotypic difference and normal fertility. Through RNA-seq, we show here that new mosaic transcripts of OsIAA23 were produced de novo, which circumvented the premature termination and thereby preserved the wild-type phenotype. This finding is a notable demonstration in plants that mutants can mask loss-of-function CRISPR/Cas9 editing of the target gene through de novo changes in alternative splicing.
PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice.
In eukaryotes, alternative splicing (AS) greatly expands the diversity of transcripts. However, it is challenging to accurately determine full-length splicing isoforms. Recently, more studies have taken advantage of Pacific Biosciences (PacBio) long-read sequencing to identify full-length transcripts. Nevertheless, the high error rate of PacBio reads seriously offsets the advantages of long reads, especially for accurately identifying splicing junctions. To best capitalize on the features of long reads, we used Illumina RNA-seq reads to improve PacBio circular consensus sequence (CCS) quality and to validate splicing patterns in the rice transcriptome. We evaluated the impact of CCS accuracy on the number and the validation rate of splicing isoforms, and integrated a comprehensive pipeline of splicing transcripts analysis by Iso-Seq and RNA-seq (STAIR) to identify the full-length multi-exon isoforms in the rice seedling transcriptome (Oryza sativa L. ssp. japonica). STAIR discovered 11 733 full-length multi-exon isoforms, 6599 more than the SMRT Portal RS_IsoSeq pipeline did. Of these splicing isoforms identified, 4453 (37.9%) were missed in assembled transcripts from RNA-seq reads, and 5204 (44.4%), including 268 multi-exon long non-coding RNAs (lncRNAs), were not reported in the MSU_osa1r7 annotation. Some randomly selected unreported splicing junctions were verified by polymerase chain reaction (PCR) amplification. In addition, we investigated alternative polyadenylation (APA) events in transcripts and identified 829 major polyadenylation [poly(A)] site clusters (PACs). The analysis of splicing isoforms and APA events will facilitate the annotation of the rice genome and studies on the expression and polyadenylation of AS genes in different developmental stages or growth conditions of rice. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.
Gene targeting by the TAL effector PthXo2 reveals cryptic resistance gene for bacterial blight of rice.
Bacterial blight of rice is caused by the γ-proteobacterium Xanthomonas oryzae pv. oryzae, which utilizes a group of type III TAL (transcription activator-like) effectors to induce host gene expression and condition host susceptibility. Five SWEET genes are functionally redundant to support bacterial disease, but only two were experimentally proven targets of natural TAL effectors. Here, we report the identification of the sucrose transporter gene OsSWEET13 as the disease-susceptibility gene for PthXo2 and the existence of cryptic recessive resistance to PthXo2-dependent X. oryzae pv. oryzae due to promoter variations of OsSWEET13 in japonica rice. PthXo2-containing strains induce OsSWEET13 in indica rice IR24 due to the presence of an unpredicted and undescribed effector binding site not present in the alleles in japonica rice Nipponbare and Kitaake. The specificity of effector-associated gene induction and disease susceptibility is attributable to a single nucleotide polymorphism (SNP), which is also found in a polymorphic allele of OsSWEET13 known as the recessive resistance gene xa25 from the rice cultivar Minghui 63. The mutation of OsSWEET13 with CRISPR/Cas9 technology further corroborates the requirement of OsSWEET13 expression for the state of PthXo2-dependent disease susceptibility to X. oryzae pv. oryzae. Gene profiling of a collection of 104 strains revealed OsSWEET13 induction by 42 isolates of X. oryzae pv. oryzae. Heterologous expression of OsSWEET13 in Nicotiana benthamiana leaf cells elevates sucrose concentrations in the apoplasm. The results corroborate a model whereby X. oryzae pv. oryzae enhances the release of sucrose from host cells in order to exploit the host resources. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.
A knowledge-based molecular screen uncovers a broad-spectrum OsSWEET14 resistance allele to bacterial blight from wild rice.
Transcription activator-like (TAL) effectors are type III-delivered transcription factors that enhance the virulence of plant pathogenic Xanthomonas species through the activation of host susceptibility (S) genes. TAL effectors recognize their DNA target(s) via a partially degenerate code, whereby modular repeats in the TAL effector bind to nucleotide sequences in the host promoter. Although this knowledge has greatly facilitated our power to identify new S genes, it can also be easily used to screen plant genomes for variations in TAL effector target sequences and to predict loss-of-function gene candidates in silico. In a proof-of-principle experiment, we screened a germplasm collection of 169 rice accessions for polymorphism in the promoter of the major bacterial blight susceptibility (S) gene OsSWEET14, which encodes a sugar transporter targeted by numerous strains of Xanthomonas oryzae pv. oryzae (Xoo). We identified a single allele with a deletion of 18 bp overlapping with the binding sites targeted by several TAL effectors known to activate the gene. We show that this allele, which we call xa41(t), confers resistance against half of the tested Xoo strains, representative of various geographic origins and genetic lineages, highlighting the selective pressure on the pathogen to accommodate OsSWEET14 polymorphism, and reciprocally the apparent limited possibilities for the host to create variability at this particular S gene. Analysis of xa41(t) conservation across the Oryza genus enabled us to hypothesize scenarios as to its evolutionary history, prior to and during domestication. Our findings demonstrate that resistance through TAL effector-dependent loss of S-gene expression can be greatly fostered upon knowledge-based molecular screening of a large collection of host plants. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.
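The in silico screening idea described above can be illustrated with a minimal sketch that scans promoter alleles for sites compatible with a TAL effector's repeat-variable diresidues (RVDs). The RVD-to-nucleotide table is the commonly cited simplified code (NI to A, HD to C, NG to T, NN to G or A, NS to any base); real binding is probabilistic and context-dependent. The function name, the RVD array, and both promoter sequences are hypothetical and are not taken from the study.

```python
# Simplified RVD-to-nucleotide preferences. This is an approximation:
# actual TAL effector binding tolerates mismatches to varying degrees.
RVD_CODE = {
    "NI": "A",
    "HD": "C",
    "NG": "T",
    "NN": "GA",
    "NS": "ACGT",
}


def find_binding_sites(promoter, rvds, max_mismatches=0):
    """Return 0-based start positions where the promoter is compatible
    with the RVD array, allowing up to max_mismatches mismatches."""
    promoter = promoter.upper()
    allowed = [RVD_CODE[rvd] for rvd in rvds]
    site_len = len(allowed)
    hits = []
    for start in range(len(promoter) - site_len + 1):
        window = promoter[start:start + site_len]
        mismatches = sum(base not in ok for base, ok in zip(window, allowed))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hypothetical four-repeat effector and toy promoter alleles.
    rvds = ["HD", "NI", "NG", "NN"]      # compatible with CATG or CATA
    susceptible_allele = "TTTCATGAAA"    # carries an intact site
    deletion_allele = "TTTCAAAA"         # site disrupted by a 2-bp deletion
    for name, seq in [("susceptible", susceptible_allele),
                      ("deletion", deletion_allele)]:
        print(name, find_binding_sites(seq, rvds))
```

An allele that returns no hits for the effectors known to activate a susceptibility gene would be a candidate for loss of induction, which is the kind of candidate the screen aims to surface before any laboratory validation.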
Various stable circular RNAs (circRNAs) have recently been identified as an abundant class of noncoding RNAs in Archaea, Caenorhabditis elegans, mice, and humans through high-throughput deep sequencing coupled with analysis of massive transcriptional data. CircRNAs play important roles in miRNA function and transcriptional control by acting as competing endogenous RNAs or as positive regulators of their parent coding genes. However, little is known regarding circRNAs in plants. Here, we report 2354 rice circRNAs that were identified through deep sequencing and computational analysis of ssRNA-seq data. Among them, 1356 are exonic circRNAs. Some circRNAs exhibit tissue-specific expression. Rice circRNAs have a considerable number of isoforms, including alternative backsplicing and alternative splicing circularization patterns. Parental genes with multiple exons are preferentially circularized. Only 484 circRNAs have backsplices derived from known splice sites. In addition, only 92 circRNAs were found to be enriched for miniature inverted-repeat transposable elements (MITEs) in flanking sequences or to be complementary to at least 18-bp flanking intronic sequences, indicating that there are some other production mechanisms in addition to direct backsplicing in rice. Rice circRNAs have no significant enrichment for miRNA target sites. A transgenic study showed that overexpression of a circRNA construct could reduce the expression level of its parental gene in transgenic plants compared with empty-vector control plants. This suggested that circRNA and its linear form might act as a negative regulator of its parental gene. Overall, these analyses reveal the prevalence of circRNAs in rice and provide new biological insights into rice circRNAs. © 2015 Lu et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza.
The genus Oryza is a model system for the study of molecular evolution over time scales ranging from a few thousand to 15 million years. Using 13 reference genomes spanning the Oryza species tree, we show that, despite few large-scale chromosomal rearrangements, rapid species diversification is mirrored by lineage-specific emergence and turnover of many novel elements, including transposons and potential new coding and noncoding genes. Our study resolves controversial areas of the Oryza phylogeny, showing a complex history of introgression among different chromosomes in the young ‘AA’ subclade containing the two domesticated species. This study highlights the prevalence of functionally coupled disease resistance genes and identifies many new haplotypes of potential use for future crop protection. Finally, this study marks a milestone in modern rice research with the release of a complete long-read assembly of IR 8 ‘Miracle Rice’, which relieved famine and drove the Green Revolution in Asia 50 years ago.
The wild relatives of rice have adapted to different ecological environments and constitute a useful reservoir of agronomic traits for genetic improvement. Here we present the ~777 Mb de novo assembled genome sequence of Oryza granulata. Recent bursts of long-terminal repeat retrotransposons, especially RIRE2, led to a rapid twofold increase in genome size after O. granulata speciation. Universal centromeric tandem repeats are absent within its centromeres, while gypsy-type LTRs constitute the main centromere-specific repetitive elements. A total of 40,116 protein-coding genes were predicted in O. granulata, which is close to the number in Oryza sativa. Both the copy number and function of genes involved in photosynthesis and energy production have undergone positive selection during the evolution of O. granulata, which might have facilitated its adaptation to low-light habitats. Together, our findings reveal the rapid genome expansion, distinctive centromere organization, and adaptive evolution of O. granulata.