Single Molecule, Real-Time (SMRT) Sequencing provides efficient, streamlined solutions to address new frontiers in plant genomes and transcriptomes. Inherent challenges presented by highly repetitive, low-complexity regions and duplication events are directly addressed with multi-kilobase read lengths exceeding 8.5 kb on average, with many exceeding 20 kb. Differentiating between transcript isoforms that are difficult to resolve with short-read technologies is also now possible. We present solutions available for both reference genome and transcriptome research that best leverage long reads in several plant projects including algae, Arabidopsis, rice, and spinach using only the PacBio platform. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. We will share highlights from our genome projects using the latest P5-C3 chemistry to generate high-quality reference genomes with the highest contiguity, contig N50 exceeding 1 Mb, and average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq protocol will be presented for full transcriptome characterization and targeted surveys of genes with complex structures. PacBio provides the most comprehensive assembly with annotation when combining offerings for both genome and transcriptome research efforts. For more focused investigation, PacBio also offers researchers opportunities to easily investigate and survey genes with complex structures.
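To make the quoted base quality concrete, the following minimal sketch converts between a quality value and a per-base error probability. It assumes the QV reported above is on the standard Phred scale (QV = -10 log10 of the error probability); the function names are illustrative and are not part of any PacBio tool.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def accuracy_to_qv(accuracy):
    """Convert a fractional accuracy (e.g. 0.9999) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10 * math.log10(error)


if __name__ == "__main__":
    # Under the Phred assumption, QV50 is about one expected error per 100,000 bases.
    print(qv_to_error_rate(50))
    print(accuracy_to_qv(0.9999))
```

Under this assumption, an average of QV50 corresponds to roughly one expected error per 100,000 consensus bases.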
Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high-quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy-aware assembler, FALCON (https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (“FALCON-unzip”) for de novo haplotype reconstructions from SMRT Sequencing data. We generated two datasets for developing the algorithms and the prototype software: (1) whole genome sequencing data from a highly repetitive diploid fungus (Clavicorona pyxidata) and (2) whole genome sequencing data from an F1 hybrid from two inbred Arabidopsis strains: Cvi-0 and Col-0. For the fungal genome, we achieved an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs (haplotype-specific contigs) of 872 kb from a 95X-coverage dataset with a read length N50 of ~16 kb. We found that ~45% of the genome was highly heterozygous and ~55% of the genome was highly homozygous. We developed methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from the Illumina® platform. For the Arabidopsis F1 hybrid genome, we found that 80% of the genome could be separated into haplotigs. The long-range accuracy of phasing haplotigs was evaluated by comparing them to the assemblies from the two inbred parental lines. We show that a more complete view of all haplotypes could provide useful biological insights through improved annotation, characterization of heterozygous variants of all sizes, and resolution of differential allele expression.
The current FALCON-Unzip method will help us understand how to solve more difficult polyploid genome assembly problems and improve the computational efficiency of large genome assemblies. Based on this work, we can develop a pipeline that enables routine assembly of diploid or polyploid genomes as haplotigs, representing a comprehensive view of the genomes that can be studied with the information at hand.
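The N50 figures quoted for contigs, haplotigs, and reads all use the same statistic: the length L such that sequences of length L or longer contain at least half of the total bases. The sketch below is a generic illustration of that definition; it is not taken from the FALCON code base, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together account for at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths in bp (illustrative values, not from the abstract).
    print(n50([10, 8, 6, 4, 2]))
```

Because the statistic depends only on a list of lengths, the same function applies to contig, haplotig, and read length distributions.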
While genome assembly projects have been successful in many haploid and inbred species, the assembly of non-inbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
Endogenous sequence patterns predispose the repair modes of CRISPR/Cas9-induced DNA double-stranded breaks in Arabidopsis thaliana.
The ability to predict the outcome of targeted DNA double-stranded break (DSB) repair would be desirable for genome editing. Furthermore, the consequences of mis-repair of potentially cell-lethal DSBs and the underlying pathways are not yet fully understood. Here we study the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9-induced mutation spectra at three selected endogenous loci in Arabidopsis thaliana by deep sequencing of long amplicon libraries. Notably, we found sequence-dependent genomic features that affected the DNA repair outcome. Deletions of 1 bp to <1000 bp in size and/or very short insertions, deletions >1 kbp (all due to NHEJ), and deletions combined with insertions of 5 bp to >100 bp [caused by a synthesis-dependent strand annealing (SDSA)-like mechanism] occurred most frequently at all three loci. The appearance of single-stranded annealing events depends on the presence of, and distance between, repeats flanking the DSB. The frequency and size of insertions are increased if a sequence with high similarity to the target site is available in cis. Most deletions were linked to pre-existing microhomology. Deletion and/or insertion mutations were either blunt-end ligated or joined via de novo generated microhomology. While most mutation types and, to some degree, their predictability are comparable with animal systems, the broad range of deletion mutations seems to be a peculiar feature of the plant A. thaliana. © 2017 The Authors. The Plant Journal © 2017 John Wiley & Sons Ltd.
The utility of genome assemblies does not only rely on the quality of the assembled genome sequence, but also on the quality of the gene annotations. The Pacific Biosciences Iso-Seq technology is a powerful support for accurate eukaryotic gene model annotation as it allows for direct readout of full-length cDNA sequences without the need for noisy short read-based transcript assembly. We propose the implementation of the TeloPrime Full Length cDNA Amplification kit to the Pacific Biosciences Iso-Seq technology in order to enrich for genuine full-length transcripts in the cDNA libraries. We provide evidence that TeloPrime outperforms the commonly used SMARTer PCR cDNA Synthesis Kit in identifying transcription start and end sites in Arabidopsis thaliana. Furthermore, we show that TeloPrime-based Pacific Biosciences Iso-Seq can be successfully applied to the polyploid genome of bread wheat (Triticum aestivum) not only to efficiently annotate gene models, but also to identify novel transcription sites, gene homeologs, splicing isoforms and previously unidentified gene loci.
Proteogenomic analysis reveals alternative splicing and translation as part of the abscisic acid response in Arabidopsis seedlings.
In eukaryotes, mechanisms such as alternative splicing (AS) and alternative translation initiation (ATI) contribute to organismal protein diversity. Specifically, splicing factors play crucial roles in responses to environmental and developmental cues; however, the underlying mechanisms are not well investigated in plants. Here, we report the parallel employment of short-read RNA sequencing, single-molecule long-read sequencing and proteomic identification to unravel AS isoforms and previously unannotated proteins in response to abscisic acid (ABA) treatment. Combining the data from the two sequencing methods, approximately 83.4% of intron-containing genes were alternatively spliced. Two AS types, which are referred to as alternative first exon (AFE) and alternative last exon (ALE), were more abundant than intron retention (IR); however, in contrast to AS events detected under normal conditions, differentially expressed AS isoforms were more likely to be translated. ABA extensively affects the AS pattern, as indicated by the increasing number of non-conventional splicing sites. This work also identified thousands of unannotated peptides and proteins by ATI based on mass spectrometry and a virtual peptide library deduced from both strands of coding regions within the Arabidopsis genome. The results enhance our understanding of AS and alternative translation mechanisms under normal conditions, and in response to ABA treatment. © 2017 The Authors. The Plant Journal © 2017 John Wiley & Sons Ltd.
High-resolution expression map of the Arabidopsis root reveals alternative splicing and lincRNA regulation.
The extent to which alternative splicing and long intergenic noncoding RNAs (lincRNAs) contribute to the specialized functions of cells within an organ is poorly understood. We generated a comprehensive dataset of gene expression from individual cell types of the Arabidopsis root. Comparisons across cell types revealed that alternative splicing tends to remove parts of coding regions from a longer, major isoform, providing evidence for a progressive mechanism of splicing. Cell-type-specific intron retention suggested a possible origin for this common form of alternative splicing. Coordinated alternative splicing across developmental stages pointed to a role in regulating differentiation. Consistent with this hypothesis, distinct isoforms of a transcription factor were shown to control developmental transitions. lincRNAs were generally lowly expressed at the level of individual cell types, but co-expression clusters provided clues as to their function. Our results highlight insights gained from analysis of expression at the level of individual cell types. Copyright © 2016 Elsevier Inc. All rights reserved.
We have explored the importance of the phyllosphere microbiome in plant resistance in the cuticle mutants bdg (BODYGUARD) and lacs2.3 (LONG CHAIN FATTY ACID SYNTHASE 2), which are strongly resistant to the fungal pathogen Botrytis cinerea. The study includes infection of plants under sterile conditions, 16S ribosomal DNA sequencing of the phyllosphere microbiome, and isolation and high-coverage sequencing of bacteria from the phyllosphere. When inoculated under sterile conditions, bdg became as susceptible as wild-type (WT) plants, whereas lacs2.3 mutants retained the resistance. Adding washes of its phyllosphere microbiome could restore the resistance of bdg mutants, whereas the resistance of lacs2.3 results from endogenous mechanisms. The phyllosphere microbiome showed distinct populations in WT plants compared to cuticle mutants. One species, identified as Pseudomonas sp. and isolated from the microbiome of bdg, provided resistance to B. cinerea on Arabidopsis thaliana as well as on apple fruits. No direct activity was observed against B. cinerea, and the action of the bacterium required the plant. Thus, microbes present on the plant surface contribute to the resistance to B. cinerea. These results open new perspectives on the function of the leaf microbiome in the protection of plants. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.
DNA methylation on N6-adenine (6mA) has recently been found to be a potential epigenetic mark in several unicellular and multicellular eukaryotes. However, its distribution patterns and potential functions in land plants, which are primary producers for most ecosystems, remain largely unknown. Here we report global profiling of 6mA sites at single-nucleotide resolution in the genome of Arabidopsis thaliana at different developmental stages using single-molecule real-time sequencing. 6mA sites are widely distributed across the Arabidopsis genome and enriched over the pericentromeric heterochromatin regions. 6mA occurs more frequently in gene bodies than in intergenic regions. Analysis of 6mA methylomes and RNA sequencing data demonstrates that 6mA frequency positively correlates with the gene expression level and the transition from vegetative to reproductive growth in Arabidopsis. Our results uncover 6mA as a DNA mark associated with actively expressed genes in Arabidopsis, suggesting that 6mA serves as a hitherto unknown epigenetic mark in land plants. Copyright © 2018 Elsevier Inc. All rights reserved.
Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
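The idea behind MinHash-based overlap detection can be illustrated with a small sketch: each read is reduced to a fixed-size sketch of minimum k-mer hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets. This is a simplified illustration only, not the MHAP implementation; the function names, the use of `hashlib.blake2b` as the seeded hash, and the default values of `k` and `num_hashes` are all assumptions chosen for readability.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Hash a k-mer to a 64-bit integer using a seed-specific salt (illustrative choice)."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=64):
    """Build a MinHash sketch: the minimum hash value under each seeded hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(seeded_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of sketch positions that agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Toy reads (made-up sequences); read_b overlaps the second half of read_a.
    read_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCATGCAATCGGATCCA"
    read_b = "AACGTTAGCCATGCAATCGGATCCATTGACCGGTATTACGCAGT"
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Comparing fixed-size sketches instead of full k-mer sets is what makes an all-against-all overlap search tractable for large numbers of long reads; a production overlapper would add filtering and alignment steps that this sketch omits.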
Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.
The power of Single Molecule Real-Time sequencing technology in the de novo assembly of a eukaryotic genome.
Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of these genomes still remains unassembled. We reconstructed the azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. The SMRT-based assembly produced 100 times longer contigs with a 100 times smaller amount of gaps compared to the SGS-based assemblies. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies, in which thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on the high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though it still needed the support of SGS data, achieved a near-complete assembly of a eukaryotic genome.
Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms.
Resequencing or reference-based assemblies reveal large parts of the small-scale sequence variation. However, they typically fail to separate such local variation into colinear and rearranged variation, because they usually do not recover the complement of large-scale rearrangements, including transpositions and inversions. Despite the availability of hundreds of genomes of diverse Arabidopsis thaliana accessions, there is so far only one full-length assembled genome: the reference sequence. We have assembled 117 Mb of the A. thaliana Landsberg erecta (Ler) genome into five chromosome-equivalent sequences using a combination of short Illumina reads, long PacBio reads, and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inversions comprising ~3.6 Mb, in addition to 4.1 Mb of nonreference sequence, mostly originating from duplications. Although rearranged regions are not different in local divergence from colinear regions, they are drastically depleted for meiotic recombination in heterozygotes. Using a 1.2-Mb inversion as an example, we show that such rearrangement-mediated reduction of meiotic recombination can lead to genetically isolated haplotypes in the worldwide population of A. thaliana. Moreover, we found 105 single-copy genes that were present in only the reference sequence or the Ler assembly, and 334 single-copy orthologs that showed an additional copy in only one of the genomes. To our knowledge, this work gives the first insights into the degree and type of variation that will be revealed once complete assemblies replace resequencing or other reference-dependent methods.
Deletion-bias in DNA double-strand break repair differentially contributes to plant genome shrinkage.
In order to prevent genome instability, cells need to be protected by a number of repair mechanisms, including DNA double-strand break (DSB) repair. The extent to which DSB repair, biased towards deletions or insertions, contributes to evolutionary diversification of genome size is still under debate. We analyzed mutation spectra in Arabidopsis thaliana and in barley (Hordeum vulgare) by PacBio sequencing of three DSB-targeted loci each, uncovering repair via gene conversion, single strand annealing (SSA) or nonhomologous end-joining (NHEJ). Furthermore, phylogenomic comparisons between A. thaliana and two related species were used to detect naturally occurring deletions during Arabidopsis evolution. Arabidopsis thaliana revealed significantly more and larger deletions after DSB repair than barley, and barley displayed more and larger insertions. Arabidopsis displayed a clear net loss of DNA after DSB repair, mainly via SSA and NHEJ. Barley revealed a very weak net loss of DNA, apparently due to less active break-end resection and easier copying of template sequences into breaks. Comparative phylogenomics revealed several footprints of SSA in the A. thaliana genome. Quantitative assessment of DNA gain and loss through DSB repair processes suggests deletion-biased DSB repair causing ongoing genome shrinking in A. thaliana, whereas genome size in barley remains nearly constant. © 2017 The Authors. New Phytologist © 2017 New Phytologist Trust.