AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technology.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed by the coverage biases associated with short-read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obviate the need for assembly. Direct sequencing of DNA in real time has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb).
This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
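The consensus accuracies quoted above (QV60, QV50) are Phred-scaled quality values. As a minimal sketch, assuming the standard Phred definition QV = -10 * log10(error probability), the conversion between QV and per-base accuracy might be written as follows. The function names are illustrative and are not part of any PacBio software.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to the expected per-base accuracy."""
    # Error probability is 10^(-QV/10); accuracy is its complement.
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)
```

Under this definition, QV60 corresponds to one expected error per million bases (99.9999%), and QV50 to one expected error per 100,000 bases (99.999%).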
Single Molecule, Real-Time (SMRT) Sequencing provides efficient, streamlined solutions to address new frontiers in plant genomes and transcriptomes. Inherent challenges presented by highly repetitive, low-complexity regions and duplication events are directly addressed with multi-kilobase read lengths exceeding 8.5 kb on average, with many exceeding 20 kb. Differentiating between transcript isoforms that are difficult to resolve with short-read technologies is also now possible. We present solutions available for both reference genome and transcriptome research that best leverage long reads in several plant projects including algae, Arabidopsis, rice, and spinach using only the PacBio platform. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. We will share highlights from our genome projects using the latest P5-C3 chemistry to generate high-quality reference genomes with the highest contiguity, contig N50 exceeding 1 Mb, and average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq protocol will be presented for full transcriptome characterization and targeted surveys of genes with complex structures. PacBio provides the most comprehensive assembly with annotation when combining offerings for both genome and transcriptome research efforts. For more focused investigation, PacBio also offers researchers opportunities to easily investigate and survey genes with complex structures.
Goat is an important source of milk, meat, and fiber, especially in developing countries. An advantage of goats as livestock is the low maintenance requirements and high adaptability compared to other milk producers. The global population of domestic goats exceeds 800 million. In Africa, goat production is characterized by low productivity levels, and attempts to introduce more productive breeds have met with poor success due in part to nutritional constraints. It has been suggested that incorporation of selective breeding within the herds adapted for survival could represent one approach to improving food security across Africa. A recently produced genome assembly of a Chinese Yunnan breed goat, based on 192 Gb of short reads across a range of insert sizes from 180 bp to 20 kb, reported a contig N50 of 18.7 kb. The scaffold N50 was improved from 2.2 Mb to 3.1 Mb by addition of fosmid end sequence, with an estimated 140 million Ns in gaps and 91% coverage. The assembly has proven somewhat problematic for pursuing genome-wide association analysis with SNP arrays, apparently due in part to errors in ordering of markers using the draft genome. In order to provide a higher quality assembly, we sequenced a highly inbred, San Clemente breed goat genome using 458 SMRT cells on the Pacific Biosciences platform. These cells generated 193.5 Gbases of sequence after processing into subreads, with mean 5110 bases and max subread length of 40.5 kb. This sequence data generated an assembly using the recently reported MHAP error correction approach and Celera Assembler v8.2. The contig N50 was 2.5 Mb, with the largest contig spanning 19.5 Mb. Additional characteristics of the assembly will be presented.
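Contig and scaffold N50 values are the main contiguity metric in these abstracts. As a minimal sketch, assuming the common definition (the length L such that contigs of length L or longer contain at least half of the assembled bases), N50 could be computed as below. The function name and the toy lengths are hypothetical and are not taken from the goat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_contigs)
```

For the toy input, the total is 20 bases, and the two longest contigs (6 and 5) already cover 11 of them, so the N50 is 5. Because the metric is weighted by bases rather than by contig count, a few very long contigs can raise it sharply.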
De novo assembly of a complex panicoid grass genome using ultra-long PacBio reads with P6C4 chemistry
Drought is responsible for much of the global losses in crop yields and understanding how plants naturally cope with drought stress is essential for breeding and engineering crops for the changing climate. Resurrection plants desiccate to complete dryness during times of drought, then “come back to life” once water is available, making them an excellent model for studying drought tolerance. Understanding the molecular networks governing how resurrection plants handle desiccation will provide targets for crop engineering. Oropetium thomaeum (Oro) is a resurrection plant that also has the smallest known grass genome at 250 Mb compared to Brachypodium distachyon (300 Mb) and rice (350 Mb). Plant genomes, especially grasses, have complex repeat structures such as telomeres, centromeres, and ribosomal gene cassettes, and high heterozygosity, which make them difficult to assemble using short-read next-generation sequencing technologies. We used ultra-long PacBio reads with the new P6C4 chemistry and the latest 15 kb BluePippin size-selection protocol to generate 20 kb insert libraries that yielded an average read length of 12 kb, providing ~72X coverage, and 10X coverage with reads over 20 kb. The HGAP assembly covers 98% of the genome with a contig N50 of 2.4 Mb, which makes it one of the highest quality and most complete plant genomes assembled to date. Oro has a compact genome structure compared to other grasses with only 16% repeat sequences but has very good collinearity with other grasses. Understanding the genomic mechanisms of extreme desiccation tolerance in resurrection plants like Oro will provide insights for engineering and intelligent breeding of improved food, fuel, and fiber crops.
A comprehensive study of the sugar pine (Pinus lambertiana) transcriptome implemented through diverse next-generation sequencing approaches
The assembly, annotation, and characterization of the sugar pine (Pinus lambertiana Dougl.) transcriptome represents an opportunity to study the genetic mechanisms underlying resistance to the invasive white pine blister rust (Cronartium ribicola) as well as responses to other abiotic stresses. The assembled transcripts also provide a resource to improve the genome assembly. We selected a diverse set of tissues allowing the first comprehensive evaluation of the sugar pine gene space. We have combined short read sequencing technologies (Illumina MiSeq and HiSeq) with the relatively new Pacific Biosciences Iso-Seq approach. From 2.5 billion Illumina reads and 1.6 million PacBio reads (46 SMRT cells), 33,720 unigenes were de novo assembled. Comparison of sequencing technologies revealed improved coverage with Illumina HiSeq reads and better splice variant detection with PacBio Iso-Seq reads. The number of genes identified as unique to each library ranges from 199 transcripts (basket seedling) to 3,482 transcripts (female cones). In total, 10,026 transcripts were shared by all libraries. Genes differentially expressed in response to these provided insight on abiotic and biotic stress responses. To analyze orthologous sequences, we compared the translated sequences against 19 plant species, identifying 7,229 transcripts that clustered uniquely among the conifers. We have generated here a high quality transcriptome from one WPBR susceptible and one WPBR resistant sugar pine individual. Through the comprehensive tissue sampling and the depth of the sequencing achieved, detailed information on disease resistance can be further examined.
Maize is an amazingly diverse crop. A study in 2005¹ demonstrated that half of the genome sequence and one-third of the gene content between two inbred lines of maize were not shared. This diversity, which is more than two orders of magnitude larger than the diversity found between humans and chimpanzees, highlights the inability of a single reference genome to represent the full pan-genome of maize and all its variants. Here we present and review several efforts to characterize the complete diversity within maize using the highly accurate long reads of PacBio Single Molecule, Real-Time (SMRT) Sequencing. These methods provide a framework for a pan-genomic approach that can be applied to studies of a wide variety of important crop species.
In this PacBio User Group Meeting presentation, PacBio scientist Kristin Mars speaks about recent updates, such as the single-day library prep that’s now possible with the Iso-Seq Express workflow. She…
Analyses of the Complete Genome Sequence of the Strain Bacillus pumilus ZB201701 Isolated from Rhizosphere Soil of Maize under Drought and Salt Stress.
Bacillus pumilus ZB201701 is a rhizobacterium with the potential to promote plant growth and tolerance to drought and salinity stress. We herein present the complete genome sequence of the Gram-positive bacterium B. pumilus ZB201701, which consists of a linear chromosome with 3,640,542 base pairs, 3,608 protein-coding sequences, 24 ribosomal RNAs, and 80 transfer RNAs. Genome analyses using bioinformatics revealed some of the putative gene clusters involved in defense mechanisms. In addition, activity analyses of the strain under salt and simulated drought stress suggested its potential tolerance to abiotic stress. Plant growth-promoting bacteria-based experiments indicated that the strain promotes the salt tolerance of maize. The complete genome of B. pumilus ZB201701 provides valuable insights into rhizobacteria-mediated salt and drought tolerance and rhizobacteria-based solutions for abiotic stress in agriculture.
We present high quality, phased genome assemblies representative of taurine and indicine cattle, subspecies that differ markedly in productivity-related traits and environmental adaptation. We report a new haplotype-aware scaffolding and polishing pipeline using contigs generated by the trio binning method to produce haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle breeds. These assemblies were used to identify structural and copy number variants that differentiate the subspecies and we found variant detection was sensitive to the specific reference genome chosen. Six gene families with immune related functions are expanded in the indicine lineage. Assembly of the genomes of both subspecies from a single individual enabled transcripts to be phased to detect allele-specific expression, and to study genome-wide selective sweeps. An indicus-specific extra copy of fatty acid desaturase is under positive selection and may contribute to indicine adaptation to heat and drought.
Genome analysis and Hi-C assisted assembly of Elaeagnus angustifolia L., a deciduous tree belonging to Elaeagnaceae
Elaeagnus angustifolia L. is a deciduous tree of the Elaeagnaceae family. It is widely used in the study of abiotic stress tolerance in plants and for the improvement of desertification-affected land due to its characteristics of drought resistance, salt tolerance, cold resistance, wind resistance, and other environmental adaptation. Here, we report the complete genome sequencing using the Pacific Biosciences (PacBio) platform and Hi-C assisted assembly of E. angustifolia. A total of 44.27 Gb raw PacBio Sequel reads were obtained after filtering out low-quality data, with an average length of 8.64 Kb. Assembly using Canu gave an assembly length of 781.09 Mb, with a contig N50 of 486.92 Kb. A total of 39.56 Gb of clean reads was obtained, with a sequencing coverage of 75×, and Q30 ratio > 95.46%. The 510.71 Mb genomic sequence was mapped to the chromosome, accounting for 96.94% of the total length of the sequence, and the corresponding number of sequences was 269, accounting for 45.83% of the total number of sequences. The genome sequence study of E. angustifolia can be a valuable source for the comparative genome analysis of the Elaeagnaceae family members, and can help to understand the evolutionary response mechanisms of the Elaeagnaceae to drought, salt, cold and wind resistance, and thereby provide effective theoretical support for the improvement of desertification-affected land.
Strengths and potential pitfalls of hay-transfer for ecological restoration revealed by RAD-seq analysis in floodplain Arabis species
Achieving high intraspecific genetic diversity is a critical goal in ecological restoration as it increases the adaptive potential and long-term resilience of populations. Thus, we investigated genetic diversity within and between pristine sites in a fossil floodplain and compared it to sites restored by hay-transfer between 1997 and 2014. RAD-seq genotyping revealed that the stenoecious floodplain species Arabis nemorensis is co-occurring with individuals that, based on ploidy, ITS-sequencing and morphology, probably belong to the close relative Arabis sagittata, which has a documented preference for dry calcareous grasslands but has not been reported in floodplain meadows. We show that hay-transfer maintains genetic diversity for both species. Additionally, in A. sagittata, transfer from multiple genetically isolated pristine sites resulted in restored sites with increased diversity and admixed local genotypes. In A. nemorensis, transfer did not create novel admixture dynamics because genetic diversity between pristine sites was less differentiated. Thus, the effects of hay-transfer on genetic diversity also depend on the genetic makeup of the donor communities of each species, especially when local material is mixed. Our results demonstrate the efficiency of hay-transfer for habitat restoration and emphasize the importance of pre-restoration characterization of micro-geographic patterns of intraspecific diversity of the community to guarantee that restoration practices reach their goal, i.e. maximize the adaptive potential of the entire restored plant community. Overlooking these patterns may alter the balance between species in the community. Additionally, our comparison of summary statistics obtained from de novo and reference-based RAD-seq pipelines shows that the genomic impact of restoration can be reliably monitored in species lacking prior genomic knowledge.
Optimized Cas9 expression systems for highly efficient Arabidopsis genome editing facilitate isolation of complex alleles in a single generation.
Genetic resources for the model plant Arabidopsis comprise mutant lines defective in almost any single gene in reference accession Columbia. However, gene redundancy and/or close linkage often render it extremely laborious or even impossible to isolate a desired line lacking a specific function or set of genes from segregating populations. Therefore, we here evaluated strategies and efficiencies for the inactivation of multiple genes by Cas9-based nucleases and multiplexing. In first attempts, we succeeded in isolating a mutant line carrying a 70 kb deletion, which occurred at a frequency of ~1.6% in the T2 generation, through PCR-based screening of numerous individuals. However, we failed to isolate a line lacking Lhcb1 genes, which are present in five copies organized at two loci in the Arabidopsis genome. To improve efficiency of our Cas9-based nuclease system, regulatory sequences controlling Cas9 expression levels and timing were systematically compared. Indeed, use of DD45 and RPS5a promoters improved efficiency of our genome editing system by approximately 25-30-fold in comparison to the previous ubiquitin promoter. Using an optimized genome editing system with RPS5a promoter-driven Cas9, putatively quintuple mutant lines lacking detectable amounts of Lhcb1 protein represented approximately 30% of T1 transformants. These results show how improved genome editing systems facilitate the isolation of complex mutant alleles, previously considered impossible to generate, at high frequency even in a single (T1) generation.
Yellowhorn (Xanthoceras sorbifolium) is a species of the Sapindaceae family native to China and is an oil tree that can withstand cold and drought conditions. A pseudomolecule-level genome assembly for this species will not only contribute to understanding the evolution of its genes and chromosomes but also bring yellowhorn breeding into the genomic era. Here, we generated 15 pseudomolecules of yellowhorn chromosomes, on which 97.04% of scaffolds were anchored, using the combined Illumina HiSeq, Pacific Biosciences Sequel, and Hi-C technologies. The length of the final yellowhorn genome assembly was 504.2 Mb with a contig N50 size of 1.04 Mb and a scaffold N50 size of 32.17 Mb. Genome annotation revealed that 68.67% of the yellowhorn genome was composed of repetitive elements. Gene modelling predicted 24,672 protein-coding genes. By comparing orthologous genes, the divergence time of yellowhorn and its close sister species longan (Dimocarpus longan) was estimated at ~33.07 million years ago. Gene cluster and chromosome synteny analysis demonstrated that the yellowhorn genome shared a conserved genome structure with its ancestor in some chromosomes. This genome assembly represents a high-quality reference genome for yellowhorn. Integrated genome annotations provide a valuable dataset for genetic and molecular research in this species. We did not detect whole-genome duplication in the genome. The yellowhorn genome carries syntenic blocks from ancient chromosomes. These data sources will enable this genome to serve as an initial platform for breeding better yellowhorn cultivars. © The Author(s) 2019. Published by Oxford University Press.
Pecan (Carya illinoinensis) and Chinese hickory (C. cathayensis) are important commercially cultivated nut trees in the genus Carya (Juglandaceae), with high nutritional value and substantial health benefits. We obtained >187.22 and 178.87 gigabases of sequence, and ~288× and 248× genome coverage, to a pecan cultivar (“Pawnee”) and a domesticated Chinese hickory landrace (ZAFU-1), respectively. The total assembly size is 651.31 megabases (Mb) for pecan and 706.43 Mb for Chinese hickory. Two genome duplication events before the divergence from walnut were found in these species. Gene family analysis highlighted key genes in biotic and abiotic tolerance, oil, polyphenols, essential amino acids, and B vitamins. Further analyses of reduced-coverage genome sequences of 16 Carya and 2 Juglans species provide additional phylogenetic perspective on crop wild relatives. Cooperative characterization of these valuable resources provides a window to their evolutionary development and a valuable foundation for future crop improvement. © The Author(s) 2019. Published by Oxford University Press.
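Sequence yield, fold coverage, and genome size are related by coverage = total bases / genome size. The sketch below applies that relation to the figures in the abstract above as a rough consistency check. It treats the reported yields and coverages as exact, although the abstract gives them as a lower bound and approximations, and it is not the authors' method of genome size estimation. The function name is hypothetical.

```python
def implied_genome_size_mb(sequenced_gb: float, coverage: float) -> float:
    """Genome size (Mb) implied by total sequence (Gb) and fold coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    # 1 Gb = 1000 Mb; genome size = total bases / fold coverage.
    return sequenced_gb * 1000.0 / coverage


# Figures taken from the abstract; treated as exact for illustration only.
pecan_mb = implied_genome_size_mb(187.22, 288)
hickory_mb = implied_genome_size_mb(178.87, 248)
```

The implied sizes (roughly 650 Mb for pecan and 721 Mb for Chinese hickory) are close to the reported assembly sizes of 651.31 Mb and 706.43 Mb, which is the agreement one would expect if coverage was calculated against an estimated genome size similar to the assembly length.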