Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers to understand molecular mechanisms in evolution and gain insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed by the coverage biases associated with short- read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and obtain full-length cDNA transcripts that obsolete the need for assembly. With direct sequencing of DNA in real-time, this has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb). This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how to best leverage our long reads that have exceeded 20 kb in length for research investigations, as well as currently available bioinformatics strategies for analysis. Benefits for these applications are further realized with consistent use of size selection of input sample using the BluePippin™ device from Sage Science as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis and genome annotation
A single gene may encode a surprising number of proteins, each with a distinct biological function. This is especially true in complex eukaryotes. Short- read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. The PacBio Isoform Sequencing (Iso-Seq™) method employs long reads to sequence transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for transcript reconstruction and inference. These long reads result in complete, unambiguous information about alternatively spliced exons, transcriptional start sites, and poly- adenylation sites. This allows for the characterization of the full complement of isoforms within targeted genes, or across an entire transcriptome. Here we present improved genome annotations for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), using the Iso-Seq method. We present graphical user interface and command line analysis workflows for the data sets. From brain total RNA, we characterize more than 15,000 isoforms in each species, 9% and 5% of which were previously unannotated in hummingbird and zebra finch, respectively. We highlight one example where capturing full-length transcripts identifies additional exons and UTRs.
Maize is an amazingly diverse crop. A study in 20051 demonstrated that half of the genome sequence and one-third of the gene content between two inbred lines of maize were not shared. This diversity, which is more than two orders of magnitude larger than the diversity found between humans and chimpanzees, highlights the inability of a single reference genome to represent the full pan-genome of maize and all its variants. Here we present and review several efforts to characterize the complete diversity within maize using the highly accurate long reads of PacBio Single Molecule, Real-Time (SMRT) Sequencing. These methods provide a framework for a pan-genomic approach that can be applied to studies of a wide variety of important crop species.
Incomplete annotation of genomes represents a major impediment to understanding biological processes, functional differences between species, and evolutionary mechanisms. Often, genes that are large, embedded within duplicated genomic regions, or associated with repeats are difficult to study by short-read expression profiling and assembly. In addition, most genes in eukaryotic organisms produce alternatively spliced isoforms, broadening the diversity of proteins encoded by the genome, which are difficult to resolve with short-read methods. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. In contrast, Single Molecule, Real-Time (SMRT) Sequencing directly sequences full-length transcripts without the need for assembly and imputation. Here we apply the Iso-Seq method (long-read RNA sequencing) to detect full-length isoforms and the new IsoPhase algorithm to retrieve allele-specific isoform information for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata).
In a poster presented at AGBT 2017, Fritz Sedlazeck from Johns Hopkins University describes the comparison of genome assemblies produced using long-read PacBio sequencing and short-read sequencing with 10x Genomics…
AGBT Conference: A community effort using multiple technologies to produce a dramatically improved genome assembly of the Zika virus mosquito vector
At AGBT 2017, the Broad Institute’s Daniel Neafsey reported a large collaborative effort to sequence the mosquito that carries Zika virus. The team is using long-read PacBio sequencing to produce…
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Background New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from textquoteleftfinishedtextquoteright. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies.Results We employed three gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: six with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and three with new assemblies based on re-scaffolding or Pacific Biosciences long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: seven for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further seven with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi.Conclusions Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our comparisons show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.ADADSEQAGOAGOUTI-basedAGOUTIannotated genome optimization using transcriptome information toolALNalignment-basedCAMSAcomparative analysis and merging of scaffold assemblies toolDPdynamic programmingFISHfluorescence in situ hybridizationGAGOS-ASMGOS-ASMGene order scaffold assemblerKbpkilobasepairsMbpmegabasepairsOSORTHOSTITCHPacBioPacific BiosciencesPBPacBio-basedPHYphysical-mapping-basedRNAseqRNA sequencingQTLquantitative trait lociSYNsynteny-based.
Optimized Cas9 expression systems for highly efficient Arabidopsis genome editing facilitate isolation of complex alleles in a single generation.
Genetic resources for the model plant Arabidopsis comprise mutant lines defective in almost any single gene in reference accession Columbia. However, gene redundancy and/or close linkage often render it extremely laborious or even impossible to isolate a desired line lacking a specific function or set of genes from segregating populations. Therefore, we here evaluated strategies and efficiencies for the inactivation of multiple genes by Cas9-based nucleases and multiplexing. In first attempts, we succeeded in isolating a mutant line carrying a 70 kb deletion, which occurred at a frequency of ~?1.6% in the T2 generation, through PCR-based screening of numerous individuals. However, we failed to isolate a line lacking Lhcb1 genes, which are present in five copies organized at two loci in the Arabidopsis genome. To improve efficiency of our Cas9-based nuclease system, regulatory sequences controlling Cas9 expression levels and timing were systematically compared. Indeed, use of DD45 and RPS5a promoters improved efficiency of our genome editing system by approximately 25-30-fold in comparison to the previous ubiquitin promoter. Using an optimized genome editing system with RPS5a promoter-driven Cas9, putatively quintuple mutant lines lacking detectable amounts of Lhcb1 protein represented approximately 30% of T1 transformants. These results show how improved genome editing systems facilitate the isolation of complex mutant alleles, previously considered impossible to generate, at high frequency even in a single (T1) generation.
African cichlid fishes are well known for their rapid radiations and are a model system for studying evolutionary processes. Here we compare multiple, high-quality, chromosome-scale genome assemblies to elucidate the genetic mechanisms underlying cichlid diversification and study how genome structure evolves in rapidly radiating lineages.We re-anchored our recent assembly of the Nile tilapia (Oreochromis niloticus) genome using a new high-density genetic map. We also developed a new de novo genome assembly of the Lake Malawi cichlid, Metriaclima zebra, using high-coverage Pacific Biosciences sequencing, and anchored contigs to linkage groups (LGs) using 4 different genetic maps. These new anchored assemblies allow the first chromosome-scale comparisons of African cichlid genomes. Large intra-chromosomal structural differences (~2-28 megabase pairs) among species are common, while inter-chromosomal differences are rare (<10 megabase pairs total). Placement of the centromeres within the chromosome-scale assemblies identifies large structural differences that explain many of the karyotype differences among species. Structural differences are also associated with unique patterns of recombination on sex chromosomes. Structural differences on LG9, LG11, and LG20 are associated with reduced recombination, indicative of inversions between the rock- and sand-dwelling clades of Lake Malawi cichlids. M. zebra has a larger number of recent transposable element insertions compared with O. niloticus, suggesting that several transposable element families have a higher rate of insertion in the haplochromine cichlid lineage.This study identifies novel structural variation among East African cichlid genomes and provides a new set of genomic resources to support research on the mechanisms driving cichlid adaptation and speciation. © The Author(s) 2019. Published by Oxford University Press.
Divergent evolution in the genomes of closely related lacertids, Lacerta viridis and L. bilineata, and implications for speciation.
Lacerta viridis and Lacerta bilineata are sister species of European green lizards (eastern and western clades, respectively) that, until recently, were grouped together as the L. viridis complex. Genetic incompatibilities were observed between lacertid populations through crossing experiments, which led to the delineation of two separate species within the L. viridis complex. The population history of these sister species and processes driving divergence are unknown. We constructed the first high-quality de novo genome assemblies for both L. viridis and L. bilineata through Illumina and PacBio sequencing, with annotation support provided from transcriptome sequencing of several tissues. To estimate gene flow between the two species and identify factors involved in reproductive isolation, we studied their evolutionary history, identified genomic rearrangements, detected signatures of selection on non-coding RNA, and on protein-coding genes.Here we show that gene flow was primarily unidirectional from L. bilineata to L. viridis after their split at least 1.15 million years ago. We detected positive selection of the non-coding repertoire; mutations in transcription factors; accumulation of divergence through inversions; selection on genes involved in neural development, reproduction, and behavior, as well as in ultraviolet-response, possibly driven by sexual selection, whose contribution to reproductive isolation between these lacertid species needs to be further evaluated.The combination of short and long sequence reads resulted in one of the most complete lizard genome assemblies. The characterization of a diverse array of genomic features provided valuable insights into the demographic history of divergence among European green lizards, as well as key species differences, some of which are candidates that could have played a role in speciation. In addition, our study generated valuable genomic resources that can be used to address conservation-related issues in lacertids. © The Author(s) 2018. Published by Oxford University Press.
Genome assembly and annotation of the Trichoplusia ni Tni-FNL insect cell line enabled by long-read technologies.
Trichoplusiani derived cell lines are commonly used to enable recombinant protein expression via baculovirus infection to generate materials approved for clinical use and in clinical trials. In order to develop systems biology and genome engineering tools to improve protein expression in this host, we performed de novo genome assembly of the Trichoplusiani-derived cell line Tni-FNL.By integration of PacBio single-molecule sequencing, Bionano optical mapping, and 10X Genomics linked-reads data, we have produced a draft genome assembly of Tni-FNL.Our assembly contains 280 scaffolds, with a N50 scaffold size of 2.3 Mb and a total length of 359 Mb. Annotation of the Tni-FNL genome resulted in 14,101 predicted genes and 93.2% of the predicted proteome contained recognizable protein domains. Ortholog searches within the superorder Holometabola provided further evidence of high accuracy and completeness of the Tni-FNL genome assembly.This first draft Tni-FNL genome assembly was enabled by complementary long-read technologies and represents a high-quality, well-annotated genome that provides novel insight into the complexity of this insect cell line and can serve as a reference for future large-scale genome engineering work in this and other similar recombinant protein production hosts.
Schistosoma japonicum is a parasitic flatworm that causes human schistosomiasis, which is a significant cause of morbidity in China and the Philippines. A single draft genome was available for S. japonicum, yet this assembly is very fragmented and only covers 90% of the genome, which make it difficult to be applied as a reference in functional genome analysis and genes discovery.In this study, we present a high-quality assembly of the fluke S. japonicum genome by combining 20 G (~53X) long single molecule real time sequencing reads with 80 G (~ 213X) Illumina paired-end reads. This improved genome assembly is approximately 370.5 Mb, with contig and scaffold N50 length of 871.9 kb and 1.09 Mb, representing 142.4-fold and 6.2-fold improvement over the released WGS-based assembly, respectively. Additionally, our assembly captured 85.2% complete and 4.6% partial eukaryotic Benchmarking Universal Single-Copy Orthologs. Repetitive elements account for 46.80% of the genome, and 10,089 of the protein-coding genes were predicted from the improved genome, of which 96.5% have been functionally annotated. Lastly, using the improved assembly, we identified 20 significantly expanded gene families in S. japonicum, and those genes were primarily enriched in functions of proteolysis and protein glycosylation.Using the combination of PacBio and Illumina Sequencing technologies, we provided an improved high-quality genome of S. japonicum. This improved genome assembly, as well as the annotation, will be useful for the comparative genomics of the flukes and more importantly facilitate the molecular studies of this important parasite in the future.
The Genome of Cucurbita argyrosperma (Silver-Seed Gourd) Reveals Faster Rates of Protein-Coding Gene and Long Noncoding RNA Turnover and Neofunctionalization within Cucurbita.
Whole-genome duplications are an important source of evolutionary novelties that change the mode and tempo at which genetic elements evolve within a genome. The Cucurbita genus experienced a whole-genome duplication around 30 million years ago, although the evolutionary dynamics of the coding and noncoding genes in this genus have not yet been scrutinized. Here, we analyzed the genomes of four Cucurbita species, including a newly assembled genome of Cucurbita argyrosperma, and compared the gene contents of these species with those of five other members of the Cucurbitaceae family to assess the evolutionary dynamics of protein-coding and long intergenic noncoding RNA (lincRNA) genes after the genome duplication. We report that Cucurbita genomes have a higher protein-coding gene birth-death rate compared with the genomes of the other members of the Cucurbitaceae family. C. argyrosperma gene families associated with pollination and transmembrane transport had significantly faster evolutionary rates. lincRNA families showed high levels of gene turnover throughout the phylogeny, and 67.7% of the lincRNA families in Cucurbita showed evidence of birth from the neofunctionalization of previously existing protein-coding genes. Collectively, our results suggest that the whole-genome duplication in Cucurbita resulted in faster rates of gene family evolution through the neofunctionalization of duplicated genes. Copyright © 2019 The Author. Published by Elsevier Inc. All rights reserved.
Microsatellite marker set for genetic diversity assessment of primitive Chitala chitala (Hamilton, 1822) derived through SMRT sequencing technology.
In present study, single molecule-real time sequencing technology was used to obtain a validated set of microsatellite markers for application in population genetics of the primitive fish, Chitala chitala. Assembly of circular consensus sequencing reads resulted into 1164 sequences which contained 2005 repetitive motifs. A total of 100 sequences were used for primer designing and amplification yielded a set of 28 validated polymorphic markers. These loci were used to genotype n?=?72 samples from three distant riverine populations of India, namely Son, Satluj and Brahmaputra, for determining intraspecific genetic variation. The microsatellite loci exhibited high level of polymorphism with PIC values ranging from 0.281 to 0.901. The genetic parameters revealed that mean heterozygosity ranged from 0.6802 to 0.6826 and the populations were found to be genetically diverse (Fst 0.03-0.06). This indicated the potential application of these microsatellite marker set that can used for stock characterization of C. chitala, in the wild. These newly developed loci were assayed for cross transferability in another notopterid fish, Notopterus notopterus.