Single Molecule, Real-Time (SMRT) Sequencing was used to generate long reads for whole genome shotgun sequencing of the genome of the ‘alala (Hawaiian crow). The ‘alala is endemic to Hawaii and is the only surviving lineage of the crow family, Corvidae, in the Hawaiian Islands. The population declined to fewer than 20 individuals in the 1990s, and today this charismatic species is extinct in the wild. The species currently exists in only two captive breeding facilities, and reintroduction is scheduled to begin in the fall of 2016. Reintroduction efforts will be assisted by information from the ‘alala genome generated and assembled with SMRT technology, which will allow detailed analysis of genes associated with immunity, behavior, and learning. We present here best practices for achieving long reads with SMRT Sequencing for whole genome shotgun sequencing of complex plant and animal genomes such as the ‘alala genome. With recent advances in SMRTbell library preparation, P6-C4 chemistry, and 6-hour movies, the number of usable bases now exceeds 1 Gb per SMRT Cell. Read lengths averaging 10–15 kb can be routinely achieved, with the longest reads approaching 70 kb. Furthermore, more than 25% of usable bases are in reads greater than 30 kb, which is advantageous for generating contiguous draft assemblies with contig N50 values of up to 5 Mb. De novo assemblies of large genomes are now more tractable using SMRT Sequencing as a standalone technology. We also present guidelines for planning projects for the de novo assembly of large genomes.
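The contig N50 metric and the per-cell yield quoted above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, not the authors' planning guideline: the function names, the 1.2 Gb genome size, and the 50-fold coverage target are hypothetical values chosen for illustration, and only the 1 Gb per SMRT Cell yield comes from the abstract.

```python
import math


def contig_n50(lengths):
    """Return the contig N50: the length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def smrt_cells_needed(genome_size_bp, target_coverage,
                      yield_per_cell_bp=1_000_000_000):
    """Estimate how many SMRT Cells reach a target coverage.
    The default yield (1 Gb per cell) is the figure quoted in the
    abstract; genome size and coverage are supplied by the user."""
    return math.ceil(genome_size_bp * target_coverage / yield_per_cell_bp)


if __name__ == "__main__":
    # Toy contig lengths, purely illustrative.
    print(contig_n50([2, 3, 4, 5, 6]))
    # Hypothetical 1.2 Gb genome at a hypothetical 50-fold coverage.
    print(smrt_cells_needed(1_200_000_000, 50))
```

Real project planning would also account for read-length distribution and library quality, which this arithmetic ignores.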
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEH) with enormous linkage disequilibrium (LD). This level of complexity, together with the inherent biases of short-read sequencing, makes it challenging to extract immune region haplotype information using reference-reliant shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software that extracts and/or imputes immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity, and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as for calling correct genotypes of the genes contained within them. All genotype information is detected at the allele level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, from haplotypes, and from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, which is key for interrogation without prior knowledge of ethnic background. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
De novo PacBio long-read assembled avian genomes correct and add to genes important in neuroscience and conservation research
To test the impact of high-quality genome assemblies on biological research, we applied PacBio long-read sequencing in conjunction with the new, diploid-aware FALCON-Unzip assembler to a number of bird species. These included: the zebra finch, for which a consortium-generated, Sanger-based reference exists, to determine how the FALCON-Unzip assembly would compare to the current best references available; Anna’s hummingbird, whose genome had been assembled with short-read sequencing methods as part of the Avian Phylogenomics phase I initiative; and two critically endangered bird species (kakapo and ‘alala) of high importance for conservation efforts, whose genomes had not previously been sequenced and assembled.
From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis and genome annotation
A single gene may encode a surprising number of proteins, each with a distinct biological function. This is especially true in complex eukaryotes. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. The PacBio Isoform Sequencing (Iso-Seq™) method employs long reads to sequence transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for transcript reconstruction and inference. These long reads result in complete, unambiguous information about alternatively spliced exons, transcriptional start sites, and polyadenylation sites. This allows for the characterization of the full complement of isoforms within targeted genes, or across an entire transcriptome. Here we present improved genome annotations for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), using the Iso-Seq method. We present graphical user interface and command line analysis workflows for the data sets. From brain total RNA, we characterize more than 15,000 isoforms in each species, 9% and 5% of which were previously unannotated in hummingbird and zebra finch, respectively. We highlight one example where capturing full-length transcripts identifies additional exons and UTRs.
Incomplete annotation of genomes represents a major impediment to understanding biological processes, functional differences between species, and evolutionary mechanisms. Often, genes that are large, embedded within duplicated genomic regions, or associated with repeats are difficult to study by short-read expression profiling and assembly. In addition, most genes in eukaryotic organisms produce alternatively spliced isoforms, broadening the diversity of proteins encoded by the genome, which are difficult to resolve with short-read methods. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. In contrast, Single Molecule, Real-Time (SMRT) Sequencing directly sequences full-length transcripts without the need for assembly and imputation. Here we apply the Iso-Seq method (long-read RNA sequencing) to detect full-length isoforms and the new IsoPhase algorithm to retrieve allele-specific isoform information for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata).
PAG PacBio Workshop: Comparative analyses of next generation technologies for generating chromosome-level reference genome assemblies
At PAG 2017, Rockefeller University’s Erich Jarvis offered an in-depth comparison of methods for generating highly contiguous genome assemblies, using hummingbird as the basis to evaluate a number of sequencing…
In this webinar, Emily Hatas of PacBio shares information about the applications and benefits of SMRT Sequencing in plant and animal biology, agriculture, and industrial research fields. This session contains…
Richard Kuo’s research at the Roslin Institute exploring non-coding RNA of avian species requires high accuracy. SMRT Sequencing on the PacBio Sequel II System and the Iso-Seq method have given…
Identification and characterization of chicken circovirus from commercial broiler chickens in China.
Circoviruses are found in many species, including mammals, birds, lower vertebrates and invertebrates. To date, there are no reports of circovirus-induced diseases in chickens. In this study, we identified a new strain of chicken circovirus (CCV) by PacBio third-generation sequencing of samples from chickens with acute gastroenteritis on a commercial broiler farm in Shandong, China. The complete genome of CCV was verified by inverse PCR. Genomic analysis revealed that CCV encodes two inverse open reading frames (ORFs), and a potential stem-loop structure was present at the 5′ end with a structure typical of a circular virus. Phylogenetic tree analysis showed that CCV formed an independent branch between mammalian and avian circoviruses, and homology analysis indicated that the homology of CCV with 21 other known circoviruses was less than 40%. Thus, this CCV strain represents a new species in the genus Circovirus. The infection rate of CCV in 12 chickens with diarrhoea was 100%, but no CCV was found in healthy chickens, indicating that the novel CCV strain is highly associated with acute infectious gastroenteritis in chickens. The emergence of a novel CCV in commercial broiler chickens is highly concerning for the broiler industry. © 2019 Blackwell Verlag GmbH.
Chromosomal organization is relatively stable among avian species, especially with regard to sex chromosomes. Members of the large Sylvioidea clade, however, have a pair of neo-sex chromosomes which is unique to this clade and originates from a parallel translocation of a region of the ancestral 4A chromosome onto both the W and Z chromosomes. Here, we took advantage of this unusual event to study the early stages of sex chromosome evolution. To do so, we sequenced a female (ZW) of each of two Sylvioidea species, Zosterops borbonicus and Z. pallidus. Then, we organized the Z. borbonicus scaffolds along chromosomes and annotated genes. Molecular phylogenetic dating under various methods and calibration sets confidently confirmed the recent diversification of the genus Zosterops (1–3.5 million years ago), thus representing one of the most exceptional rates of diversification among vertebrates. We then combined genomic coverage comparisons of five males and seven females with homology to the zebra finch genome (Taeniopygia guttata) to identify sex chromosome scaffolds, as well as the candidate chromosome breakpoints for the two translocation events. We observed reduced levels of within-species diversity in both translocated regions and, as expected, even more so on the neoW chromosome. In order to compare the rates of molecular evolution in genomic regions of the autosomal-to-sex transitions, we then estimated the ratios of non-synonymous to synonymous polymorphisms (pN/pS) and substitutions (dN/dS). Based on both ratios, little or no contrast between autosomal and Z genes was observed, a very different outcome from the higher ratios observed at the neoW genes. In addition, we report significant changes in base composition for translocated regions on the W and Z chromosomes and a large accumulation of transposable elements (TE) on the new W region.
Our results revealed contrasting signals of molecular evolution associated with these autosome-to-sex chromosome transitions, with congruent signals of W chromosome degeneration yet surprisingly weak support for a fast-Z effect.
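The coverage comparison between males and females mentioned above rests on a simple expectation in birds: females are ZW and males are ZZ, so Z-linked scaffolds should show roughly half the male depth in females, while W-linked scaffolds should be covered in females only. The sketch below illustrates that logic; it is not the authors' pipeline, and the function name, the assumption that depths are normalized so that 1.0 is the autosomal median, and every threshold are hypothetical choices made for illustration.

```python
def classify_scaffold(female_depth, male_depth,
                      absent=0.1, w_min_female=0.3):
    """Classify a scaffold from mean female and male read depth.

    Depths are assumed to be normalized so that 1.0 equals the
    autosomal median in each sex. All cut-offs are illustrative.
    """
    if male_depth < absent:
        # No male coverage: W-like only if females are clearly covered.
        return "W-like" if female_depth >= w_min_female else "unclassified"
    ratio = female_depth / male_depth
    if 0.8 <= ratio <= 1.2:
        return "autosomal-like"   # equal dosage in both sexes
    if 0.3 <= ratio <= 0.7:
        return "Z-like"           # one copy in ZW females, two in ZZ males
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical scaffolds: (female depth, male depth).
    examples = {"scaf_a": (1.0, 1.0), "scaf_b": (0.5, 1.0), "scaf_c": (0.5, 0.0)}
    for name, (f, m) in examples.items():
        print(name, classify_scaffold(f, m))
```

In practice such calls would be cross-checked against homology with a reference such as the zebra finch genome, as the abstract describes, because repeats and recently translocated regions can blur the expected ratios.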
Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.
Background: New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies.
Results: We employed three gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: six with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and three with new assemblies based on re-scaffolding or Pacific Biosciences long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: seven for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further seven with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi.
Conclusions: Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our comparisons show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.
Abbreviations: AD, ADSEQ; AGO, AGOUTI-based; AGOUTI, annotated genome optimization using transcriptome information tool; ALN, alignment-based; CAMSA, comparative analysis and merging of scaffold assemblies tool; DP, dynamic programming; FISH, fluorescence in situ hybridization; GA, GOS-ASM; GOS-ASM, Gene order scaffold assembler; Kbp, kilobasepairs; Mbp, megabasepairs; OS, ORTHOSTITCH; PacBio, Pacific Biosciences; PB, PacBio-based; PHY, physical-mapping-based; RNAseq, RNA sequencing; QTL, quantitative trait loci; SYN, synteny-based.
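One simple way to picture a consensus set of scaffold adjacencies from three prediction methods is a majority vote: keep a scaffold pair only if at least two methods propose it. The sketch below shows that idea under loud assumptions. It is not the procedure used in the study, it ignores scaffold orientation, it does not resolve conflicts where one scaffold end gains several neighbours, and the function name, the `min_support` parameter, and the scaffold names are all hypothetical.

```python
from collections import Counter


def consensus_adjacencies(method_predictions, min_support=2):
    """Return adjacencies proposed by at least `min_support` methods.

    `method_predictions` is a list with one entry per method, each an
    iterable of (scaffold_a, scaffold_b) pairs. Pairs are treated as
    unordered, so (a, b) and (b, a) count as the same adjacency.
    """
    counts = Counter()
    for predictions in method_predictions:
        # De-duplicate within a method so each method votes once per pair.
        counts.update({frozenset(pair) for pair in predictions})
    return {adj for adj, votes in counts.items() if votes >= min_support}


if __name__ == "__main__":
    # Hypothetical predictions from three methods.
    method_1 = [("s1", "s2"), ("s2", "s3")]
    method_2 = [("s2", "s1"), ("s3", "s4")]
    method_3 = [("s1", "s2"), ("s3", "s4"), ("s5", "s6")]
    result = consensus_adjacencies([method_1, method_2, method_3])
    for adj in sorted(sorted(a) for a in result):
        print(adj)
```

Raising `min_support` to 3 would keep only adjacencies on which all methods agree, trading recall for confidence.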
Genome sequence analysis of 91 Salmonella Enteritidis isolates from mice caught on poultry farms in the mid 1990s.
A total of 91 draft genome sequences were used to analyze isolates of Salmonella enterica serovar Enteritidis obtained from feral mice caught on poultry farms in Pennsylvania. One objective was to find mutations disrupting open reading frames (ORFs) and another was to determine if ORF-disruptive mutations were present in isolates obtained from other sources. A total of 83 mice were obtained between 1995 and 1998. Isolates separated into two genomic clades and 12 subgroups due to 742 mutations. Nineteen ORF-disruptive mutations were found, and in addition, bigA had exceptional heterogeneity requiring additional evaluation. The TRAMS algorithm detected only 6 ORF disruptions. The sefD mutation was the most frequently encountered mutation and it was prevalent in human, poultry, environmental and mouse isolates. These results confirm previous assessments of the mouse as a rich source of Salmonella enterica serovar Enteritidis that varies in genotype and phenotype. Copyright © 2019. Published by Elsevier Inc.
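To make the notion of an ORF-disruptive mutation concrete, the sketch below checks a coding sequence for three common kinds of disruption: a frameshift, a premature stop codon, and a lost terminal stop codon. It is an illustration only and is neither the TRAMS algorithm nor the authors' method. The function name and the example sequences are hypothetical, the input is assumed to be the isolate sequence spanning the reference ORF coordinates on the coding strand, and start codons are deliberately not checked because bacteria use several.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_disruption(coding_seq):
    """Return a label for the first disruption found, or None if the
    sequence still reads as an intact ORF ending in a stop codon."""
    seq = coding_seq.upper()
    if len(seq) % 3 != 0:
        return "frameshift"          # indel not a multiple of three
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        return "missing stop"        # terminal stop codon lost
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return "premature stop"      # nonsense mutation inside the ORF
    return None


if __name__ == "__main__":
    # Toy sequences, not real Salmonella genes.
    for seq in ("ATGAAATAA", "ATGTAAAAATAA", "ATGAAAATAA", "ATGAAAAAA"):
        print(seq, orf_disruption(seq))
```

A length check on the extracted region is a crude proxy for a frameshift; a production tool would align the isolate to the reference and inspect each indel.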
The ruminants are one of the most successful mammalian lineages, exhibiting morphological and habitat diversity and containing several key livestock species. To better understand their evolution, we generated and analyzed de novo assembled genomes of 44 ruminant species, representing all six Ruminantia families. We used these genomes to create a time-calibrated phylogeny to resolve topological controversies, overcoming the challenges of incomplete lineage sorting. Population dynamic analyses show that population declines commenced between 100,000 and 50,000 years ago, which is concomitant with expansion in human populations. We also reveal genes and regulatory elements that possibly contribute to the evolution of the digestive system, cranial appendages, immune system, metabolism, body size, cursorial locomotion, and dentition of the ruminants. Copyright © 2019 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
Dynamic Changes in Metabolite Accumulation and the Transcriptome during Leaf Growth and Development in Eucommia ulmoides.
Eucommia ulmoides Oliver is widely distributed in China. This species has been used mainly in medicine due to the high concentration of chlorogenic acid (CGA), flavonoids, lignans, and other compounds in the leaves and bark. However, the categories of metabolites, dynamic changes in metabolite accumulation, and overall molecular mechanisms involved in metabolite biosynthesis during E. ulmoides leaf growth and development remain unknown. Here, a total of 515 analytes, including 127 flavonoids, 46 organic acids, 44 amino acid derivatives, 9 phenolamides, and 16 vitamins, were identified from four E. ulmoides samples using ultraperformance liquid chromatography-mass spectrometry (UPLC-MS) for widely targeted metabolites. The accumulation of most flavonoids peaked in growing leaves, followed by old leaves. UPLC-MS analysis indicated that CGA accumulation increased steadily to a high concentration during leaf growth and development, and rutin showed a high accumulation level in leaf buds and growing leaves. Based on single-molecule long-read sequencing technology, 69,020 transcripts and 2880 novel loci were identified in E. ulmoides. Expression analysis indicated that isoforms in the flavonoid biosynthetic pathway and flavonoid metabolic pathway were highly expressed in growing leaves and old leaves. Co-expression network analysis suggested a potential direct link between the flavonoid and phenylpropanoid biosynthetic pathways via the regulation of transcription factors, including MYB (v-myb avian myeloblastosis viral oncogene homolog) and bHLH (basic/helix-loop-helix). Our study predicts dynamic metabolic models during leaf growth and development and will support further molecular biological studies of metabolite biosynthesis in E. ulmoides. In addition, our results significantly improve the annotation of the E. ulmoides genome.