The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEHs) with enormous linkage disequilibrium (LD). This level of complexity, together with the inherent biases of short-read sequencing, makes it challenging to extract immune region haplotype information with reference-reliant shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software that extracts and/or imputes immune gene information from these datasets. Despite these efforts, fine mapping of the causal variants behind the well-documented associations of immune genes with cancer, drug-induced hypersensitivity, and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of the MHC and KIR gene clusters, as well as for calling correct genotypes of the genes within them. All genotype information is detected at the allele level, with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, from haplotypes, and from a completely assembled 5 Mb contig of the MHC region derived from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, which is key for interrogation without prior knowledge of ethnic background. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
De novo PacBio long-read assembled avian genomes correct and add to genes important in neuroscience and conservation research
To test the impact of high-quality genome assemblies on biological research, we applied PacBio long-read sequencing in conjunction with the new, diploid-aware FALCON-Unzip assembler to a number of bird species. These included: the zebra finch, for which a consortium-generated, Sanger-based reference exists, to determine how the FALCON-Unzip assembly would compare to the current best references available; Anna’s hummingbird genome, which had been assembled with short-read sequencing methods as part of the Avian Phylogenomics phase I initiative; and two critically endangered bird species (kakapo and ‘alala) of high importance for conservation efforts, whose genomes had not previously been sequenced and assembled.
From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis and genome annotation
A single gene may encode a surprising number of proteins, each with a distinct biological function. This is especially true in complex eukaryotes. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. The PacBio Isoform Sequencing (Iso-Seq™) method employs long reads to sequence transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for transcript reconstruction and inference. These long reads result in complete, unambiguous information about alternatively spliced exons, transcriptional start sites, and polyadenylation sites. This allows for the characterization of the full complement of isoforms within targeted genes, or across an entire transcriptome. Here we present improved genome annotations for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), using the Iso-Seq method. We present graphical user interface and command line analysis workflows for the data sets. From brain total RNA, we characterize more than 15,000 isoforms in each species, 9% and 5% of which were previously unannotated in hummingbird and zebra finch, respectively. We highlight one example where capturing full-length transcripts identifies additional exons and UTRs.
Incomplete annotation of genomes represents a major impediment to understanding biological processes, functional differences between species, and evolutionary mechanisms. Genes that are large, embedded within duplicated genomic regions, or associated with repeats are often difficult to study by short-read expression profiling and assembly. In addition, most genes in eukaryotic organisms produce alternatively spliced isoforms, which broaden the diversity of proteins encoded by the genome and are difficult to resolve with short-read methods. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. In contrast, Single Molecule, Real-Time (SMRT) Sequencing directly sequences full-length transcripts without the need for assembly or imputation. Here we apply the Iso-Seq method (long-read RNA sequencing) to detect full-length isoforms and the new IsoPhase algorithm to retrieve allele-specific isoform information for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata).
PAG PacBio Workshop: Comparative analyses of next generation technologies for generating chromosome-level reference genome assemblies
At PAG 2017, Rockefeller University’s Erich Jarvis offered an in-depth comparison of methods for generating highly contiguous genome assemblies, using hummingbird as the basis to evaluate a number of sequencing…