Fishing for Human-Specific Isoforms
Wednesday, November 7, 2018
When scientists want to investigate human-specific evolution, the best place to start is often a comparison with our closest cousins, the great apes. Recent high-quality PacBio genome assemblies have provided solid new foundations for these projects, but gene annotation has proven challenging, particularly for segmental duplications, the sets of gene families duplicated in the human lineage since our last common ancestor with the chimpanzee. Could these photocopied gene families be involved in human-specific traits like the development of a larger frontal cortex?
Until now, technical limitations have stood in the way of answering that question. Two common methods to quantify mRNA abundance, the expression microarray and short-read RNA sequencing, are not very useful when comparing paralogs that diverged so recently. Many of the human-specific segmental duplications are more than 98% identical at the genomic level, so a microarray probe or a short read will often match more than one copy equally well and cannot be assigned to a single paralog.
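To get a feel for why read length matters here, the sketch below estimates how often a read would overlap no paralog-distinguishing base at all. It is a back-of-the-envelope model, not anything taken from the study: it assumes the differences between two copies are independent and spread evenly along the sequence, and the read lengths (100 bases for a short read, 2,000 bases for a full-length transcript read) are illustrative assumptions.

```python
def prob_read_uninformative(identity: float, read_length: int) -> float:
    """Probability that a read covers no base that differs between two paralogs.

    Simplifying assumption (not from the study): each base differs between
    the copies independently with probability (1 - identity).
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    if read_length < 0:
        raise ValueError("read_length must be non-negative")
    return identity ** read_length


if __name__ == "__main__":
    # Illustrative identities and read lengths; none of these are measured values.
    for identity in (0.98, 0.99, 0.995):
        for read_length in (100, 2000):
            p = prob_read_uninformative(identity, read_length)
            print(f"identity={identity:.3f} length={read_length:>4} "
                  f"P(no distinguishing base)={p:.3g}")
```

Under these assumptions, roughly 13% of 100-base reads would carry no distinguishing base even at exactly 98% identity, and about 61% at 99.5% identity, while a 2,000-base read would almost always span at least one difference. Real duplications have unevenly distributed differences, so the true numbers will vary by locus.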
Additionally, what may appear to be an exact copy of a gene is often not so simple. In humans, a segmental duplication can copy a gene together with its genomic context, keeping all of the regulatory information and effectively doubling that gene’s dose. But a segment can also be copied in a way that loosens the selective pressure on one copy, allowing mutations to accumulate and even relegating that copy to the “lost function” or pseudogene category. A duplication can even place the new copy in a different regulatory landscape or next to another gene, allowing natural gene fusion events to occur.
While a handful of human-specific duplicate genes have seen careful mRNA characterization to distinguish the expressed paralogs, the fate of many of these genes remains unknown. Since automated annotations cannot be relied upon in these highly identical regions, a recent study published in Genome Research by Dougherty and Underwood et al. took on the technical hurdles of characterizing mRNAs with isoform-level resolution for the human-specific duplicate genes.
Those hurdles were overcome largely with the PacBio Iso-Seq method, a long-read sequencing method that reads full-length isoforms. RNA from adult and developing human brain tissue was used as starting material for a modified Iso-Seq method that incorporates barcodes at both ends of the cDNA molecules. The brain cDNAs of interest were enriched using hybridization-capture techniques with probes designed against the exons of duplicate genes. This meant that isoform information for each locus could be effectively purified in cDNA form prior to sequencing.
Of the 19 gene families examined, eight showed a nearly identical photocopy of the original gene, while the others showed patterns of gene truncation or fusion to a neighboring gene. Most of these latter cases represent new gene innovations that appear to be present only in humans.
One interesting case highlighted by the study is CD8B and its paralog, CD8B2. While the CD8B2 paralog used to be considered a pseudogene, the new isoform data indicate that the protein open reading frame is intact, with just a few amino acid changes relative to CD8B.
With better annotations in hand, the researchers went back and queried GTEx, a large RNA-seq data set, to see which tissues might express these newly discovered duplicate gene isoforms. Surprisingly, most of the reads that could be uniquely assigned to the CD8B2 paralog came from brain tissue rather than from blood, where CD8B is expressed. The scientists deduced that the segmental duplication event that created CD8B2 did not bring along the regulatory information that drives CD8B expression in the blood; instead, the copy landed in a spot with mild transcriptional activity in the brain, resulting in a CD8B2 mRNA with a complete ORF that is expressed in the cortex.
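The idea of a read being "uniquely assignable" to one paralog can be illustrated with a toy example. The sketch below is not the study's pipeline, which is not described in this post; it assumes a hypothetical table of diagnostic positions where two paralogs (labeled "A" and "B") are known to differ, and it assigns a read only when every diagnostic base it covers agrees with a single paralog. The positions, bases, and function name are all made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical diagnostic sites: position in a shared alignment coordinate
# -> (base in paralog A, base in paralog B). Values are invented.
DIAGNOSTIC_SITES: Dict[int, Tuple[str, str]] = {3: ("A", "G"), 10: ("C", "T")}


def assign_read(
    read_start: int,
    read_seq: str,
    sites: Dict[int, Tuple[str, str]] = DIAGNOSTIC_SITES,
) -> Optional[str]:
    """Return "A" or "B" if the read supports exactly one paralog, else None.

    Assumes an ungapped read placed at read_start on the shared coordinate.
    """
    votes_a = 0
    votes_b = 0
    for position, (base_a, base_b) in sites.items():
        offset = position - read_start
        if 0 <= offset < len(read_seq):  # the read covers this site
            base = read_seq[offset]
            if base == base_a:
                votes_a += 1
            elif base == base_b:
                votes_b += 1
    if votes_a > 0 and votes_b == 0:
        return "A"
    if votes_b > 0 and votes_a == 0:
        return "B"
    return None  # no diagnostic site covered, or conflicting evidence


if __name__ == "__main__":
    print(assign_read(0, "TTTATTTTTTCT"))  # covers both sites
    print(assign_read(4, "TTTT"))          # covers no diagnostic site
```

A read that covers no diagnostic site returns None, which is the common fate of short reads in highly identical regions. A longer read covers more sites and is therefore more likely to be assignable.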
With this modified Iso-Seq method, scientists who know just a little about a gene can still find out a great deal about its expressed isoforms. Along with other recent capture methods, it should be broadly applicable for anyone studying extremely close paralogs or haplotype-specific isoforms that are difficult to distinguish using short-read sequences alone.