This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
When scientists want to investigate human-specific evolution, the best place to start is often with a comparison to our closest cousins, the great apes. Some recent high-quality PacBio genome assemblies have provided solid new foundations for these projects, but gene annotation has proven challenging, particularly for segmental duplications — sets of gene families duplicated in the human lineage relative to our last common ancestor with the chimpanzee. Could these photocopied gene families be involved in human-specific traits like the development of a larger frontal cortex?
Until now, technical limitations have stood in the way of answering that question. Two common methods to quantify mRNA abundance, the expression microarray and short-read RNA sequencing, are not very useful when comparing paralogs that diverged so recently. Many of the human-specific segmental duplications are more than 98% identical on the genomic level.
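A back-of-the-envelope sketch can show why read length matters for such similar paralogs. It assumes, purely for illustration, that the differences between two copies are spread uniformly and independently along the sequence, which real duplications do not strictly obey. The function name, identity values, and read lengths are hypothetical choices, not figures from the study.

```python
def prob_read_is_informative(identity: float, read_length: int) -> float:
    """Probability that a read overlaps at least one paralog-distinguishing base.

    Illustrative assumption: each base differs between the two copies
    independently with probability (1 - identity).
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    if read_length < 0:
        raise ValueError("read_length must be non-negative")
    return 1.0 - identity ** read_length


if __name__ == "__main__":
    # Compare a short read with a full-length transcript read at several identities.
    for identity in (0.98, 0.995, 0.999):
        for read_length in (100, 2000):
            p = prob_read_is_informative(identity, read_length)
            print(f"identity={identity:.3f} read_length={read_length:>5} bp -> {p:.3f}")
```

Under this simplified model, the chance that a read carries any distinguishing base grows quickly with length, which is the intuition behind using full-length reads for near-identical copies.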
Additionally, what may appear like an exact copy of a gene is often not so simple. In humans, segmental duplications can copy-paste in a genomic context to keep all of the regulatory information, effectively doubling up on that gene’s dose. But segments can also copy in a manner that loosens the selective pressure on one copy, allowing mutations to accumulate and even relegating one copy to the “lost function” or pseudogene category. Duplications can even place the new copy in a different regulatory landscape or adjacent to a neighboring gene, allowing natural gene fusion events to occur.
While a handful of human-specific duplicate genes have seen careful mRNA characterization to distinguish the expressed paralogs, the fate of many of these genes remains unknown. Since automated annotations cannot be relied upon in these highly identical regions, a recent study published in Genome Research by Dougherty and Underwood et al. took on the technical hurdles of characterizing mRNAs with isoform-level resolution for the human-specific duplicate genes.
Those hurdles were overcome largely with the PacBio Iso-Seq method, a long-read sequencing method that reads full-length isoforms. RNA from adult and developing human brain tissue was used as starting material for a modified Iso-Seq method that incorporates barcodes at both ends of the cDNA molecules. The brain cDNAs of interest were enriched using hybridization-capture techniques with probes designed against the exons of duplicate genes. This meant that isoform information for each locus could be effectively purified in cDNA form prior to sequencing.
Eight of the 19 gene families showed a nearly identical photocopy of the original gene, while the others showed patterns of gene truncation or fusion to a neighboring gene. Most of these latter cases represent new gene innovations that appear to be present only in humans.
One interesting case highlighted by the study is CD8B and its paralog, CD8B2. While the CD8B2 paralog used to be considered a pseudogene, the new isoform data indicate that the protein open reading frame is intact, with just a few amino acid changes relative to CD8B.
With better annotations in hand, the researchers went back and queried a large RNA-seq data set called GTEx to see which tissues might express these newly discovered duplicate gene isoforms. Surprisingly, most of the reads that were uniquely assignable to the CD8B2 paralog were found in brain tissue, not, like CD8B, in the blood. The scientists deduced that the segmental duplication event that created CD8B2 did not bring along the regulatory information from CD8B that drives its expression in the blood; instead, it landed in a spot with mild transcriptional activity in the brain, resulting in a complete ORF encoding mRNA for CD8B2 that is expressed in the cortex.
With this modified Iso-Seq method, scientists who know just a little about a gene can still find out a great deal about its expressed isoforms. Along with other recent capture methods, this should be broadly applicable to those interested in studying extremely close paralogs or haplotype-specific isoforms that are difficult to distinguish using short read sequences alone.
We’re pleased to announce the winner of the 2018 Microbial Genomics SMRT Grant. Mark Webber, Research Leader at Quadram Institute Bioscience in the UK, will get free SMRT Sequencing and analysis from our certified service provider, the Genomics Resource Center at the University of Maryland. His goal is to further a project designed to understand how bacteria on the skin of premature babies in neonatal intensive care units acquire resistance to the antiseptics used to prevent infections. We spoke with Mark to learn more about his work and how the SMRT Grant will make a difference.
Q: What’s your research focus?
A: We’re interested in how bacteria deal with stress: how do bugs become resistant to drugs? We’re particularly interested in Staphylococci and how they deal with the antiseptics that we use in hospitals. We’ve looked at patients in intensive care in the UK, examining isolates over time to see how susceptible they were to two antiseptics in a large teaching hospital that had changed its antiseptic use. What we could see was that as you used more and more chlorhexidine, the bugs were more and more tolerant. When they introduced another antiseptic, octenidine, a population quickly emerged that was more tolerant of octenidine.
Q: What inspired the proposal you submitted for this SMRT Grant?
A: We have been studying premature babies in neonatal intensive care. Every year about 450,000 premature babies are born in the US and 60,000 are born in the UK. About 15,000 of these will suffer from late-onset sepsis, mainly caused by Staphylococci infections. These babies are often very immune-suppressed and the risk of death from sepsis is quite high. The babies almost all have peripheral catheters to allow feeding and drugs to be administered, but these are a potential route for bugs to get into their blood. Therefore, to prevent infection from bugs living on the skin there’s a lot of antiseptic use. We wanted to see how antiseptic tolerance might be developing amongst the bugs living on the skin of these premature babies.
Q: What have you learned so far about this issue?
A: At hospitals in the UK and with our collaborators in Germany, we collected isolates each week from hundreds of babies over a three-month period. We now have a total of 1,300 isolates of Staphylococci which we have tested for their susceptibility to chlorhexidine, the antiseptic used in the UK, and octenidine, the antiseptic used in Germany. We have seen that babies in intensive care appear to pick up Staphylococci with high tolerance to antiseptics very quickly and our UK population is particularly robust in dealing with chlorhexidine which we use.
Q: How will you use the sequencing capacity from your SMRT Grant?
A: After studying all 1,300 isolates, we want to understand how related these bacteria are and take about 20 representatives of the major branches of the Staphylococcal family tree and get really high-quality, full genome sequences. With the PacBio long reads, we will be able to see whether there are common mobile elements that explain the acquisition of tolerance. We can also look for duplications, rearrangements, and methylation patterns which may be responsible for antiseptic adaptation. PacBio sequencing will give us a higher quality picture of what’s going on than other technologies, and this SMRT Grant will let us capture most of the major branches of the phylogeny that we’re hoping to see.
Q: What do you hope to learn when you’ve had a chance to analyze these new reference genomes?
A: Doing the bioinformatics analysis will not take us that long — it should be a matter of a few days to get a pretty decent picture of what differs between antiseptic tolerant and sensitive strains. We hope to understand the mechanisms of resistance that we can then compare to isolates collected in other parts of the world. That will help us determine whether antiseptic use is likely to fail, whether different antiseptics are more or less likely to select for tolerance and whether antiseptic tolerance is linked to antibiotic resistance. Together this information should help us understand whether we need to change the way we use antiseptics to keep babies safe.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for another chance to win SMRT Sequencing. Thank you to our co-sponsor, the University of Maryland’s Genomics Resource Center, for supporting the Microbial Genomics SMRT Grant Program!
It’s one of the most ambitious sequencing projects ever attempted: sequencing the genomes of all 1.5 million known species of animals, plants, protozoa and fungi on Earth. SMRT Sequencing will play a major part.
A greater understanding of Earth’s biodiversity and the responsible stewarding of its resources are among the most crucial scientific and social challenges of the new millennium, and overcoming these challenges requires new scientific knowledge of evolution and interactions among millions of the planet’s organisms, said Earth BioGenome Project Chair Professor Harris Lewin of the University of California, Davis.
“The Earth BioGenome Project can be viewed as the infrastructure for new biology,” Lewin added. “Having the roadmap, the blueprints for all living species of eukaryotes will be a tremendous resource for new discoveries, understanding the rules of life, how evolution works, new approaches for the conservation of rare and endangered species, and provide new resources for people in agricultural and medical fields.”
Headed by the Wellcome Sanger Institute, the Darwin Tree of Life Project will explore the genetic code of 66,000 species in the UK. The Natural History Museum in London, Royal Botanic Gardens, Kew, Earlham Institute, Edinburgh Genomics, University of Edinburgh, EMBL-EBI and others will collaborate in sample collection, DNA sequencing, assembling and annotating genomes, and storing the data.
SMRT Sequencing was previously used to decode the genomes of 25 UK species for the first time in a project to mark the 25th anniversary of the Wellcome Sanger Institute. The insights gained from the 25 Genomes Project form the basis for scaling up to sequence the genomes of 66,000 species.
“We are honored to be an integral part of the Darwin Tree of Life Project as it deploys the power of our sequencing technology on a much broader scale,” said our CSO, Jonas Korlach. “With the recent and ongoing improvements in our technology, we are well positioned to support the needs for scaling the sequencing and assembling of the genomes for the large number of species targeted by this project as well as the Earth BioGenome Project.”
Genome-wide association studies (GWAS) may be powerful tools for the identification of genes underlying complex traits, but what if you have an incredibly complex, uncharacterized genome, with no sequenced progenitor or related species?
A team of scientists from the Chinese Academy of Agricultural Sciences in Changsha, China came up with a solution: a transcriptome-referenced association study (TRAS), powered by our Iso-Seq method.
The approach, outlined in this DNA Research paper, utilized a transcriptome generated by SMRT Sequencing as a reference to score population variation at both transcript sequence and expression levels. The team, led by Touming Liu and first author Xiaojun Chen, used the approach to study the shape of garlic cloves.
Cultivated globally for more than 5,000 years as a vegetable, spice, and medicinal plant, garlic (Allium sativum L.) is a diploid species with a giant genome: ~15.9 Gb, 32 times larger than rice. The most widely consumed part of the plant, the bulb, consists of several cloves that are actually abnormal axillary buds rarely found among vascular plants. The shape of these cloves is an economically important quantitative trait, but its genetic mechanisms are poorly understood.
Plant quantitative traits are typically controlled by several major and minor effect genes that constitute complex regulatory networks, and characterization of these traits is time-consuming and labor-intensive when using traditional mapping methods that involve the identification and cloning of dozens of trait-control genes. Previous studies that conducted de novo assembly of the garlic transcriptome were able to produce more than 120,000 transcripts, but many were considered incomplete, with an average length of less than 600 bp, of which only 35–42% were functionally annotated.
So, Liu and colleagues collected bulb samples from 92 landraces in China and 10 from other countries, and selected one candidate from China for Iso-Seq long-read RNA transcript sequencing. From this, they created a high-quality reference transcriptome that consisted of 36,321 transcripts of lengths ranging from 120 to 4,803 bp, accounting for 54.48 million bases in total.
The Iso-Seq method “significantly improved the transcriptome quality—the mean length of the transcripts was 1,500 bp; more than 70% of the transcripts had a complete 3′ end; and only less than 1% of the transcripts remained functionally unannotated,” the authors wrote.
To characterize the genotypes of the rest of the 102 landraces in both sequence and expression, they sequenced the transcriptomes of developing bulbs in the population. The read sequences were aligned to the reference transcriptome, and the variation in both sequence (SNPs) and gene expression (GE) of transcripts was scored.
The team ultimately identified 22 candidate transcripts, most of which showed extensive interactions. Eight transcripts were long non-coding RNAs (lncRNAs), and the others encoded proteins involved mainly in carbohydrate metabolism and protein degradation. These findings can provide a basis for improving clove shape traits in garlic breeding, as well as validate the TRAS approach, the authors said.
“Our results demonstrate that TRAS is a useful approach for association studies, and its independence from a reference genome will extend the applicability of association studies to a broad range of species,” they wrote. TRAS also offered additional advantages in comparison with the GWAS approach, the team noted.
It can directly detect candidate transcripts for a trait by integrating sequence data with expression data, in contrast to GWAS, which identifies only a genome region in which markers are in linkage disequilibrium for the loci controlling the trait. Also, unlike GWAS, after identifying a genome region based on the sequence variation, TRAS uses the information on transcript expression in the identified region to determine whether or not the corresponding transcript is associated with a given trait. And TRAS can detect potential interaction of transcripts by eQTL analysis, and the potential relationship among the transcripts is helpful for further validation of these interactions.
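The idea of scoring each transcript on both sequence and expression variation can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: it uses a plain Pearson correlation where a real association study would use proper statistical tests, multiple-testing correction, and population-structure controls. The function names, the 0/1/2 allele coding, and the data layout are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no detectable association
    return cov / sqrt(var_x * var_y)


def rank_transcripts(expression, genotypes, trait):
    """Score each transcript by expression-trait and SNP-trait correlation.

    expression: {transcript_id: [expression level per sample]}
    genotypes:  {transcript_id: [alternate-allele count (0/1/2) per sample]}
    trait:      [trait value per sample], in the same sample order
    Returns (transcript_id, expr_r, snp_r) tuples, strongest |r| first.
    """
    scores = []
    for tid, levels in expression.items():
        expr_r = pearson(levels, trait)
        snp_r = pearson(genotypes[tid], trait) if tid in genotypes else 0.0
        scores.append((tid, expr_r, snp_r))
    scores.sort(key=lambda s: max(abs(s[1]), abs(s[2])), reverse=True)
    return scores


if __name__ == "__main__":
    # Made-up values for four samples; not data from the garlic study.
    trait = [1.0, 2.0, 3.0, 4.0]
    expression = {"tx_A": [10.0, 20.0, 30.0, 40.0], "tx_B": [5.0, 5.0, 6.0, 5.0]}
    genotypes = {"tx_A": [0, 0, 1, 2], "tx_B": [1, 1, 1, 1]}
    for tid, expr_r, snp_r in rank_transcripts(expression, genotypes, trait):
        print(f"{tid}: expression r={expr_r:.2f}, SNP r={snp_r:.2f}")
```

Because every score is attached to a transcript rather than a genomic interval, the ranked output points at candidate transcripts directly, which mirrors the contrast with GWAS drawn above.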
It’s a murder mystery of massive proportion, albeit on a miniature scale: Male-killing among several species of insects, caused by selfish symbiotic bacteria.
Swiss researchers believe they have finally solved a question that has stumped scientists for decades, with potential implications for pest and infection control.
In a recent Nature publication, Toshiyuki Harumoto and Bruno Lemaitre of the Global Health Institute at the École Polytechnique Fédérale de Lausanne (EPFL) in Lausanne, Switzerland, have reported their findings regarding a toxin in Spiroplasma poulsonii, one of several types of symbiotic bacteria that manipulate host reproduction to spread in a population by distorting host sex ratios.
A notable feature of S. poulsonii is male killing in Drosophila, whereby the sons of infected female hosts are selectively killed during development. Male killing has also independently evolved in at least six bacterial taxa, including Wolbachia, which is being investigated by the Gates Foundation as a potential tool to control the transmission of mosquito-borne viruses such as dengue, chikungunya and Zika.
Although male killing in Drosophila caused by S. poulsonii has been studied since the 1950s, its underlying mechanism was not known. Previous studies attributed the selective killing of male progeny to an unknown substance called ‘androcidin’, assumed to be secreted by the bacterium. Further identification of the toxin was hampered by a lack of practical methods to characterize it, but SMRT Sequencing and a chance discovery enabled the Swiss team to pinpoint the protein responsible.
“Our study has uncovered a bacterial protein that affects host cellular machinery in a sex-specific way, which is likely to be the long-searched-for factor responsible for S. poulsonii-induced male killing,” the authors write.
While studying S. poulsonii, the researchers unexpectedly identified a mutant strain that showed reduced male-killing ability (MSRO-SE; the partial male-killing strain), where almost half of the male progeny survived. To identify the genetic basis of this reduced male killing, they sequenced the genome of MSRO-SE and compared it with that of an androcide competent strain, MSRO-H99.
They found a candidate gene that was altered in the compromised strain, encoding a 1,065-amino-acid protein with ankyrin repeats and an OTU (ovarian tumor deubiquitinase) domain, which they named Spaid (S. poulsonii androcidin).
“Overexpression of Spaid in D. melanogaster kills males but not females, and induces massive apoptosis and neural defects, recapitulating the pathology observed in S. poulsonii-infected male embryos,” the authors write. Their data suggests that Spaid targets the dosage compensation machinery on the male X chromosome to mediate its effects, the paper states.
The identification of Spaid in S. poulsonii could also boost the study of androcidins in other symbiont bacteria, including Wolbachia, which has been notoriously difficult to sequence. Wolbachia contains ankyrin-repeat proteins like Spaid, but in much higher frequency than S. poulsonii. Wolbachia genomes encode more than 20 ankyrin-repeat proteins, whereas Spaid is the sole ankyrin-repeat protein in the S. poulsonii genome. While fully investigating all the ankyrin-repeat proteins in Wolbachia would be an ambitious project, the findings from S. poulsonii suggest it might be a fruitful place to search for androcidin activity.
“A thorough understanding of the reproductive manipulations induced by symbionts would not only provide novel insights into fundamental aspects of development, sex determination, and their evolution in insects, but could also provide clues to control insect populations,” the authors conclude.
Today we’re pleased to announce the release of Sequel System 6.0, including new software, consumable reagents and a new SMRT Cell.
Combined, the enhancements in the release improve the performance and affordability of Single Molecule, Real-Time (SMRT) Sequencing by providing individual long reads with greater than 99% accuracy, increasing the throughput up to 50 Gb per SMRT Cell, and delivering average read lengths up to 100,000 base pairs, depending on insert size. These improvements are expected to greatly enhance the accuracy and cost effectiveness of applications such as whole genome sequencing, human structural variant detection, targeted sequencing and RNA transcript isoform sequencing (Iso-Seq method).
- For amplicon and RNA sequencing projects, customers can generate up to 500,000 single-molecule reads with high fidelity (>99% single-molecule accuracy); and
- For whole genome sequencing projects, users can achieve up to 20 Gb per SMRT Cell with average read lengths up to 30 kb and high consensus accuracy (>99.999%).
Since SMRT Sequencing technology was first commercialized in 2011, we have increased the throughput per SMRT Cell by 2,000-fold. These ongoing throughput increases provide a significant cost savings for sequencing projects in the human, plant and animal markets, which allows researchers the opportunity to increase the size and scope of their projects.
“These enhancements represent the most significant improvement in terms of read length, throughput and accuracy that we have ever achieved in a single product release,” said Chief Executive Officer Michael Hunkapiller, Ph.D. “Customers can now enjoy unprecedented capabilities with a new paradigm in long-read sequencing — highly accurate single-molecule reads. Further, many users no longer need to trade off between read length and accuracy, because it is now possible to achieve Sanger-quality reads as long as 15 kb.”
Jonas Korlach, Ph.D., Chief Scientific Officer, added: “Our latest Sequel System improvements open new opportunities for comprehensively mapping all human genetic variation — from SNVs to indels to SVs — in a single assay and pave the way for a new era of population-scale, high-quality human genome studies.”
We’re proud to announce the release of the most contiguous diploid human genome assembly of a single individual to date, representing the nearly complete DNA sequence from all 46 chromosomes inherited from both parents. The sample used was derived from a Puerto Rican female who has been included in population genetics studies such as the 1000 Genomes Project. The phased diploid assembly will give unprecedented views of population-specific variation through the long-range resolution of maternal and paternal haplotypes.
This work is part of a larger effort in the field of personalized medicine and human genomics to add ethnic diversity to the available human reference genomes. More than 40 global initiatives are currently underway to apply de novo assembly methods to individuals representing multiple ethnic populations. Notable among these initiatives is the McDonnell Genome Institute at Washington University, which has contributed 11 high-quality PacBio genomes for individuals representing populations from Africa, Asia, Europe, and the Americas.
Our approach to the Puerto Rican genome relied upon the current best practices for de novo assembly while also pushing read lengths ever longer and adding new methods and data types to better tackle the problem of diploid genome assembly.
The Puerto Rican sample was sequenced on the Sequel System with 2.1 chemistry and v5.1 software using a large insert library aggressively size selected to 35 kb. The resulting contig assembly totaling 2.89 Gb has the highest contiguity to date, with half of the genome contained in gapless contigs longer than 27 Mb. These results are even better than the consistently stellar assemblies MGI has been producing, which typically have contig N50s of 20-25 Mb.
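The contiguity figures above are N50-style statistics. As a minimal sketch, assuming the common definition of N50 (the length at which contigs of that size or longer contain at least half of the assembled bases), the calculation looks like this. The function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths in Mb, chosen only to make the arithmetic easy to follow.
    toy_contigs_mb = [10, 8, 6, 4, 2]
    print(n50(toy_contigs_mb))
```

With the toy lengths, the total is 30 and the running sum first reaches 15 at the second-longest contig, so the N50 is 8.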
Like the MGI genomes, the new Puerto Rican genome was assembled using FALCON, but with a newer version of FALCON-Unzip that includes algorithmic improvements to phasing and accuracy. Nearly 85% of the genome was resolved as maternal and paternal haplotypes, with more than 600 Mb of sequence in haplotype blocks longer than 1 Mb. An analysis of variants within phase blocks indicates high accuracy with 95% of SNPs showing concordant inheritance from a single parent.
In addition to the improvements in PacBio’s FALCON-Unzip assembler, the Puerto Rican assembly includes the novel use of Hi-C data to extend phasing between haplotype blocks. In collaboration with Phase Genomics, PacBio developed a new method for enhanced phasing that does not rely on family trio data. The new method, called FALCON-Phase, maps ultra-long range Hi-C reads to the FALCON-Unzip contigs to extend phasing to the contig scale. The Hi-C data was also used to scaffold the phased contigs before performing another round of phasing on the scaffolds.
The resulting assembly consists of 46 chromosome-scale scaffolds, representing the maternal and paternal chromosome sets for the Puerto Rican individual. Each set of 23 scaffolds contains only 511 gaps and totals 2.83 Gb. The remainder of each haploid genome is contained in 260 scaffolds totaling 63 Mb.
Genome: https://www.ncbi.nlm.nih.gov/genome/?term=RBJD00000000 (currently not live)
King scallops are more genetically diverse than we are? The Roesel’s bush cricket’s genome is four times the size of ours? These are just some of the findings made by scientists at the Wellcome Sanger Institute after undertaking a project to sequence the DNA of 25 wildlife species important to the United Kingdom.
Although many of the species they selected are native to the British Isles, the implications of the research are expected to extend around the globe. The project’s first data release, the Golden Eagle genome, for instance, will impact the study of eagles in North America and elsewhere, according to Sanger Institute associate director Julia Wilson.
Wilson announced the release of the remaining 24 genomes at a 25th anniversary celebration at the Institute’s Cambridgeshire campus today.
“We have learned much through this project already and this new knowledge is flowing into many areas of our large-scale science,” Wilson said. “Now that the genomes have been read, the pieces of each species puzzle need to be put back together during genome assembly before they are made available.”
Among other questions scientists will explore with the new high-quality genomes are why some brown trout migrate to the open ocean while others don’t, and why red squirrels are vulnerable to the squirrel pox virus, yet grey squirrels can carry and spread the virus without becoming ill.
The genomes—selected by scientists to include representatives of flourishing, floundering, dangerous, iconic and cryptic species, as well as five picked by the public during a nationwide vote—were decoded using SMRT Sequencing. They will now be annotated and analyzed.
“Sequencing these species for the first time didn’t come without challenges, but our scientists and staff repeatedly came up with innovative solutions to overcome them,” Wilson said.
These challenges, documented in a blog series by 25 Genomes Project coordinator Dan Mead, included everything from acquiring specimens to “exploding flatworm goop.”
“We are already discovering the surprising secrets these species hold in their genomes,” Mead said. “Similar to when the Human Genome Project first began, we don’t know where these findings could take us.”
Whereas the first human genome took 13 years and billions of dollars to complete, the Sanger Institute was able to newly sequence 25 species’ genomes in less than one year, at a fraction of the cost. The high-quality genomes will be made freely available to scientists to use in their research.
We are honored to have been part of the effort, and extend Happy Birthday wishes to our colleagues across the pond. We look forward to another 25 years of collaboration!
With its bold blue plumage, russet throat and chipper chirps, the barn swallow is beloved by many avian enthusiasts. It’s also a favorite of scientists, becoming a flagship species for conservation biology. Numerous evolutionary and ecological studies have focused on its biology, life history, sexual selection, response to climate change, and the divergence between its eight subspecies in Europe, Asia and North America.
But the full potential of such studies has been limited by gaps in genomic data. A 2016 draft genome for the American subspecies (Hirundo rustica erythrogaster), for example, was assembled from short, paired-end reads derived from a male individual, leading to continuity gaps and a lack of information for the W chromosome, as females are the heterogametic (ZW) sex in birds.
To address such limitations, an international team of researchers from the University of Milan, University of Pavia, and California State Polytechnic University used SMRT Sequencing at the Functional Genomics Center in Zurich and Bionano optical mapping to produce a new high-quality genome assembly for the European barn swallow, Hirundo rustica rustica.
The combination led to a final 1.21 Gb assembly with a scaffold N50 value of over 25.95 Mb, representing a more than 650-fold improvement in N50 with respect to the 2016 draft genome, as reported in the pre-print “SMRT long-read sequencing and Direct Label and Stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica).”
The primary assembly’s contiguity metrics even meet the high standards of the Vertebrate Genome Project consortium “Platinum Genome” criteria (contig N50 in excess of 1 Mb and scaffold N50 above 10 Mb).
“Given the inception of large scale sequencing initiatives aiming to produce genome assemblies for a wide range of organisms, it is critical to identify combinations of sequencing and scaffolding approaches that allow the cost effective generation of genuinely high-quality genome assemblies,” the authors write.
“We believe that the data presented here, as well as attesting to the effectiveness of SMRT sequencing combined with DLS optical mapping for the assembly of vertebrate genomes, will provide an invaluable asset for population genetics studies in the barn swallow and for comparative genomics in birds,” they conclude.
The authors also identified several potential future projects based on the improved assembly, including: the phasing of the assembly to generate extended haplotypes, a more thorough gene annotation using RNA/Iso-Seq sequencing data, detailed comparisons with genome data from the American barn swallow, re-evaluation of data from previous population genetics studies conducted in this species, as well as characterization of the epigenetic landscape.
We look forward to additional reports, and will be keeping a bird’s eye view of work done with the genome.
Seeking sequencing for your plant and animal project? Check out opportunities available via our SMRT Grant Program.
Xiaochang Zhang, an assistant professor at the University of Chicago, is poised to get a powerful new data set to help his team understand the role of alternative splicing in brain development. His project, entitled “Uncovering mRNA splicing diversity in cerebral cortex development,” was selected as the winner of the 2018 Iso-Seq SMRT Grant Program. Sequencing for this project will be carried out by our Certified Service Provider RTL Genomics. We caught up with Xiaochang to learn more about his research and how SMRT Sequencing data will make a difference.
Q: What’s your research focus?
A: We are interested in the impact of alternative RNA splicing in neocortex development and disorders, and we are excited about the opportunity to use long-read sequencing to further address this question. Enormous neuronal cell diversity has been described, and it is speculated that the secret of neuronal cell diversity is partly hidden in the heterogeneity of neural progenitor cells. Post-transcriptional mRNA metabolism such as alternative splicing presents another layer of gene regulation and dramatically increases protein diversity. Indeed, work from others and us showed that alternative pre-mRNA splicing is widespread in developing mouse and human brains, and tight regulation of cell type-specific RNA splicing is required for human brain development. Characterizing mRNA isoforms with long-read sequencing will give us a unique chance to understand how the brain is built – we’re really excited about this.
Q: How have you pursued this prior to long-read sequencing?
A: We did bulk RNA sequencing with mouse brain cells and found hundreds of alternatively spliced exons between neural progenitor cells and post-mitotic neurons. We further analyzed a single-cell data set of fetal human brain cells and identified consistent RNA splicing changes between cell types. However, it is hard to obtain a full picture of alternative RNA splicing with short-read sequencing for genes that have multiple alternatively spliced exons. Long-read sequencing will be superior to uncover complex splicing isoforms.
Q: What do you hope to learn with the SMRT Sequencing data?
A: Single Molecule, Real-Time (SMRT) Sequencing can sequence single molecules of the longest human messenger RNAs. We are excited to directly detect the actual full-length mRNA isoforms among different brain cell types with SMRT Sequencing. We will compare long-read sequencing results with our current datasets, and try to uncover complex splicing isoforms that were previously unobservable. With this SMRT Grant we hope to get a better view of alternative RNA splicing in brain development.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win SMRT Sequencing. Also, thank you to our co-sponsor RTL Genomics for supporting the Iso-Seq SMRT Grant Program!
In addition to the most common applications, like whole genome sequencing for de novo assembly, there are several other applications you can use to advance your own science or add to the PacBio services you offer your customers. Here’s a sampling of the most recent updates and releases.
Iso-Seq Analysis for Genome Annotation or Targeted Isoform Discovery
The isoform sequence (Iso-Seq) application generates full-length cDNA sequences – from the 5’ end of transcripts to the poly-A tail – eliminating the need for transcriptome reconstruction using isoform-inference algorithms. It’s even easier to help your customers annotate their genomes or perform isoform discovery with full-length transcripts now that diffusion loading is supported for Iso-Seq projects. (For more information on switching to diffusion loading for Iso-Seq analysis projects, please contact your local FAS.)
Multiplexing for Bacterial Whole Genome Assembly
A new solution for multiplexed bacterial whole genome sequencing on the Sequel System is now available, enabling pooling of as many as 16 samples with a combined genome size of up to 30 Mb. With two new barcoded adaptor kits, a run setup calculator, and data analysis workflow, it’s now fast and easy for your customers to generate multiple high-quality bacterial genomes in a single Sequel System experiment.
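As a quick illustration of the two pooling limits mentioned above, here is a minimal Python sketch that checks whether a proposed pool stays within 16 samples and 30 Mb of combined genome size. This is not the PacBio run setup calculator; the function name `pool_fits` and the example genome sizes are hypothetical and only meant to show the arithmetic.

```python
from typing import Sequence

# Limits quoted in the post above; everything else here is illustrative.
MAX_SAMPLES = 16             # samples per multiplexed pool
MAX_TOTAL_GENOME_MB = 30.0   # combined genome size per pool, in Mb


def pool_fits(genome_sizes_mb: Sequence[float]) -> bool:
    """Return True if a pool respects both the sample-count and total-size limits."""
    if not genome_sizes_mb:
        return False  # an empty pool is not a usable experiment
    if any(size <= 0 for size in genome_sizes_mb):
        raise ValueError("genome sizes must be positive")
    return (
        len(genome_sizes_mb) <= MAX_SAMPLES
        and sum(genome_sizes_mb) <= MAX_TOTAL_GENOME_MB
    )


if __name__ == "__main__":
    # Hypothetical genome sizes in Mb.
    print(pool_fits([5.0, 4.6, 2.8, 6.3]))  # True: 4 samples, under 30 Mb
    print(pool_fits([5.0] * 8))             # False: 8 samples, but 40 Mb in total
```

A pool can fail on either limit: many small genomes may exceed the sample count, while a few large ones may exceed the combined size.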
Structural Variant Detection with Low-Fold Coverage Sequencing
The PacBio SV application provides high-sensitivity detection of structural variants in human genomes with modest coverage and a low false discovery rate. These larger variant types are typically missed with short-read methods but are known to cause disease. A simple library prep, using a modest amount (~3 µg) of unamplified genomic DNA from a blood sample, is effective for gene discovery in rare and Mendelian disease as well as broader population-scale SV characterization.
When creating a global genomic ark of creatures great and small, scientists are turning to the comprehensive coverage and quality of PacBio sequencing.
The Vertebrate Genomes Project (VGP), an international consortium of more than 150 scientists from 50 academic, industry and government institutions in 12 countries, recently released the first 15 of an anticipated 66,000 high-quality reference genomes representing all vertebrate species on Earth.
The VGP consortium spent three years selecting technologies and workflows to produce higher quality, “platinum-level” genomes, and SMRT Sequencing was selected to generate the initial assemblies.
“Until recently, sequencing the complete genome of a single animal required millions of dollars and years of effort. New sequencing technologies have dramatically reduced the cost and made it possible to reconstruct near-perfect genomes for the first time,” said VGP member Adam Phillippy of the Genome Informatics Section at the National Human Genome Research Institute.
From the duck-billed platypus to the two-lined caecilian, a limbless, serpentine amphibian, the first data release represents species from all five vertebrate classes – mammals, birds, reptiles, amphibians, and fishes.
The first phase of the project will continue with the sequencing of at least one species representing each of the 260 orders of living vertebrates. Subsequent sequencing will cover all 1,045 families, then 9,478 genera, and ultimately all of the approximately 66,000 species of vertebrates.
“The last 20 years have proven the value of openly available high-quality reference genome sequences to scientific research, but until now these have mostly been available just for humans and other key organisms,” said Richard Durbin, of the University of Cambridge and the Wellcome Sanger Institute. “We are entering an era in which we will obtain reference genome sequences for all species across the Tree of Life.”
VGP is one of many large-scale international projects to sequence the DNA of thousands of plant, animal, fungal and bacterial species that have chosen PacBio Single Molecule, Real-Time (SMRT) Sequencing to assemble some of the most complete genomes to date. These comprehensive catalogs of genetic code provide valuable resources to researchers in their quest to understand the biology, physiology, development and evolution of a multitude of living organisms, and will aid in their conservation.
Another is the Bat1K initiative, an effort by Sonja Vernes of the Max Planck Institute and others to catalog the genetic diversity in 1,300 types of bats.
“The long-read sequencing technology from PacBio is allowing us to produce bat genomes of unprecedented quality and resolution as part of the Bat1K project,” said Vernes. “This is going to be a big step forward for understanding how the genes and also the non-coding DNA in these genomes influence the weird and wonderful features of bats.”
Other projects include:
- The Bird 10,000 Genomes (B10K) Project, which is aiming to generate representative draft genome sequences from all extant bird species; many of its members became founders of the Genome 10,000 consortium (G10K), which evolved into the Vertebrate Genomes Project;
- Efforts to sequence nationally significant species, such as the Sanger 25 Project by the Wellcome Trust Sanger Institute and the Canada 150 Sequencing Initiative (CanSeq150) by Canada’s Genomics Enterprise;
- The NCTC 3000 initiative by the UK’s National Collection of Type Cultures to sequence the genomes of 3,000 strains of bacteria;
- Whole Genome Assembly of the Maize NAM Founders, a multi-institutional effort to create a 26-line pangenome maize reference collection, one of many initiatives to sequence important agricultural crops to discover and utilize novel genes, traits and/or genomic regions for crop improvement and basic research;
- The Pan-Genome Analysis of Sorghum project at the Donald Danforth Plant Science Center, which includes 15 sorghum lines covering the diversity of this important bioenergy, food, and feed crop. The project is supported through the Community Science Program (CSP) of the DOE Joint Genome Institute with PacBio sequencing at HudsonAlpha Institute for Biotechnology.
- The Open Green Genomes Initiative, also supported by DOE Joint Genome Institute, which will generate high-quality genome assemblies and annotations for 35 species representing all major evolutionary lineages in the land plant tree of life.
- The Functional Annotation of Animal Genomes Project (FAANG), which is aiming to produce comprehensive maps of functional elements in the genomes of domesticated animal species;
- Marine and aquaculture efforts such as The Aqua-Hundred Genome Project;
- Insect initiatives, including the i5k Project to sequence 5,000 arthropod genomes and The Global Ant Genomics Alliance (GAGA) to sequence 200 ant species.
If you’re interested in supporting this important effort, the group is soliciting donations for ongoing project support.
Many people who run a sequencing core lab would prefer to focus on science instead of business, but all core lab managers know that it’s imperative to keep a steady stream of clients and projects filling the pipeline. In a recent blog post we offered 5 ways to attract more customers to your sequencing services. Now let’s take a look at how you can incorporate new services and upgrades into your facility.
Keeping up with the latest and greatest advancements in sequencing technology isn’t just about the sequencing instruments. Companies like PacBio regularly release instrument improvements, new chemistries, software features, and new applications for their sequencing platforms. Making sure that you are running the latest chemistries and supporting the newest features will help your lab continue to generate the best results for your customers. Here are several ways you can keep up to date with all things PacBio.
- Keep in close contact with your local FAS
The local Field Application Scientist (FAS) who trained your team and whom you call with questions is the same person who can give you real-time information on the newest releases and applications. He or she can give you the in-depth training to get started offering a new service or upgrading your current software.
- Join the Certified Service Provider Program
As a PacBio Certified Service Provider (CSP), you can take advantage of benefits that other providers cannot. Benefits include preferential consideration for early access to, and sometimes even beta testing of, new features and applications. In addition, you’ll have quarterly check-ins with the PacBio team for the latest updates and information about the products we offer. Find out more about joining our CSP Program.
- Connect with us digitally
We try to deliver a steady, but not overwhelming, stream of the latest information about the uses for SMRT Sequencing across multiple channels. From our market area newsletters (Plant and Animal, Human Biomedical, Microbial) to the snappy one-liners on Twitter, there’s a mix of communication out there perfect for keeping you informed. Subscribe to our blog, follow us on Twitter, Medium, and LinkedIn, and sign up for updates to make sure you’re getting all the latest news delivered to you as it happens.
- Attend PacBio events
Throughout the year we host a series of User Group Meetings all over the world with the goal of bringing together our customers, end users of SMRT Sequencing, and anyone else interested in learning more. These multi-day events consist of updates from PacBio staff as well as cool biological stories from many different labs covering a variety of applications. Because of the smaller nature of these events compared to large industry conferences, a lot of individual information exchange occurs and collaborations are formed. Check out our upcoming events – we hope to see you at the next one!
In an exciting paper that made the cover of Genome Research, scientists from Cold Spring Harbor Laboratory and collaborating institutions report the genome sequence and transcriptome of a commonly used breast cancer cell line. They determined that the cell line harbors far more structural variants than previously thought, with results that call into question cancer genome analysis based solely on short-read sequencing data.
In “Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line,” lead author Maria Nattestad, senior author Michael Schatz, and collaborators describe an in-depth investigation of SK-BR-3, an important model for HER2-positive breast cancer. “SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported,” they write, explaining their choice of PacBio long-read sequencing to conduct a new genomic and transcriptomic analysis of the cell line.
Investigating genomic instability is essential to understanding cancer, but attempts to do so using short-read sequencing have seen limited success due to challenges in detecting structural variation. Even large-scale cancer projects “have performed somewhat limited analysis of structural variations, as both the false positive rate and the false negative rate for detecting structural variants from short reads are reported to be 50% or more,” Nattestad et al. report. “Furthermore, the variations that are detected are rarely close enough to determine whether they occur in phase on the same molecule, limiting the analysis of how the overall chromosome structure has been altered.”
With the goal of creating a comprehensive map of structural variations in cancer, scientists sequenced the SK-BR-3 genome using SMRT Sequencing. To enable comparison between sequencing technologies, they also used a short-read technology. The team found that PacBio data was more mappable: more than 90% of PacBio reads aligned with a mapping quality of 60, while just 69% of short reads did the same. “We also observed a smaller GC bias in the PacBio sequencing compared to the Illumina sequence data,” they note, “which enables more robust copy number analysis and generally better variant detection overall.”
An analysis of variants showed that long-read sequencing detected more than 17,000 structural variants of at least 50 bp in length, while the short-read data yielded only about 4,100, a difference that could largely be attributed to the lack of insertions called in the latter data set. This closely mirrors the results of researchers working on population-specific reference genomes.
The scientists coupled their genomic variant discovery with the Iso-Seq method to capture full-length transcripts from SK-BR-3, noting that short-read data often cannot span or accurately reconstruct entire isoforms. “Long reads overcome such limitations by spanning multiple exon junctions and often covering complete transcripts,” they explain. Within the transcriptome analysis, the team closely examined several gene fusions. Some of the gene fusions were found to be the product of two or three rearrangement events occurring in sequence. For example, “CYTH1-EIF3H had been discovered previously with RNA-seq and been validated with RT-PCR, but it was not known to be a ‘2-hop’ gene fusion (taking place through a series of two variants) until now,” the scientists report. “This fusion was also captured in full by several individual SMRT-seq reads that contain both variants and have alignments in both genes.” The authors also report finding direct evidence that a gene fusion previously thought to be the result of a 2-hop path is actually a 3-hop fusion.
One detailed illustration of the careful analysis performed for this project involved the ERBB2 oncogene, which is also called HER2. “We discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression,” the team writes. They were able to “reconstruct the progression of rearrangements resulting in the amplification of the ERBB2 oncogene, including a previously unrecognized inverted duplication spanning a large portion of the region.”
“Long-read sequencing can expose complex variants with great certainty and context, suggesting that more multi-hop gene fusions, inverted duplications, and complex events may be found in other cancer genomes,” the scientists conclude. “There may be many other types of complex variations present in other cancer genomes that were not found in SK-BR-3, so it is essential to continue building a catalogue of these variant types using the best available technologies.”
A map of every individual’s genome will soon be possible, but how will we know if it is correct? Benchmarks are needed in order to check the performance of sequencing, and any genomes used for such a purpose should be comprehensive and well characterized.
Enter the Genome in a Bottle Project (GIAB), a consortium of geneticists and bioinformaticians committed to the creation and sharing of high-quality reference genomes. Unlike other initiatives, such as the 1000 Genomes Project, that are seeking to sequence many representatives of different populations, GIAB is interested in sequencing just a few individuals, but deeply and with multiple technologies. Formed in 2012, the consortium has to date released data for five individuals, including an Ashkenazi Jewish family trio.
GIAB has made great progress with characterizing small variants, such as SNPs and indels. However, as project co-leader Justin Zook explained in a presentation at the Labroots Genetics and Genomics conference, much work remains to be done. Zook estimates that the current GIAB truth sets, based mostly on short-read sequencing, miss 200,000 variants in tandem repeats and homopolymers. Further, the vast majority of medium and large variants are missed: over 75% for indels 15-50 bp and 99% for structural variants >50 bp.
“The representation of these variants is poorly standardized, and that’s especially true once you get to more complex changes that occur in repetitive regions,” Zook said. “And tools to do the comparisons for structural variants are really in their infancy.”
The solution? New technologies like accurate, long reads from SMRT Sequencing, and new variant callers, especially those based on de novo assembly. Zook, a scientist at the National Institute of Standards and Technology, and the GIAB consortium are currently applying these techniques to build benchmark sets of structural variants. Using PacBio long reads, the GIAB consortium has expanded its structural variant callset from only a few hundred variants to over 20,000.
“When we’re trying to characterize the structural variation in long, repetitive regions, or in places where there are large insertions, it’s been really useful to have long-read information,” Zook said. “Long reads are also really useful for phasing variants, and it looks like they’ll be really useful for characterizing variants in difficult-to-map regions,” he added.
In addition to providing completed benchmark reference genomes, GIAB also releases datasets; a 2016 release included 12 datasets based on seven genomes, compiled by 51 authors from 14 institutions.
Zook said new public long-read datasets are coming. The data in development includes SMRT Sequencing of a Chinese trio (in collaboration with the Icahn School of Medicine), and a deeper dive into the genomes of the Ashkenazi Jewish son and mother of the originally released trio set.
Genetics is not only key to discovering and tracing new traits in an organism, but also conserving old ones — and in some cases, the species itself.
A deep understanding of genetic variation within and among species can be used to reconstruct their evolutionary history, to examine their contemporary status, and to predict the future effects of management strategies.
With this in mind, scientists at the UK’s Wellcome Sanger Institute were keen to incorporate endangered species among 25 genomes to be sequenced as part of a project to mark its 25-year anniversary, and the first assembly to be released is that of the golden eagle.
The assembly, the first high-quality reference genome of the iconic bird, was generated in partnership with the University of Edinburgh’s Royal (Dick) School of Veterinary Studies using SMRT Sequencing technology and is expected to be an excellent resource for international eagle conservation efforts.
The genetic information provided by the genomic map will further our understanding of the diversity and viability of golden eagles, bald eagles and other species worldwide, and could help in the identification of populations or individuals best suited for reintroduction projects.
Although the golden eagle is not considered threatened on a global scale, the species has experienced sharp population declines in some areas, including the United Kingdom and parts of the United States. Urbanization, agricultural development, and changes in wildfire regimes have compromised nesting and hunting grounds in southern California and in the sagebrush steppes of the inner West, for example, and there are only 508 breeding pairs of golden eagles in the UK, largely restricted to the Scottish Highlands and Islands.
The South of Scotland Golden Eagle Project is among many initiatives expected to benefit from the genomic information.
“With the golden eagle genome sequence, we will be able to compare the eagles being relocated to southern Scotland to those already in the area to ensure we are creating a genetically diverse population,” said project director Rob Ogden, Head of Conservation Genetics at the University of Edinburgh. “We will also be able to start investigating the biological effects of any genetic differences that we detect, not only within the Scottish population, but worldwide.”
Megan Judkins, an adjunct faculty member at Oklahoma State University and interim director of the Grey Snow Eagle House, a tribally run and operated rehabilitation and research facility of bald and golden eagles, said the new European golden eagle genome will provide essential information for learning about this species from a worldwide perspective, as previously sequenced golden eagle genomes were from the North American population.
“Having this new tool could help us reveal more about their genetic diversity and provide insight into the subspecies that are thought to exist, but are not substantiated with genomic data,” Judkins said. “Furthermore, as it is thought that the overall golden eagle population in the United States is stable at best, with some populations facing significant declines from anthropogenic stressors, conservation tools such as this are essential for best management practices.”
Bird’s Eye View
Other endangered bird species have also been given the PacBio treatment. In some cases, entire populations of critically endangered species are having their genomes sequenced. As highlighted recently on Medium, the Kākāpō 125 Project has begun sequencing the remaining 148 members of the rare, flightless New Zealand parrot, and the ‘alalā crow is also being comprehensively profiled in an effort to save the remaining 140 members of the Hawaiian species and to boost efforts to breed more. High-quality PacBio reference genomes have been essential to both projects.
We have also partnered with several multi-institutional projects striving to create larger genomic databases of high-quality, comprehensive assemblies of animal species, including:
- The Vertebrate Genomes Project – An effort to sequence 66,000 species
- The Earth BioGenome Project – A moonshot for biology which aims to sequence, catalog and characterize the genomes of all eukaryotic biodiversity in 10 years
- Bat1K – A project to sequence the genomes of all 1,300 species of bat
- Functional Annotation of Animal Genomes (FAANG) – An effort to produce comprehensive maps of functional elements in the genomes of domesticated animal species
What’s in a name? Too much, it turns out, when it comes to the taxonomy of yeast.
Scientists from University College Dublin have found that two distinctly named species of yeast are in fact 99.6% identical at the base pair level, and collinear. In other words, they are the same species.
It was a bit of a shock, especially considering one of the yeast species, Pichia kudriavzevii, is commonly used in food production and classified by the US FDA as “generally recognized as safe,” while the other, Candida krusei, is known to be drug-resistant and able to cause opportunistic infections in humans.
“The existence of multiple names for this species has almost certainly impeded research into it,” the researchers write. “We suggest that P. kudriavzevii should be the only name used in future.”
Their study, published in PLOS Pathogens, highlights the importance of gathering comprehensive genetic data of organisms.
The Irish team, led by Kenneth H. Wolfe and first author Alexander P. Douglass, is the first to sequence the type strain of C. krusei. Genome sequences had been published previously for four P. kudriavzevii strains and one C. krusei clinical isolate, but they were highly fragmented, and none of them provided a chromosome-level assembly or transcriptome-based annotation.
The researchers produced high-quality reference genomes for a C. krusei type strain called CBS573 and the CBS5147 type strain for P. kudriavzevii. They then annotated the genomes with the help of RNA sequence data for CBS573, uncovering more than 5,100 protein-coding genes. They also re-sequenced 30 additional clinical and environmental isolates to explore the relationships between the strains and their genomic diversity.
Not only did the comprehensive assemblies clarify the genome content and structure, they uncovered some unexpected features of the genomes.
“One of the most unexpected features of the genome is the structure of its centromeres, which consist of a simple but large IR [inverted repeat]. The 99% DNA sequence identity of the 8–14 kb units that form the IRs means that centromere organization would have been difficult to deduce without long-read PacBio data,” the authors write.
The data also allowed them to take a deeper dive into a question that has been perplexing scientists in the field concerning the sexual cycle of the yeast.
When P. kudriavzevii was first described, it was reported to be able to sporulate, forming one spore per ascus, but later studies reported that the type strain of P. kudriavzevii does not mate or sporulate.
“Our discovery that this strain is triploid provides a possible explanation for its failure to sporulate, or at least its failure to produce viable spores,” the authors write.
As for implications for health and safety, the authors say the yeast should no longer be used in food processing, as it “presents a potential hazard to the health of immunocompromised workers, and potentially also to consumers.”
They suggest that the closely related, non-pathogenic Pichia species be considered as possible alternatives for some industrial applications.
Brought to the brink of extinction, the future of Hawaii’s only lineage of the crow family (Corvidae) is looking up thanks to intensive conservation genomics efforts using PacBio de novo assemblies.
In Hawaiian mythology, the ‘alalā is said to lead souls to their final resting place on the cliffs of Ka Lae, the southernmost tip on the Big Island of Hawaii. As one of the largest native bird populations, it also had a vital role in the ecosystem, helping to disperse and germinate seeds of many indigenous plant species.
Disease, predators and shrinking habitats led to a complete loss of the species in the wild. A captive breeding program led by San Diego Zoo Global managed to save nine ‘alalā and has successfully bred around 140 more to date. But the captive birds also face challenges, including low hatching success and signs of poor genetic diversity due to inbreeding, with the majority of the population linked to a single founding pair.
Not satisfied with following family trees to determine suitable mating pairs, a research team from the San Diego Zoo Institute for Conservation Research, the University of Hawaii, and other organizations produced a high-quality genome assembly based on SMRT Sequencing. The team believed a comprehensive genome assembly could provide a more detailed picture of population-level genomic diversity and genetic load of Corvus hawaiiensis, as well as more accurate estimates of molecular relatedness to guide breeding decisions. And they were right.
Led by Jolene Sutton, assistant professor at the University of Hawaii, Hilo, the team created an assembly which has provided critical insights into inbreeding and disease susceptibility. They found that the ‘alalā genome is substantially more homozygous compared with more outbred species, and created annotations for a subset of immunity genes that are likely to be important for conservation applications.
As reported in the latest issue of Genes — and featured on its cover — the quality of the assembly places it amongst the very best avian genomes assembled to date, comparable to intensively studied model systems.
“Such genome-level data offer unprecedented precision to examine the causes and genetic consequences of population declines, and to apply these results to conservation management,” the authors state. “Although pair selection and managed breeding using the pedigree has kept the inbreeding level of the ‘alalā population at a relatively low level over the past 20 years, the intensive and ongoing conservation management of the species requires a more detailed approach.”
Since the generation of the ‘alalā assembly, several projects have been initiated that rely heavily on use of the new resource, the authors state. To better understand the impact of population bottlenecks over the past 100 years, and to provide a clearer picture of how much diversity can likely be maintained into the future, the team is using targeted SNP-capture to compare genomic diversity in museum and modern ‘alalā, for example. Plans are also underway to genotype every individual ‘alalā against this new reference to further inform the choice of breeding pairs in captivity as well as the management of an ‘alalā release project started in 2017.
“Genomic data derived from our analyses are an essential component of the current and future recovery of the ‘alalā,” the authors write. “As the size of both the captive and wild ‘alalā populations continue to increase, the integration of genomic data as part of the conservation management effort will help to maximize the genetic health of the species well into the future.”
A new Nature Biotechnology publication is sending reverberations through the CRISPR and gene therapy communities. The discovery that the widely used CRISPR/Cas9 method results in far more genomic changes than previously thought — including big deletions and rearrangements — was made possible by the use of long-read SMRT Sequencing.
“Repair of double-strand breaks induced by CRISPR–Cas9 leads to large deletions and complex rearrangements” comes from Michael Kosicki, Kärt Tomberg, and Allan Bradley at the Wellcome Sanger Institute. The scientists aimed to better understand the possible universe of on-target edits (rather than the better-studied off-target effects) made in a controlled environment, starting with a 5.7 kb amplicon from the X-linked PigA locus in mouse embryonic stem cells. “Thus far, exploration of Cas9-induced genetic alterations has been limited to the immediate vicinity of the target site and distal off-target sequences, leading to the conclusion that CRISPR–Cas9 was reasonably specific,” they write.
Their findings led to a collective groan among CRISPR scientists and the businesses based on this technology. “We report significant on-target mutagenesis, such as large deletions and more complex genomic rearrangements at the targeted sites in mouse embryonic stem cells, mouse hematopoietic progenitors and a human differentiated cell line,” Kosicki et al. report. “We speculate that current assessments may have missed a substantial proportion of potential genotypes generated by on-target Cas9 cutting and repair, some of which may have potential pathogenic consequences following somatic editing of large populations of mitotically active cells.”
The heterogeneous nature of DNA repair after CRISPR edits was previously observed by Gasperini et al., who also used long-read SMRT Sequencing to get a clearer picture of editing outcomes. In both cases, choosing long-read SMRT Sequencing allowed a larger region adjacent to the intended edit site to be surveyed, uncovering unexpected changes caused by CRISPR-Cas9 cuts. A number of these changes would have been impossible to spot with short-read sequencing, such as large edits deleting an adjacent primer binding site that would have been used to check the region. “The most frequent lesions in these cells were deletions extending many kilobases up- or downstream, away from the exon,” the scientists note. “We conclude that, in most cases, loss of PigA expression was likely caused by loss of the exon, rather than damage to intronic regulatory elements.” In one case, the team even found a de novo insertion, “a perfect match to four consecutive exons derived from the Hmgn1 gene,” that they believe came from spliced, reverse-transcribed RNA.
These sweeping edits weren’t the only bad news in the paper. The scientists repeated the original experiment four times to determine whether the same edits would be seen each time and found that they were not. “Each biological replicate differed substantially, despite a large number of unique deletion events sampled, indicating that the diversity of potential deletion outcomes is vast,” they report.
The CRISPR method has been considered quite promising as a gene-editing tool to cure disease, and this publication does not suggest that the authors’ findings would necessarily derail that idea. Instead, they urge others in the field to be more comprehensive in analyzing genomes before and after the use of CRISPR for a clearer view of its effects. “Results reported here … illustrate a need to thoroughly examine the genome when editing is conducted ex vivo,” they conclude. “As genetic damage is frequent, extensive and undetectable by the short-range PCR assays that are commonly used, comprehensive genomic analysis is warranted to identify cells with normal genomes before patient administration.”
“Live every week like it’s Shark Week,” 30 Rock character Tracy Jordan once quipped to Kenneth the Page, referencing the week-long, dorsal-finned programming phenomenon that has become the Discovery Channel summer ratings mainstay.
If it involves diving deeply into the science of the maligned species, we’re all in favor. But why stop there?
On our companion long-form Medium blog, we hosted our own Marine Week to highlight recent scientific discoveries across the seas.
- In “Healthy Marine Ecosystems Rely on Their Tiniest Inhabitants,” we explore how the health of ocean habitats relies on more than the activities of our finned friends. Just as human health is proving to be linked to the microbial communities in our guts, marine health is influenced by the bacteria in its ecosystems. A group of Thai scientists are studying the marine microbiology of coral reefs in the Gulf of Thailand and the Andaman Sea to glean the role bacteria might play in the health of the habitat and its responses to environmental stressors, such as elevated seawater temperature.
- The orange clownfish, Amphiprion percula, may have been immortalized in the comedic film “Finding Nemo,” but its importance to the scientific community is no joke. In “Finding Nemo’s Genes: International Team Creates First Reference Genome of Orange Clownfish,” we visit an effort led by Tim Ravasi of King Abdullah University of Science and Technology in Saudi Arabia and Phil Munday of James Cook University in Australia to create molecular resources for one of the most important species for studying the ecology and evolution of coral reef fishes. The orange clownfish is also a model species for social organization, sex change, mutualism, habitat selection, lifespan, and predator-prey interactions.
- Aquaculture has become an increasingly important source of sustainable seafood. And similar to the city singles scene, its viability has a lot to do with sex. In “Deep in the Dating Pool,” we look at how studies into sex differentiation of two marine species, Nile tilapia (Oreochromis niloticus) and abalone (Haliotis discus hannai), can help commercial and conservation breeding efforts. Long-read sequencing and the Iso-Seq method were key to the success of these efforts by two international research groups.
- In “A Fish Tale: Tracing the Divergence of a Species,” we explore what it takes for one species to evolve into another, with medaka as the model. The medaka has been a popular pet since the 17th century because of its hardiness and pleasant coloration, but scientists are more interested in its genetics. Earlier attempts to sequence the fish’s 800 Mb genome produced lower-quality assemblies with 97,933 gaps in their sequence. So researchers at the University of Tokyo started from scratch, using Single Molecule, Real-Time (SMRT) Sequencing. This advanced technology allowed them to study difficult-to-detect centromeres and changes in DNA structure that were missing in the previous genome assemblies.
Hungry for more? Head over to bioRxiv, where a team of Japanese and American researchers, led by Shawn Burgess at the NIH’s National Human Genome Research Institute, have reported on the assembly of the goldfish (Carassius auratus) genome and the evolution of its genes after whole genome duplication. As a very close relative of the common carp (Cyprinus carpio), goldfish share the recent genome duplication that occurred approximately 14-16 million years ago in their common ancestor, and the combination of centuries of breeding and a wide array of interesting body morphologies “is an exciting opportunity to link genotype to phenotype as well as understanding the dynamics of genome evolution and speciation,” the authors state.
Generating a high-quality draft sequence of a “Wakin” goldfish using 71-fold coverage of PacBio long reads, the team identified 70,324 coding genes and more than 11,000 non-coding transcripts, and found that the two sub-genomes in goldfish retained extensive synteny and collinearity between goldfish and zebrafish. However, “ohnologous” genes were lost quickly after the carp whole-genome duplication, and the expression of 30% of the retained duplicated genes diverged significantly across the seven tissues sampled.