This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Xiaochang Zhang, an assistant professor at the University of Chicago, is poised to get a powerful new data set to help his team understand the role of alternative splicing in brain development. His project, entitled “Uncovering mRNA splicing diversity in cerebral cortex development,” was selected as the winner of the 2018 Iso-Seq SMRT Grant Program. Sequencing for this project will be carried out by our Certified Service Provider RTL Genomics. We caught up with Xiaochang to learn more about his research and how SMRT Sequencing data will make a difference.
Q: What’s your research focus?
A: We are interested in the impact of alternative RNA splicing in neocortex development and disorders, and we are excited about the opportunity to use long-read sequencing to further address this question. Enormous neuronal cell diversity has been described, and it is speculated that the secret of neuronal cell diversity is partly hidden in the heterogeneity of neural progenitor cells. Post-transcriptional mRNA metabolism such as alternative splicing presents another layer of gene regulation and dramatically increases protein diversity. Indeed, work from our group and others showed that alternative pre-mRNA splicing is widespread in developing mouse and human brains, and tight regulation of cell type-specific RNA splicing is required for human brain development. Characterizing mRNA isoforms with long-read sequencing will give us a unique chance to understand how the brain is built – we’re really excited about this.
Q: How have you pursued this prior to long-read sequencing?
A: We did bulk RNA sequencing with mouse brain cells and found hundreds of alternatively spliced exons between neural progenitor cells and post-mitotic neurons. We further analyzed a single-cell data set of fetal human brain cells and identified consistent RNA splicing changes between cell types. However, it is hard to obtain a full picture of alternative RNA splicing with short-read sequencing for genes that have multiple alternatively spliced exons. Long-read sequencing will be better suited to uncovering complex splicing isoforms.
Q: What do you hope to learn with the SMRT Sequencing data?
A: Single Molecule, Real-Time (SMRT) Sequencing can sequence single molecules of the longest human messenger RNAs. We are excited to directly detect the actual full-length mRNA isoforms among different brain cell types with SMRT Sequencing. We will compare long-read sequencing results with our current datasets, and try to uncover complex splicing isoforms that were previously unobservable. With this SMRT Grant we hope to get a better view of alternative RNA splicing in brain development.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win SMRT Sequencing. Also, thank you to our co-sponsor RTL Genomics for supporting the Iso-Seq SMRT Grant Program!
In addition to the most common applications, like whole genome sequencing for de novo assembly, there are several other applications you can use to advance your science or add to the range of PacBio services you offer your customers. Here’s a sampling of the most recent updates and releases.
Iso-Seq Analysis for Genome Annotation or Targeted Isoform Discovery
The isoform sequence (Iso-Seq) application generates full-length cDNA sequences – from the 5’ end of transcripts to the poly-A tail – eliminating the need for transcriptome reconstruction using isoform-inference algorithms. It’s even easier to help your customers annotate their genomes or perform isoform discovery with full-length transcripts now that diffusion loading is supported for Iso-Seq projects. (For more information on switching to diffusion loading for Iso-Seq analysis projects, please contact your local FAS.)
Multiplexing for Bacterial Whole Genome Assembly
A new solution for multiplexed bacterial whole genome sequencing on the Sequel System is now available, enabling pooling of as many as 16 samples that total up to 30 Mb of genomes. With two new barcoded adaptor kits, a run setup calculator, and data analysis workflow, it’s now fast and easy for your customers to generate multiple high-quality bacterial genomes in a single Sequel System experiment.
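The two pooling limits quoted above (at most 16 samples, at most 30 Mb of combined genome size) can be checked before planning a run. The sketch below is only an illustration of that arithmetic; it is not the PacBio run setup calculator, and the function name, input format, and sample names are hypothetical.

```python
# Pooling limits as stated in the text above.
MAX_SAMPLES = 16       # maximum number of samples in one pool
MAX_TOTAL_MB = 30.0    # maximum combined genome size, in megabases


def check_pool(genome_sizes_mb):
    """Check a proposed pool against the two stated limits.

    genome_sizes_mb: dict mapping a sample name to its estimated
    genome size in Mb (a hypothetical input format).
    Returns (ok, reasons), where reasons lists each limit exceeded.
    """
    reasons = []
    if len(genome_sizes_mb) > MAX_SAMPLES:
        reasons.append(
            f"{len(genome_sizes_mb)} samples exceeds the limit of {MAX_SAMPLES}"
        )
    total_mb = sum(genome_sizes_mb.values())
    if total_mb > MAX_TOTAL_MB:
        reasons.append(
            f"{total_mb:.1f} Mb combined exceeds the limit of {MAX_TOTAL_MB:.0f} Mb"
        )
    return (not reasons, reasons)


if __name__ == "__main__":
    # Illustrative genome sizes, not measured values.
    pool = {"isolate_A": 5.0, "isolate_B": 4.6, "isolate_C": 2.8}
    print(check_pool(pool))
```

A pool can fail on either limit independently: six 5.5 Mb genomes stay under the sample limit but exceed 30 Mb in total.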
Structural Variant Detection with Low-Fold Coverage Sequencing
The PacBio SV application provides high-sensitivity detection of structural variants in human genomes with modest coverage and a low false discovery rate. These larger variant types are typically missed with short-read methods but are known to cause disease. A simple library prep, using a modest amount (~3 µg) of unamplified genomic DNA from a blood sample, is effective for gene discovery in rare and Mendelian disease as well as broader population-scale SV characterization.
When creating a global genomic ark of creatures great and small, scientists are turning to the comprehensive coverage and quality of PacBio sequencing.
The Vertebrate Genomes Project (VGP), an international consortium of more than 150 scientists from 50 academic, industry and government institutions in 12 countries, recently released the first 15 of an anticipated 66,000 high-quality reference genomes representing all vertebrate species on Earth.
The VGP consortium spent three years selecting technologies and workflows to produce higher quality, “platinum-level” genomes, and SMRT Sequencing was selected to generate the initial assemblies.
“Until recently, sequencing the complete genome of a single animal required millions of dollars and years of effort. New sequencing technologies have dramatically reduced the cost and made it possible to reconstruct near-perfect genomes for the first time,” said VGP member Adam Phillippy of the Genome Informatics Section at the National Human Genome Research Institute.
From the duck-billed platypus to the limbless serpentine amphibian Two-lined caecilian, the first data release represents species from all five vertebrate classes – mammals, birds, reptiles, amphibians, and fishes.
The first phase of the project will continue with the sequencing of at least one species representing each of the 260 orders of living vertebrates. Subsequent sequencing will cover all 1,045 families, then 9,478 genera, and ultimately all of the approximately 66,000 species of vertebrates.
“The last 20 years have proven the value of openly available high-quality reference genome sequences to scientific research, but until now these have mostly been available just for humans and other key organisms,” said Richard Durbin, of the University of Cambridge and the Wellcome Sanger Institute. “We are entering an era in which we will obtain reference genome sequences for all species across the Tree of Life.”
VGP is one of many large-scale international projects to sequence the DNA of thousands of plant, animal, fungal and bacterial species that have chosen PacBio Single Molecule, Real-Time (SMRT) Sequencing to assemble some of the most complete genomes to date. These comprehensive catalogs of genetic code provide valuable resources to researchers in their quest to understand the biology, physiology, development and evolution of a multitude of living organisms, and will aid in their conservation.
Another is the Bat1K initiative, an effort by Sonja Vernes of the Max Planck Institute and others to catalog the genetic diversity in 1,300 types of bats.
“The long-read sequencing technology from PacBio is allowing us to produce bat genomes of unprecedented quality and resolution as part of the Bat1K project,” said Vernes. “This is going to be a big step forward for understanding how the genes and also the non-coding DNA in these genomes influence the weird and wonderful features of bats.”
Other projects include:
- The Bird 10,000 Genomes (B10K) Project, which is aiming to generate representative draft genome sequences from all extant bird species; many of its members became founders of the Genome 10,000 consortium (G10K), which evolved into the Vertebrate Genomes Project;
- Efforts to sequence nationally significant species, such as the Sanger 25 Project by the Wellcome Trust Sanger Institute and the Canada 150 Sequencing Initiative (CanSeq150) by Canada’s Genomics Enterprise;
- The NCTC 3000 initiative by the UK’s National Collection of Type Cultures to sequence the genomes of 3,000 strains of bacteria;
- Whole Genome Assembly of the Maize NAM Founders, a multi-institutional effort to create a 26-line pangenome maize reference collection, one of many initiatives to sequence important agricultural crops to discover and utilize novel genes, traits and/or genomic regions for crop improvement and basic research;
- The Pan-Genome Analysis of Sorghum project at the Donald Danforth Plant Science Center, which includes 15 sorghum lines covering the diversity of this important bioenergy, food, and feed crop. The project is supported through the Community Science Program (CSP) of the DOE Joint Genome Institute with PacBio sequencing at HudsonAlpha Institute for Biotechnology.
- The Open Green Genomes Initiative, also supported by DOE Joint Genome Institute, which will generate high-quality genome assemblies and annotations for 35 species representing all major evolutionary lineages in the land plant tree of life.
- The Functional Annotation of Animal Genomes Project (FAANG), which is aiming to produce comprehensive maps of functional elements in the genomes of domesticated animal species;
- Marine and aquaculture efforts such as The Aqua-Hundred Genome Project;
- Insect initiatives, including the i5k Project to sequence 5,000 arthropod genomes and The Global Ant Genomics Alliance (GAGA) to sequence 200 ant species.
If you’re interested in supporting this important effort, the group is soliciting donations for ongoing project support.
Many people who run a sequencing core lab would prefer to focus on science instead of business, but all core lab managers know that it’s imperative to keep a steady stream of clients and projects filling the pipeline. In a recent blog post we offered 5 ways to attract more customers to your sequencing services. Now let’s take a look at how you can incorporate new services and upgrades into your facility.
Keeping up with the latest and greatest advancements in sequencing technology isn’t just about the sequencing instruments. Companies like PacBio regularly release instrument improvements, new chemistries, software features, and new applications for their sequencing platforms. Making sure that you are running the latest chemistries and supporting the newest features will help your lab continue to generate the best results for your customers. Here are several ways you can keep up to date with all things PacBio.
- Keep in close contact with your local FAS
The local Field Application Scientist (FAS) who trained your team and whom you call with questions is the same person who can give you real-time information on the newest releases and applications. He or she can give you the in-depth training to get started offering a new service or upgrading your current software.
- Join the Certified Service Provider Program
As a PacBio Certified Service Provider (CSP), you can take advantage of benefits that other providers cannot. Benefits include preferential consideration for early access to, and sometimes even beta testing of, new features and applications. In addition, you’ll have quarterly check-ins with the PacBio team for the latest updates and information about the products we offer. Find out more about joining our CSP Program.
- Connect with us digitally
We try to deliver a steady, but not overwhelming, stream of the latest information about the uses for SMRT Sequencing across multiple channels. From our market area newsletters (Plant and Animal, Human Biomedical, Microbial) to the snappy one-liners on Twitter, there’s a mix of communication out there perfect for keeping you informed. Subscribe to our blog, follow us on Twitter, Medium, and LinkedIn, and sign up for updates to make sure you’re getting all the latest news delivered to you as it happens.
- Attend PacBio events
Throughout the year we host a series of User Group Meetings all over the world with the goal of bringing together our customers, end users of SMRT Sequencing, and anyone else interested in learning more. These multi-day events consist of updates from PacBio staff as well as cool biological stories from many different labs covering a variety of applications. Because these events are smaller than large industry conferences, a lot of one-on-one information exchange occurs and collaborations are formed. Check out our upcoming events – we hope to see you at the next one!
In an exciting paper that made the cover of Genome Research, scientists from Cold Spring Harbor Laboratory and collaborating institutions report the genome sequence and transcriptome of a commonly used breast cancer cell line. They determined that the cell line harbors far more structural variants than previously thought, and their results call into question cancer genome analysis based solely on short-read sequencing data.
In “Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line,” lead author Maria Nattestad, senior author Michael Schatz, and collaborators describe an in-depth investigation of SK-BR-3, an important model for HER2-positive breast cancer. “SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported,” they write, explaining their choice of PacBio long-read sequencing to conduct a new genomic and transcriptomic analysis of the cell line.
Investigating genomic instability is essential to understanding cancer but attempts to do so using short-read sequencing have seen limited success due to challenges in detecting structural variation. Even large-scale cancer projects “have performed somewhat limited analysis of structural variations, as both the false positive rate and the false negative rate for detecting structural variants from short reads are reported to be 50% or more,” Nattestad, et al. report. “Furthermore, the variations that are detected are rarely close enough to determine whether they occur in phase on the same molecule, limiting the analysis of how the overall chromosome structure has been altered.”
With the goal of creating a comprehensive map of structural variations in cancer, scientists sequenced the SK-BR-3 genome using SMRT Sequencing. To enable comparison between sequencing technologies, they also used a short-read technology. The team found that PacBio data was more mappable: more than 90% of PacBio reads aligned with a mapping quality of 60, while just 69% of short reads did the same. “We also observed a smaller GC bias in the PacBio sequencing compared to the Illumina sequence data,” they note, “which enables more robust copy number analysis and generally better variant detection overall.”
An analysis of variants showed that long-read sequencing detected more than 17,000 structural variants of at least 50 bp in length, while the short-read data yielded only about 4,100, a difference that could largely be attributed to the lack of insertions called in the latter data set. This closely mirrors the results of researchers working on population-specific reference genomes.
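To make the comparison concrete, the tally behind numbers like these can be sketched as a filter on variant length followed by a count per variant type. This is an illustration only, not the authors' pipeline: the `(type, length)` record format, the type labels, and the use of signed lengths (negative for deletions) are all assumptions made for the example.

```python
from collections import Counter

# Size threshold stated in the text: structural variants of at least 50 bp.
MIN_SV_LENGTH = 50


def count_structural_variants(calls, min_length=MIN_SV_LENGTH):
    """Count calls per variant type, keeping only those >= min_length bp.

    calls: iterable of (sv_type, length) pairs, a hypothetical record
    format. Lengths may be signed (assumed negative for deletions),
    so the absolute value is compared against the threshold.
    """
    counts = Counter()
    for sv_type, length in calls:
        if abs(length) >= min_length:
            counts[sv_type] += 1
    return counts


if __name__ == "__main__":
    # Made-up example calls, not data from the study.
    example_calls = [("INS", 120), ("DEL", -75), ("INS", 30), ("DEL", 50)]
    print(count_structural_variants(example_calls))
```

Comparing the per-type counts from two call sets side by side is one simple way to see where a gap comes from, such as the shortage of insertion calls described above.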
The scientists coupled their genomic variant discovery with the Iso-Seq method to capture full-length transcripts from SK-BR-3, noting that short-read data often cannot span or accurately reconstruct entire isoforms. “Long reads overcome such limitations by spanning multiple exon junctions and often covering complete transcripts,” they explain. Within the transcriptome analysis, the team closely examined several gene fusions. Some of the gene fusions were found to be the product of two or three rearrangement events occurring in sequence. For example, “CYTH1-EIF3H had been discovered previously with RNA-seq and been validated with RT-PCR, but it was not known to be a ‘2-hop’ gene fusion (taking place through a series of two variants) until now,” the scientists report. “This fusion was also captured in full by several individual SMRT-seq reads that contain both variants and have alignments in both genes.” The authors also report finding direct evidence that a gene fusion previously thought to be the result of a 2-hop path is actually a 3-hop fusion.
One detailed illustration of the careful analysis performed for this project involved the ERBB2 oncogene, which is also called HER2. “We discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression,” the team writes. They were able to “reconstruct the progression of rearrangements resulting in the amplification of the ERBB2 oncogene, including a previously unrecognized inverted duplication spanning a large portion of the region.”
“Long-read sequencing can expose complex variants with great certainty and context, suggesting that more multi-hop gene fusions, inverted duplications, and complex events may be found in other cancer genomes,” the scientists conclude. “There may be many other types of complex variations present in other cancer genomes that were not found in SK-BR-3, so it is essential to continue building a catalogue of these variant types using the best available technologies.”
A map of every individual’s genome will soon be possible, but how will we know if it is correct? Benchmarks are needed in order to check the performance of sequencing, and any genomes used for such a purpose should be comprehensive and well characterized.
Enter the Genome in a Bottle Project (GIAB), a consortium of geneticists and bioinformaticians committed to the creation and sharing of high-quality reference genomes. Unlike other initiatives, such as the 1000 Genomes Project, that are seeking to sequence many representatives of different populations, GIAB is interested in sequencing just a few individuals, but deeply and with multiple technologies. Formed in 2012, the consortium has to date released data for five individuals, including an Ashkenazi Jewish family trio.
GIAB has made great progress with characterizing small variants, such as SNPs and indels. However, as project co-leader Justin Zook explained in a presentation at the Labroots Genetics and Genomics conference, much work remains to be done. Zook estimates that the current GIAB truth sets, based mostly on short-read sequencing, miss 200,000 variants in tandem repeats and homopolymers. Further, the vast majority of medium and large variants are missed: over 75% for indels 15-50 bp and 99% for structural variants >50 bp.
“The representation of these variants is poorly standardized, and that’s especially true once you get to more complex changes that occur in repetitive regions,” Zook said. “And tools to do the comparisons for structural variants are really in their infancy.”
The solution? New technologies like accurate, long reads from SMRT Sequencing, and new variant callers, especially those based on de novo assembly. Zook, a scientist at the National Institute of Standards and Technology, and the GIAB consortium are currently applying these techniques to build benchmark sets of structural variants. Using PacBio long reads, the GIAB consortium has expanded its structural variant callset from only a few hundred variants to over 20,000.
“When we’re trying to characterize the structural variation in long, repetitive regions, or in places where there are large insertions, it’s been really useful to have long-read information,” Zook said. “Long reads are also really useful for phasing variants, and it looks like they’ll be really useful for characterizing variants in difficult-to-map regions,” he added.
In addition to providing completed benchmark reference genomes, GIAB also releases datasets; a 2016 release included 12 datasets based on seven genomes, compiled by 51 authors from 14 institutions.
Zook said new public long-read datasets are coming. The data in development includes SMRT Sequencing of a Chinese trio (in collaboration with the Icahn School of Medicine), and a deeper dive into the genomes of the Ashkenazi Jewish son and mother of the originally released trio set.
Genetics is not only key to discovering and tracing new traits in an organism, but also conserving old ones — and in some cases, the species itself.
A deep understanding of genetic variation within and among species can be used to reconstruct their evolutionary history, to examine their contemporary status, and to predict the future effects of management strategies.
With this in mind, scientists at the UK’s Wellcome Sanger Institute were keen to include endangered species among the 25 genomes being sequenced to mark the institute’s 25-year anniversary. The first assembly to be released is the golden eagle.
The assembly, the first high-quality reference genome of the iconic bird, was generated in partnership with the University of Edinburgh’s Royal (Dick) School of Veterinary Studies using SMRT Sequencing technology and is expected to be an excellent resource for international eagle conservation efforts.
The genetic information provided by the genomic map will further our understanding of the diversity and viability of golden eagles, bald eagles and other species worldwide, and could help in the identification of populations or individuals best suited for reintroduction projects.
Although the golden eagle is not considered threatened on a global scale, the species has experienced sharp population declines in some areas, including the United Kingdom and parts of the United States. Urbanization, agricultural development, and changes in wildfire regimes have compromised nesting and hunting grounds in southern California and in the sagebrush steppes of the inner West, for example, and there are only 508 breeding pairs of golden eagles in the UK, largely restricted to the Scottish Highlands and Islands.
The South of Scotland Golden Eagle Project is among many initiatives expected to benefit from the genomic information.
“With the golden eagle genome sequence, we will be able to compare the eagles being relocated to southern Scotland to those already in the area to ensure we are creating a genetically diverse population,” said project director Rob Ogden, Head of Conservation Genetics at the University of Edinburgh. “We will also be able to start investigating the biological effects of any genetic differences that we detect, not only within the Scottish population, but worldwide.”
Megan Judkins, an adjunct faculty member at Oklahoma State University and interim director of the Grey Snow Eagle House, a tribally run and operated rehabilitation and research facility of bald and golden eagles, said the new European golden eagle genome will provide essential information for learning about this species from a worldwide perspective, as previously sequenced golden eagle genomes were from the North American population.
“Having this new tool could help us reveal more about their genetic diversity and provide insight into the subspecies that are thought to exist, but are not substantiated with genomic data,” Judkins said. “Furthermore, as it is thought that the overall golden eagle population in the United States is stable at best, with some populations facing significant declines from anthropomorphic stressors, conservation tools such as this are essential for best management practices.”
Bird’s Eye View
Other endangered bird species have also been given the PacBio treatment. In some cases, every remaining individual of a critically endangered species is having its genome sequenced. As highlighted recently on Medium, the Kākāpō 125 Project has begun sequencing the remaining 148 members of the rare, flightless New Zealand parrot, and the ‘alalā crow is also being comprehensively profiled in an effort to save the remaining 140 members of the Hawaiian species and to support efforts to breed more. High-quality PacBio reference genomes have been essential to both projects.
We have also partnered with several multi-institutional projects striving to create larger genomic databases of high-quality, comprehensive assemblies of animal species, including:
- The Vertebrate Genomes Project – An effort to sequence 66,000 species
- The Earth BioGenome Project – A moonshot for biology which aims to sequence, catalog and characterize the genomes of all eukaryotic biodiversity in 10 years.
- Bat1K – A project to sequence the genomes of all 1,300 species of bat
- Functional Annotation of Animal Genomes (FAANG) – An effort to produce comprehensive maps of functional elements in the genomes of domesticated animal species.
What’s in a name? Too much, it turns out, when it comes to the taxonomy of yeast.
Scientists from University College Dublin have found that two distinctly named species of yeast are in fact 99.6% identical at the base pair level, and collinear. In other words, they are the same species.
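For readers unfamiliar with the metric, a figure such as 99.6% identity at the base pair level is the share of aligned positions at which two sequences carry the same base. The sketch below shows one common way to compute it from two already-aligned sequences. It is not the method used in the study; skipping gap columns is an assumption, and other definitions count gaps as mismatches.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where the two bases match.

    aligned_a, aligned_b: sequences of equal length taken from a
    pairwise alignment, using `gap` for gap characters. Columns with
    a gap in either sequence are skipped (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == gap or base_b == gap:
            continue  # ignore gap columns under this convention
        compared += 1
        if base_a.upper() == base_b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: 5 gap-free columns, 4 of which match.
    print(percent_identity("ACGT-A", "ACGAGA"))
```

On whole genomes the same ratio is computed over millions of aligned columns, which is why collinearity, meaning the same order of sequence along the chromosomes, is reported alongside it.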
It was a bit of a shock, especially considering one of the yeast species, Pichia kudriavzevii, is commonly used in food production and classified by the US FDA as “generally recognized as safe,” while the other, Candida krusei, is known to be drug-resistant and able to cause opportunistic infections in humans.
“The existence of multiple names for this species has almost certainly impeded research into it,” the researchers write. “We suggest that P. kudriavzevii should be the only name used in future.”
Their study, published in PLOS Pathogens, highlights the importance of gathering comprehensive genetic data of organisms.
The Irish team, led by Kenneth H. Wolfe and first author Alexander P. Douglass, is the first to sequence the type strain of C. krusei. Genome sequences had been published previously for four P. kudriavzevii strains and one C. krusei clinical isolate, but they were highly fragmented, and none of them provided a chromosome-level assembly or transcriptome-based annotation.
The researchers produced high-quality reference genomes for a C. krusei type strain called CBS573 and the CBS5147 type strain for P. kudriavzevii. They then annotated the genomes with the help of RNA sequence data for CBS573, uncovering more than 5,100 protein-coding genes. They also re-sequenced 30 additional clinical and environmental isolates to explore the relationships between the strains and their genomic diversity.
Not only did the comprehensive assemblies clarify the genome content and structure, they uncovered some unexpected features of the genomes.
“One of the most unexpected features of the genome is the structure of its centromeres, which consist of a simple but large IR. The 99% DNA sequence identity of the 8–14 kb units that form the IRs means that centromere organization would have been difficult to deduce without long-read PacBio data,” the authors write.
The data also allowed them to take a deeper dive into a question that has been perplexing scientists in the field concerning the sexual cycle of the yeast.
When P. kudriavzevii was first described, it was reported to be able to sporulate, forming one spore per ascus, but later studies reported that the type strain of P. kudriavzevii does not mate or sporulate.
“Our discovery that this strain is triploid provides a possible explanation for its failure to sporulate, or at least its failure to produce viable spores,” the authors write.
As for implications for health and safety, the authors say the yeast should no longer be used in food processing, as it “presents a potential hazard to the health of immunocompromised workers, and potentially also to consumers.”
They suggest that the closely related, non-pathogenic Pichia species be considered as possible alternatives for some industrial applications.
Brought to the brink of extinction, the future of Hawaii’s only lineage of the crow family (Corvidae) is looking up thanks to intensive conservation genomics efforts using PacBio de novo assemblies.
In Hawaiian mythology, the ‘alalā is said to lead souls to their final resting place on the cliffs of Ka Lae, the southernmost tip on the Big Island of Hawaii. As one of the largest native bird populations, it also had a vital role in the ecosystem, helping to disperse and germinate seeds of many indigenous plant species.
Disease, predators and shrinking habitats led to a complete loss of the species in the wild. A captive breeding program led by San Diego Zoo Global managed to save nine ‘alalā and has successfully bred around 140 more to date. But the captive birds also face challenges, including low hatching success and signs of poor genetic diversity due to inbreeding, with the majority of the population linked to a single founding pair.
Not satisfied with following family trees to determine suitable mating pairs, a research team from the San Diego Zoo Institute for Conservation Research, the University of Hawaii, and other organizations produced a high-quality genome assembly based on SMRT Sequencing. The team believed a comprehensive genome assembly could provide a more detailed picture of population-level genomic diversity and genetic load of Corvus hawaiiensis, as well as more accurate estimates of molecular relatedness to guide breeding decisions. And they were right.
Led by Jolene Sutton, assistant professor at the University of Hawaii, Hilo, the team created an assembly that has provided critical insights into inbreeding and disease susceptibility. They found that the ‘alalā genome is substantially more homozygous than those of more outbred species, and created annotations for a subset of immunity genes that are likely to be important for conservation applications.
As reported in the latest issue of Genes — and featured on its cover — the quality of the assembly places it amongst the very best avian genomes assembled to date, comparable to intensively studied model systems.
“Such genome-level data offer unprecedented precision to examine the causes and genetic consequences of population declines, and to apply these results to conservation management,” the authors state. “Although pair selection and managed breeding using the pedigree has kept the inbreeding level of the ‘alalā population at a relatively low level over the past 20 years, the intensive and ongoing conservation management of the species requires a more detailed approach.”
Since the generation of the ‘alalā assembly, several projects have been initiated that rely heavily on use of the new resource, the authors state. To better understand the impact of population bottlenecks over the past 100 years, and to provide a clearer picture of how much diversity can likely be maintained into the future, the team is using targeted SNP-capture to compare genomic diversity in museum and modern ‘alalā, for example. Plans are also underway to genotype every individual ‘alalā against this new reference to further inform the choice of breeding pairs in captivity as well as the management of an ‘alalā release project started in 2017.
“Genomic data derived from our analyses are an essential component of the current and future recovery of the ‘alalā,” the authors write. “As the size of both the captive and wild ‘alalā populations continue to increase, the integration of genomic data as part of the conservation management effort will help to maximize the genetic health of the species well into the future.”
A new Nature Biotechnology publication is sending reverberations through the CRISPR and gene therapy communities. The discovery that the widely used CRISPR/Cas9 method results in far more genomic changes than previously thought — including big deletions and rearrangements — was made possible by the use of long-read SMRT Sequencing.
“Repair of double-strand breaks induced by CRISPR–Cas9 leads to large deletions and complex rearrangements” comes from Michael Kosicki, Kärt Tomberg, and Allan Bradley at the Wellcome Sanger Institute. The scientists aimed to better understand the possible universe of on-target edits (rather than the better-studied off-target effects) made in a controlled environment, starting with a 5.7 kb amplicon from the X-linked PigA locus in mouse embryonic stem cells. “Thus far, exploration of Cas9-induced genetic alterations has been limited to the immediate vicinity of the target site and distal off-target sequences, leading to the conclusion that CRISPR–Cas9 was reasonably specific,” they write.
Their findings led to a collective groan among CRISPR scientists and the businesses based on this technology. “We report significant on-target mutagenesis, such as large deletions and more complex genomic rearrangements at the targeted sites in mouse embryonic stem cells, mouse hematopoietic progenitors and a human differentiated cell line,” Kosicki et al. report. “We speculate that current assessments may have missed a substantial proportion of potential genotypes generated by on-target Cas9 cutting and repair, some of which may have potential pathogenic consequences following somatic editing of large populations of mitotically active cells.”
The heterogeneous nature of DNA repair after CRISPR edits was previously observed by Gasperini et al., who likewise used long-read SMRT Sequencing to get a clearer picture of editing outcomes. In both cases, choosing long-read SMRT Sequencing allowed a larger region adjacent to the intended edit site to be surveyed, uncovering unexpected changes caused by CRISPR-Cas9 cuts. A number of these changes would have been impossible to spot with short-read sequencing, such as large edits deleting an adjacent primer binding site that would have been used to check the region. “The most frequent lesions in these cells were deletions extending many kilobases up- or downstream, away from the exon,” the scientists note. “We conclude that, in most cases, loss of PigA expression was likely caused by loss of the exon, rather than damage to intronic regulatory elements.” In one case, the team even found a de novo insertion, “a perfect match to four consecutive exons derived from the Hmgn1 gene,” that they believe came from spliced, reverse-transcribed RNA.
These sweeping edits weren’t the only bad news in the paper. The scientists repeated the original experiment four times to determine whether the same edits would be seen each time and found that they were not. “Each biological replicate differed substantially, despite a large number of unique deletion events sampled, indicating that the diversity of potential deletion outcomes is vast,” they report.
The CRISPR method has been considered quite promising as a gene-editing tool to cure disease, and this publication does not suggest that the authors’ findings would necessarily derail that idea. Instead, they urge others in the field to be more comprehensive in analyzing genomes before and after the use of CRISPR for a clearer view of its effects. “Results reported here … illustrate a need to thoroughly examine the genome when editing is conducted ex vivo,” they conclude. “As genetic damage is frequent, extensive and undetectable by the short-range PCR assays that are commonly used, comprehensive genomic analysis is warranted to identify cells with normal genomes before patient administration.”
“Live every week like it’s Shark Week,” 30 Rock character Tracy Jordan once quipped to Kenneth the Page, referencing the week-long, dorsal-finned programming phenomenon that has become the Discovery Channel summer ratings mainstay.
If it involves diving deeply into the science of the maligned species, we’re all in favor. But why stop there?
On our companion long-form Medium blog, we hosted our own Marine Week to highlight recent scientific discoveries across the seas.
- In “Healthy Marine Ecosystems Rely on Their Tiniest Inhabitants,” we explore how the health of ocean habitats relies on more than the activities of our finned friends. Just as human health is proving to be linked to the microbial communities in our guts, marine health is influenced by the bacteria in its ecosystems. A group of Thai scientists are studying the marine microbiology of coral reefs in the Gulf of Thailand and the Andaman Sea to glean the role bacteria might play in the health of the habitat and its responses to environmental stressors, such as elevated seawater temperature.
- The orange clownfish, Amphiprion percula, may have been immortalized in the comedic film “Finding Nemo,” but its importance to the scientific community is no joke. In “Finding Nemo’s Genes: International Team Creates First Reference Genome of Orange Clownfish,” we visit an effort led by Tim Ravasi of King Abdullah University of Science and Technology in Saudi Arabia and Phil Munday of James Cook University in Australia, to create molecular resources for one of the most important species for studying the ecology and evolution of coral reef fishes, as well as a model species for social organization, sex change, mutualism, habitat selection, lifespan, and predator-prey interactions.
- Aquaculture has become an increasingly important source of sustainable seafood. And similar to the city singles scene, its viability has a lot to do with sex. In “Deep in the Dating Pool,” we look at how studies into sex differentiation of two marine species, Nile tilapia (Oreochromis niloticus) and abalone (Haliotis discus hannai), can help commercial and conservation breeding efforts. Long-read sequencing and the Iso-Seq method were key to the success of these efforts by two international research groups.
- In “A Fish Tale: Tracing the Divergence of a Species,” we explore what it takes for one species to evolve into another, with medaka as the model. The medaka has been a popular pet since the 17th century because of its hardiness and pleasant coloration, but scientists are more interested in its genetics. Earlier attempts to sequence the fish’s 800 Mb genome were of limited quality and left 97,933 gaps in the sequence. So researchers at the University of Tokyo started from scratch, using Single Molecule, Real-Time (SMRT) Sequencing. This advanced technology allowed them to study difficult-to-detect centromeres and changes in DNA structure that were missing in the previous genome assemblies.
Hungry for more? Head over to bioRxiv, where a team of Japanese and American researchers, led by Shawn Burgess at the NIH’s National Human Genome Research Institute, have reported on the assembly of the goldfish (Carassius auratus) genome and the evolution of its genes after whole genome duplication. As a very close relative of the common carp (Cyprinus carpio), goldfish share the recent genome duplication that occurred approximately 14-16 million years ago in their common ancestor, and the combination of centuries of breeding and a wide array of interesting body morphologies “is an exciting opportunity to link genotype to phenotype as well as understanding the dynamics of genome evolution and speciation,” the authors state.
Generating a high-quality draft sequence of a “Wakin” goldfish using 71-fold coverage of PacBio long reads, the team identified 70,324 coding genes and more than 11,000 non-coding transcripts, and found that the two sub-genomes in goldfish retained extensive synteny and collinearity between goldfish and zebrafish. However, “ohnologous” genes were lost quickly after the carp whole-genome duplication, and the expression of 30% of the retained duplicated genes diverged significantly across seven tissues sampled.
When was the last time you sent your DNA off to a day at the spa? Olga Pettersson of the SciLifeLab at Uppsala University lets her molecules relax for up to a week at room temperature to enable them to untangle and to achieve better chemical purity and sequencing output.
It was one of many practical pointers shared by presenters at the popular three-day gathering of PacBio users in Leiden, Netherlands last month. SMRT Leiden featured the scientific discoveries and analytical achievements of more than 30 speakers.
Inge Kjaerbolling of the Technical University of Denmark shared her tricks using the new Aspergillus genomes for linking compounds to metabolite clusters. Zev Kronenberg, whose name recently graced Science for the cover story on the great apes comparative genome, discussed some of the tools he has developed in his new role as a Phase Genomics scientist: Polar Star for breaking chimeric PacBio contigs using Hi-C; Matlock for Hi-C data pre-processing; and FALCON-Phase, a method for using Hi-C to scaffold FALCON-Unzipped PacBio genomes.
Day 1 also featured several scientific talks about large genome projects, including: the Bat1K initiative from Sonja Vernes of the Max Planck Institute; the genome sequencing of the Zika carrier, the Aedes aegypti mosquito, from Rockefeller University’s Ben Matthews; the tomato genome project, from Mohamed Zouine of INRA/INP Toulouse; and the maize genome from Doreen Ware of USDA/Cold Spring Harbor, who prophesied: “The next green revolution will be data driven.”
Day 2 kicked off with a densely packed and awe-inspiring keynote talk by Shinichi Morishita of the University of Tokyo, covering topics with implications for human disease, speciation, structural variants, haplotype phasing, and metagenomics. It was followed by a talk from Laurence Ettwiller of New England Biolabs on a new full-length transcriptome protocol for bacteria, as well as a preview of the forthcoming version of structural variation calling in PacBio’s official SMRT Link/SMRT Analysis software suite, by PacBio scientist Armin Töpfer.
Human disease was the topic of several other presentations. Stuart Scott from the Icahn School of Medicine in New York explained how he uses SMRT Sequencing to identify and phase variants important in human disease. Marjolein Weerts from Erasmus MC, Netherlands, presented her work on inferring cancer signatures on the basis of low-frequency mitochondrial DNA (mtDNA) circulating in the bloodstream. And Birgitt Schuele of the Parkinson’s Institute discussed her latest publication that applied PacBio’s No-Amp method to sequence repeat expansions in the ATXN10 gene.
Dutch scientists Alex Hoischen and Yahya Anvar discussed additional applications in human genetics and precision medicine, and Martin Pollard of the Sanger Institute delved into population genomics, describing an effort to generate an expanded reference panel of MHC haplotypes from African populations.
The third day of the event was the SMRT Informatics Developers Conference, which featured a mixture of bioinformatics talks and open discussion. Speakers went into depth about de novo assembly, structural variation, amplicon sequencing, and PacBio’s Iso-Seq method for sequencing full-length RNA transcripts.
Sergey Koren’s talk about TrioBinning, a new approach for complete haplotype reconstruction, was especially popular, and David Heller (Max Planck) illustrated his graph-based approach, SVIM, for calling structural variants using long reads.
For in-depth coverage of the event, check out the four-part Medium series by PacBio Scientist Liz Tseng.
To understand the epigenetic regulation of brain function and behavior, scientists are turning to ants. To understand the ants, they are applying the accurate, long reads of SMRT Sequencing.
While the genetic codes of many types of ant have been combed through thanks to several genomes assembled through whole-genome shotgun sequencing, there have been only brief glimpses and guesses regarding gene regulation. Existing assemblies are highly fragmented drafts, making epigenetic studies nearly impossible.
Eager to determine the epigenetic changes responsible for phenotypic and behavioral plasticity in Camponotus floridanus and Harpegnathos saltator ant species, a team of researchers from the Epigenetics Institute of the University of Pennsylvania’s Perelman School of Medicine used SMRT Sequencing to de novo assemble the two genomes, which had been previously sequenced using short reads.
Improved genome continuity led to comprehensive annotations of both protein-coding and non-coding RNAs, and answered some questions about the differential gene expression that allows some worker ants to become acting queens in their colonies.
In a paper published in Cell Reports, first author Emily J. Shields, lead author Roberto Bonasio and their collaborators described how they solved some mechanistic mysteries through PacBio long-read sequencing.
Harpegnathos worker ants are characterized by their unique reproductive and brain plasticity that, in the absence of a queen, allows some of them to transition to a queen-like phenotypic status called “gamergate,” which is accompanied by major changes in brain gene expression.
Previous work by the group in Harpegnathos and in the more conventional Florida carpenter ant Camponotus floridanus had suggested that epigenetic pathways, including those that control histone modifications and DNA methylation, might be responsible for differential deployment of caste-specific traits; pharmacological and molecular manipulation of histone acetylation has been shown to affect caste-specific behavior in Camponotus ants, suggesting a direct role for epigenetics in their social behavior.
“Although the molecular mechanisms by which environmental and developmental cues are converted into epigenetic information on chromatin remain subject of intense investigation, it has become clear that non-coding RNAs play an important role in mediating this flow of information,” the authors write.
In particular, they were interested in long non-coding RNAs (lncRNAs), which are transcripts longer than 200 nucleotides that are not translated into proteins. Although lncRNAs have been annotated extensively in human, mouse, bees, zebrafish, Drosophila melanogaster and Caenorhabditis elegans, no comprehensive annotation of lncRNAs in ants had been reported, limiting the reach of ant species as model organisms.
As many cis regulatory and epigenomic mechanisms take place at short-to-medium range (10–100 kb), the scientists wanted to span large repetitive regions and create longer gap-free regions of sequence (i.e., longer contigs) than those produced by previous short-read assemblies, so they turned to PacBio.
“Long PacBio reads allowed us to assemble across longer repeats than previously possible, greatly improving the contiguity of the Harpegnathos and Camponotus genomes,” the authors write.
They sequenced genomic DNA isolated from Harpegnathos and Camponotus workers using SMRT Sequencing, obtaining a sequence coverage of 70-fold for Harpegnathos and 53-fold for Camponotus, with longer contigs (on average more than 30-fold larger than a prior 2010 assembly) and scaffolds with fewer gaps. The assemblies have scaffold N50 sizes larger than 1 Mb, and gaps smaller than in all other insect genomes available on NCBI at the time of writing.
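For readers less familiar with the N50 statistic quoted above: it is the length of the sequence at which, counting from the longest scaffold or contig downward, at least half of the total assembly length has been covered. The snippet below is a minimal Python sketch of that calculation. The function name `n50` and the example lengths are made up for illustration and are not taken from the paper or from any assembly tool.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in base pairs (illustrative only).
example_scaffolds = [2_000_000, 1_500_000, 800_000, 400_000, 300_000]
example_n50 = n50(example_scaffolds)
print(example_n50)
```

With these made-up lengths, the two longest scaffolds already cover more than half of the 5 Mb total, so the N50 is the length of the second one. A more fragmented assembly of the same total size would have a smaller N50, which is why the statistic is commonly used as a shorthand for contiguity.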
The UPenn team annotated protein-coding genes using a combination of methods, and they discovered more than 300 high-confidence lncRNAs, several of which displayed developmental-, brain-, or caste-specific expression patterns, suggesting important roles in development and brain function.
They were also able to identify some biologically relevant genes missing in the older versions of the genome assemblies. Most notably, a Gp-9-like gene previously unannotated in the Harpegnathos genome was found to be differentially expressed in worker brains compared to gamergates. Mass spectrometry analyses identified two peptides mapping exactly to the newly predicted sequence, confirming the accuracy of the updated gene model.
“This gene was not previously detected as differentially expressed, likely because its closest homolog in the old annotation contains many sequence disparities, reducing the RNA-seq coverage mapped to this gene in both castes,” the authors write.
The UPenn team will use the new assemblies to direct their future explorations of neuroepigenetics in ants. They also hope the work will have a wider impact on the field.
“Our greatly improved Harpegnathos and Camponotus assemblies deliver several critical benefits to further development of these ant species into molecular model organisms,” the authors conclude. “These improvements… will lead to greater understanding of the genetic and epigenetic factors that underlie the behavior of these social insects.”
When humans are infected with the Marburg virus, the result is often lethal, with hemorrhagic fever and other symptoms similar to those of Ebola. When bats are infected, the result is… nothing. The tiny mammals remain asymptomatic.
In order to crack this antiviral mystery, a multi-institutional team of scientists sequenced, assembled and analyzed the genome of the bat species Rousettus aegyptiacus, a natural reservoir of Marburg virus and the only known reservoir for any filovirus.
Their findings contradicted previous hypotheses about bat antiviral immunity, which assumed that bats had enhanced antiviral defenses, controlling viral replication early in infection, and developing effective adaptive immune responses as a result. The new analysis suggests that an inhibitory immune state may exist instead.
Led by Boston University researchers Thomas B. Kepler and Stephanie S. Pavlovich, with Gustavo Palacios of the United States Army Research Institute of Infectious Diseases and others from Columbia University, the University of Nebraska, the NIH’s National Center for Biotechnology Information, and the Viral Special Pathogens Branch of the Centers for Disease Control and Prevention, the study in Cell describes several differences between immune responses in bats and humans.
Among them were expanded and diversified sets of KLRC/KLRD natural killer cell receptors, MHC class I genes, and type I interferons in the bats, all of which differ dramatically from their functional counterparts in other mammals.
“Such concerted evolution of key components of bat immunity is strongly suggestive of novel modes of antiviral defense,” the authors write.
The stark difference between bat and primate antiviral responses has long motivated scientists to characterize the genes involved in the immune system of bats, but previous efforts relied on genomes generated with low-coverage sequencing or with only short-read sequencing technologies. Such assemblies limit the ability to resolve repetitive regions of the genome where important immune gene loci reside, the authors note. So they turned to PacBio long-read sequencing, combined with paired-end short reads, to generate a high-quality annotated genome for the Egyptian rousette bat. The result is the 1.91 Gb Raegyp2.0 genome, the most contiguous bat genome available.
They used the genome to study two large classes of immune genes: natural killer (NK) cell receptors and type I interferons (IFNs). Previous studies have reported the absence of canonical NK cell receptors in bat genomes, and others have suggested that significant differences exist in type I IFNs between bats and humans. Diving deeper, the new study found an unusual expansion of the KLRC (NKG2) and KLRD (CD94) gene families in R. aegyptiacus relative to other species, “showing genomic evidence of unique features and expression of these receptors that may result in a net inhibitory balance within bat NK cells.”
“The expansion of NK cell receptors is matched by an expansion of potential MHC class I ligands, which are distributed both within and, surprisingly, outside the canonical MHC loci,” the authors note.
They also observed that the type I IFN locus is considerably expanded and diversified in R. aegyptiacus, with members of the IFN-ω subfamily being induced after viral infection and showing antiviral activity.
“All these features strengthen the notion of the unique biology of bats and suggest the existence of a distinct immunomodulatory mechanism used to control viral infection,” the authors conclude.
“Our findings are consistent with the hypothesis that certain key components of the immune system in bats have coevolved with viruses toward a state of respective tolerance and avirulence, although tolerance is likely not the only mechanism at play.”
The team notes that definitive tests of their hypotheses may be possible with the development of further experimental reagents for cytometry and biochemical intervention, and that such reagents are being developed now with information made available by the completed genome project.
And while the genome for R. aegyptiacus is providing useful information about how bats resist viral infections, it is just one species of interest to scientists who would like to better understand the genetics of bats in order to shed light on human and ecological biology. The Bat1K initiative is an effort by more than 140 scientists around the world to decode the genomes of all 1,300 species of bats using SMRT Sequencing and other technologies.
The first reference genome for maize variety B73, completed in 2009, was a major milestone, and an improved version released by Cold Spring Harbor Laboratory scientists in 2017 provided a deeper dive into the genetics of the complex crop. Yet even this new robust reference is not enough for Kelly Dawe, Doreen Ware and Matt Hufford, who have taken up another ambitious project: creating a 26-line pangenome reference collection in just two years.
“Maize is not only an important crop, but an important study species for answering basic questions about how plants grow and adapt to different environments,” says Ware, a computational biologist at USDA and Cold Spring Harbor Laboratory.
Interestingly, the genome differs significantly between individuals. A study comparing genome segments associated with kernel color from two inbred lines revealed that 12 percent of the gene content was not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit more than 98 percent sequence similarity. The new project will create multiple reference genomes to reflect this diversity.
“By relying on a single type specimen as the sequence reference for most of the genetic information in maize, we may be missing much of the highly valuable natural variation in maize,” Ware says.
Beyond B73, the most extensively researched maize lines are the core set of 25 inbreds known as the NAM founder lines, which represent a broad cross section of modern maize diversity. SMRT Sequencing and BioNano optical mapping, which were essential in the creation of the groundbreaking 2017 B73 maize reference, will be used in the new $2.8 million National Science Foundation-funded project led by Dawe at the University of Georgia. They will create comprehensive, high-quality assemblies of these 25 inbreds, plus an additional line containing abnormal chromosome 10.
Plant genomes are notoriously difficult to sequence, and maize is particularly challenging because the vast majority of its 2.3 Gb diploid genome, a staggering 85 percent, is made up of highly repetitive transposable elements that other types of sequencing can’t address. Understanding these regulatory and structural elements is crucial to modern breeding efforts that aim to improve productivity across marginal environments and under a changing climate.
“The sequenced lines will include varieties from both tropical and temperate regions, and their sequences should help us understand how corn has adapted to these different environments,” said Hufford, a co-principal investigator on the project and assistant professor at Iowa State University. “Understanding the ways corn adapts can facilitate development of lines for novel conditions.”
PacBio Sequencing will be essential as the team assesses the role of structural variation such as presence-absence and copy number variation in the determination of agronomic traits, Ware says.
The assemblies, along with information about the genes and their expression patterns, will be cataloged and made available to the public through her Gramene.org data resource.
“To go from a single reference to a broad perspective on the entire genetic repertoire of genes and gene expression patterns will be a major step forward in how we approach genome analysis in crops,” said Dawe, Distinguished Research Professor in UGA’s Franklin College of Arts and Sciences department of genetics and principal investigator on the grant. “It’s something that has not happened for any crop at this scale.”
Read about Doreen Ware’s original comprehensive maize genome project and about efforts at Corteva Agriscience™, Agriculture Division of DowDuPont™ (formerly DuPont Pioneer) to create their own multiple maize reference library.