This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
The SOLVE-RD research program, a collaboration of 21 participant organizations in 10 nations, announced it has received a €15 million grant from the European Union’s Horizon 2020 initiative. SOLVE-RD aims to improve the diagnosis and treatment of rare diseases, which in total affect millions of Europeans. The program is applying novel diagnostic tools to around 19,000 cases unsolved by prior short-read exome sequencing. A prominent part of the planned “multi-omics” approach is long-read genome sequencing, which will reveal the large amount of potentially disease-causing genetic variation that is not accessible with short-read DNA sequencing. SOLVE-RD plans to apply long-read genome sequencing to 500 cases.
Recent studies with PacBio long-read genome sequencing have shown that each human genome has upwards of 20,000 structural variants (differences ≥50 bp), which affect more base pairs than single nucleotide variants and small insertions and deletions together. Short-read sequencing fails to detect most of these structural variants, which often lie in repetitive regions of the genome or are larger than short reads can span. In 2017, Merker et al. reported the first use of PacBio long-read genome sequencing to identify a disease-causing structural variant in a Mendelian disease case undiagnosed by short-read genome sequencing. The study applied low (8-fold) coverage sequencing on the Sequel System to discover structural variants. By applying long-read sequencing to a larger cohort of subjects with rare diseases, the SOLVE-RD program promises to provide valuable insights into the disease classes for which this technology is most useful. At ASHG 2017, Han Brunner, a coordinator of the SOLVE-RD consortium, described initial work on this effort using the Sequel System.
 Huddleston J, et al. (2017). Genome Research, 27(5):677-685.
 Merker JD, et al. (2018). Genetics in Medicine, 20(1):159-163.
Maize is amazingly diverse. A study comparing genome segments from two inbred lines, for instance, revealed that half of the sequence and one-third of the gene content was not shared – that’s more diversity within the species than between some other species, for example humans and chimpanzees, which exhibit more than 98 percent sequence similarity.
So how can researchers and commercial breeders rely upon a single reference genome to represent the genetic diversity in their germplasms? More and more scientists are deciding they cannot.
At DuPont Pioneer, where DNA sequencing is paramount for R&D to reveal the genetic basis for traits of interest in a variety of commercial crops, an ambitious project has begun: A pan-genome reference collection based on high-speed, high throughput SMRT Sequencing and assembly.
As described in this case study, the company has developed a way to assemble high-quality reference genomes in just one month, and it has started to create them for several of their own elite breeding lines, as well as select wild strains.
Having multiple genome assemblies of the same high standard for several genotypes will be increasingly important as researchers try to achieve a greater understanding of the impacts of structural variation on plant genomes, says Research Scientist Kevin Fengler, of the Data Science and Informatics group at DuPont Pioneer.
“We want to focus on true structural variation and have confidence in the new discoveries we find in these genomes,” he adds. “Until now, focusing on one reference genome has limited our view. We are just beginning to explore what we have been missing all along.”
From commercialization to crop improvement to answering basic questions about biology, generating and analyzing multiple reference genomes has myriad benefits in a variety of lab settings.
“It has become clear that one single genome is not enough to represent the huge amount of variation in rice genomes,” writes Zhi-Kang Li of the Chinese Academy of Agricultural Sciences in this recent Nature Scientific Data paper about the assembly of an early-mature japonica rice genome.
Rod Wing of the Arizona Genomics Institute agrees. He is aiming to build high-quality reference genomes for 23 additional species of rice using SMRT Sequencing. Beyond providing highly accurate, long-read sequence data, Wing said the PacBio platform is also useful for full-length RNA sequencing and its ability to characterize the methylome. As he notes in this blog post and case study, he can take rice tissues at several developmental stages and under many different environmental conditions, isolate RNA, and do Iso-Seq analysis on those samples to enable whole plant transcriptome analysis, which could help the community map gene networks.
Other researchers are also eager to expand their set of references:
- At CROPS 2017, a three-day event focused on genomic technologies and their use in crop improvement and breeding programs, Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology and the Joint Genome Institute discussed his use of PacBio SMRT Sequencing to create several cotton genomes as well as genomes for Brachypodium, peanut, sorghum, and more. As he notes in this blog post, SMRT Sequencing has been successful even for very challenging plant genomes with highly repetitive elements, GC-rich regions, areas of high and low complexity, and varying degrees of ploidy.
- Grapes are getting a thorough cataloging, thanks to the efforts of the Cantu Lab at the University of California, Davis. A UC Davis colleague, Steve Knapp, is also leading efforts to expand the selection of reference genome assemblies for strawberries.
- There is also a pressing need to add to the germplasm of the world’s most valuable plant: coffee. As noted in this case study, there are about 100 species in the Coffea genus, but the particular strains cultivated to produce coffee (a market valued at $90 billion) have very little genetic diversity. In her quest to address disease and climate pressures that threaten the plant, Cornell University researcher Marcela Yepes has begun to create new references for several varieties, starting with Coffea arabica and Coffea eugenioides.
We are looking forward to hearing about additional efforts to expand the reference genome library at the ongoing Plant and Animal Genome XXVI Conference. Look out for PacBio @ PAG, and swing by booth 418 to say hi. We will also be hosting a half-day informatics conference on Wednesday, Jan. 17.
Scientists from the University at Buffalo, Nanyang Technological University, and other institutions published results from an effort to elucidate the Utricularia gibba genome using SMRT Sequencing.
U. gibba, also known as the humped bladderwort, is an aquatic carnivorous flowering plant with a remarkably small genome, especially in light of two whole genome duplication events. Genome sequencing and annotation data are reported in the PNAS publication “Long-read sequencing uncovers the adaptive topography of a carnivorous plant genome” from lead author Tianying Lan, senior author Victor Albert, and collaborators. The scientists were interested in using the plant’s genome to learn more about the post-duplication deletion process as well as traits specific to carnivorous plants.
This particular plant’s genome was previously sequenced with short-read technology, producing an 82 Mb assembly “which revealed that its genome gained and deleted gene duplicates significantly faster than those of other genomes,” the scientists note. By applying PacBio long-read technology, the team was able to significantly improve on the original assembly. The de novo genome project resulted in an assembly with a contig N50 of nearly 3.5 Mb. The total size was about 100 Mb, adding more than 18 Mb missed by the short-read assembly. Twenty-four contigs included telomeres, with four of those representing complete chromosomes. “Remarkably, base pair correction using either the PacBio data or Illumina MiSeq reads from our previous assembly led to extremely minor improvements, only 0.071% and 0.01% of total bases, respectively,” Lan et al. write.
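For readers less familiar with the contig N50 metric quoted above, here is a minimal Python sketch of how it is commonly computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions, not data from the U. gibba assembly.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in bp), not the real contigs.
toy_contigs = [5_000_000, 3_500_000, 2_000_000, 1_000_000, 500_000]
toy_n50 = n50(toy_contigs)
print(f"N50 = {toy_n50 / 1e6:.1f} Mb")
```

A higher N50 means that more of the assembly sits in long contigs, which is why the metric is a common shorthand for contiguity.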
The authors present a more complete count of protein-coding genes thanks to the improved assembly. The tally came to 30,689, a nearly 8% increase from estimates based on the short-read assembly. In addition, “unlike the far shorter scaffolds from [the prior] assembly, our largely chromosome-sized contigs permitted us to conservatively distinguish the [whole genome duplication]-derived and tandem duplicate portions of U. gibba’s genome adaptive landscape,” they write. That unique information enabled the team to discover that tandem duplication events were “enriched in metabolic functions potentially important for a carnivorous plant” — including cysteine protease genes expressed only in the plant’s trap — while syntenic duplicates were “enriched for transcription factor functions,” the scientists report. “Such small-scale, tandem duplicates are therefore revealed as essential elements in the bladderwort’s carnivorous adaptation.”
Transposable elements were another area of investigation, with many more TE-derived events found in the PacBio assembly compared to the previous one. “Serving as a good illustration of the repeat discovery power of PacBio sequencing, ∼47% of the total TE assembly space comprised [large retrotransposon derivatives], whereas these elements amounted to only ∼14.6% of TEs in the previous short-read assembly,” the authors write.
For more information, check out this New York Times article covering the project and hear Tanya Renner, paper co-author, speak about carnivorous plants at the PacBio workshop held at the upcoming Plant & Animal Genome Conference on Monday, January 15th at 12:50 PM. Reserve your seat or register for a recording of the presentation.
One of our favorite January traditions is taking part in the Precision Medicine World Conference (PMWC), a three-day Silicon Valley event focused on exploring challenges and opportunities in personalized medicine. Taking place this year January 22-24 at the Computer History Museum, the slate of more than 350 speakers in 65-plus sessions will offer a cutting-edge look at the field. The conference is co-hosted by Stanford Health Care; the University of California, San Francisco; Johns Hopkins University; the University of Michigan; Duke University; and Duke Health.
On the docket this year: PMWC will cover topics including artificial intelligence and machine learning, CRISPR, immunotherapy, microbiome studies, and more. We’re particularly interested in presentations about national sequencing efforts such as the All of Us initiative, clinical sequencing, and infectious disease monitoring. The conference is also well-known for its award program. The prestigious PMWC Luminary Award is going this year to Emmanuelle Charpentier for CRISPR/Cas9 innovations and to Elizabeth Blackburn for discovering telomeres. Meanwhile, the PMWC Pioneer Award will be given to Ronald Levy for pioneering antibodies to treat cancer, Sir John Bell for his role in precision medicine efforts in the UK, and Alan Ashworth for discoveries in breast cancer and more.
Lori Aro, our senior director of clinical genomics, will be speaking at PMWC in the Genomic Profiling Showcase on January 24th at 11:15 am. Her presentation will update attendees on how long-read PacBio sequencing can be a good fit for clinical assay development since it provides the longest average reads, highest consensus accuracy, and most uniform coverage of any sequencing technology available today.
There’s still time to register for PMWC 2018 and they’re offering a 10% discount until January 11. We hope to see you at the meeting!
In an unprecedented crowd-sourced effort stoked by social media, 72 scientists collaborated via 25 conference calls and 3,323 emails to produce a new high-quality Aedes aegypti mosquito genome.
Assembled using PacBio long-read sequencing, the resource could provide the DNA map researchers need to combat the pest and the infectious diseases it spreads, including Zika, dengue, chikungunya, and yellow fever.
Eager to share the results with the scientific community, lead author Leslie B. Vosshall, first author Benjamin Matthews, both of Rockefeller University, and colleagues at several other institutions, published a pre-print of their paper, “Improved Aedes aegypti mosquito reference genome assembly enables biological discovery and vector control” online at bioRxiv.
In it, they describe how they improved upon previous efforts which failed to produce contiguous sequences of the large (~1.3 Gb) and highly repetitive Ae. aegypti genome. The most recent previous assembly, AaegL4, for instance, produced chromosome-length scaffolds but suffered from short contigs and more than 31,000 gaps.
Using SMRT Sequencing data, the team produced an assembly that is highly contiguous, representing a 93% decrease in the number of contigs. The PacBio contigs were scaffolded end-to-end to the three Ae. aegypti chromosomes using Hi-C technology, resulting in the new AaegL5 reference. They were able to validate local structure, predict structural variants between haplotypes, and generate a dramatically improved gene set annotation.
As co-author Jeffrey Powell, a mosquito researcher at Yale University, told the New York Times at the start of the Aedes Genome Working Group project: “If we’re going to control the creature, we need to know it frontwards and backwards.”
“Having a complete genome sequence of the beast will give us a fundamental understanding of its biology that you can’t get any other way,” he added.
The researchers have already used the new assembly to investigate several scientific questions that could not be addressed with the previous genome, a few of which include:
- The structure of the elusive sex-determining “M” locus. Population suppressing strategies such as Sterile Insect Technique and Incompatible Insect Technique require that only males are released. A strategy that connects a gene for male determination to a gene drive construct has been proposed to effectively bias the population towards males over multiple generations, the authors note.
- More complete accounting of insecticide-detoxifying glutathione-S-transferase genes. This could catalyze the search for new resistance-breaking insecticides.
- The identity of multigene families that encode chemosensory receptors. A doubling in the known number of chemosensory receptors provides opportunities to link odorants on human skin to mosquito attraction, a key first step in the development of novel mosquito repellents.
- The evolution of insecticide resistance and vector differences. Mapping new candidates for dengue vector competence could help devise geographically-specific strategies.
“We predict that AaegL5 will catalyze new biological insights and intervention strategies to fight the deadly arboviral vector,” the authors conclude. “The high-quality genome assembly and annotation described here will enable major advances in mosquito biology and has already allowed us to carry out a number of experiments that were previously impossible.”
In a recent publication, scientists from the University of California, Davis, and PacBio reported results from an investigation of alternative splicing associated with a repeat expansion in the gene linked to fragile X syndrome. They used SMRT Sequencing to detect full-length isoforms (Iso-Seq analysis) associated with individuals at risk of FXTAS, an adult-onset neurodegenerative disorder.
“Altered expression of the FMR1 splicing variants landscape in premutation carriers” comes from lead author Elizabeth Tseng, senior author Flora Tassone, and collaborators. Previous studies from the Tassone lab had used SMRT Sequencing to detect full-length isoforms in samples from premutation carriers (individuals with more CGG repeats than a healthy person, but not enough to cause fragile X syndrome) and had identified a number of known isoforms. In this study, the scientists aimed for a more comprehensive analysis of alternative splicing of the FMR1 gene. “Although evidence suggests a strong role for regulation of the FMR1 gene expression in clinical outcomes,” the team reports, “there have been no detailed molecular characterizations on the role of alternative splicing in the development of FMR1 premutation associated disorders.”
To tackle this challenge, they deployed SMRT Sequencing to characterize transcript isoforms of the FMR1 gene in tissue samples from three premutation carriers and three matched controls, plus blood samples from 30 premutation carriers and 15 controls. The tissue samples were collected from cerebellum, testis, muscle, and heart. Iso-Seq analysis yielded up to 28,000 full-length transcript reads from the premutation carrier tissue samples. In total, the authors identified 49 unique FMR1 isoforms, including 16 previously characterized isoforms and a number of novel ones. This study has revealed new splicing patterns and a novel 140-bp exon that were shown to have elevated expression in premutation and FXTAS samples compared to normal controls. Of the 49 FMR1 isoforms, the scientists note, “30 of them were exclusively detected in premutation carriers based on sequencing results.”
This study underscores the power of Iso-Seq analysis in comprehensively characterizing full-length transcript isoforms for a gene of interest. By eliminating the need for sequence assembly, as is required using short-read sequencing methods, Iso-Seq analysis returns each isoform sequence in its entirety in a single read, thereby enabling the discovery of novel exons, intron retention, fusion transcripts and, ultimately, previously undetected novel isoforms.
“Our findings suggest that an abnormal alternative splicing process is present in individuals with premutation alleles,” the team concludes. “The characterization of the expression levels of the different FMR1 isoforms is fundamental for understanding the regulation of the FMR1 gene as imbalance in their expression could lead to an altered functional diversity with neurotoxic consequences.”
Nematodes are both simple and complex, making them one of the most attractive animal taxa to study basic biological processes, including genome evolution. Studies in the nematode Caenorhabditis elegans, for instance, have provided invaluable insights into almost all aspects of biology, from developmental to neurobiology and human diseases.
However, the high degree of fragmentation of current genome assemblies for many organisms complicates almost all types of genomic analysis. As the authors of a recent Cell Reports paper, “Single-Molecule Sequencing Reveals the Chromosome-Scale Genomic Architecture of the Nematode Model Organism Pristionchus pacificus,” point out, “general questions of chromosome evolution cannot be addressed if genome assemblies consist of thousands of contigs.”
SMRT Sequencing was able to remedy this problem. By sequencing the genome of P. pacificus with the PacBio Sequel System, Christian Roedelsperger, Ralf J. Sommer, and other colleagues from the Max Planck Institute for Developmental Biology generated an assembly that reduced the number of contigs from 12,395 to 135 and simplified their search for clues into developmental systems drift, the genetics of phenotypic plasticity, and genome evolution.
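To put the contig counts above in perspective, the short Python calculation below expresses the drop from 12,395 to 135 contigs as a percentage. It is only a worked arithmetic sketch; the helper name is an illustrative assumption.

```python
def percent_reduction(before, after):
    """Percentage decrease going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Contig counts quoted in the post for the P. pacificus assemblies.
reduction = percent_reduction(12_395, 135)
print(f"{reduction:.1f}% fewer contigs")
```

By this measure the new assembly has roughly 99% fewer contigs than the earlier one.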
P. pacificus has become an increasingly important model species, used in comparison to two other free-living nematode species, C. elegans and C. briggsae, to investigate how various biological pathways and their underlying regulatory programs are modified during evolution.
Populated primarily by self-fertilizing hermaphrodites with a low frequency of males, all three species undergo frequent recombination among different genetic lineages. Their genomes range in size from 100-160 Mb, but all have five autosomes and one sex chromosome. Many of their shared features are controlled by completely different molecular programs, a phenomenon referred to as ‘‘developmental systems drift,” making them particularly useful in comparative biology.
P. pacificus is also one of the most promising animal models in the investigation of “phenotypic plasticity,” the property of a single genotype to form distinct phenotypes in response to different environmental influences. In P. pacificus, for instance, young nematode larvae either develop directly into adults or into non-feeding, long-lived dauer larvae, which can disperse to find more suitable environments. They also exhibit two different mouth morphs that are specialized for either bacterial or predatory feeding.
For these reasons, the Max Planck team was particularly interested in unravelling some of their genetic mysteries. They sequenced the genome of the P. pacificus reference strain (PS312) on the Sequel System to 100-fold coverage. The resulting de novo assembly enabled ordering and orientation of contigs for all six P. pacificus chromosomes. “This allowed us to robustly characterize chromosomal patterns of gene density, repeat content, nucleotide diversity, linkage disequilibrium, and macrosynteny,” the authors write.
Among their findings was the discovery of a major translocation from autosomes to the sex chromosome during the evolution of the lineage leading to C. elegans. “These findings highlight the impact of large-scale chromosomal rearrangements in nematode genome evolution and emphasize the need for high-quality genome assemblies to robustly study these events,” add the authors. “The new P. pacificus assembly will allow more rigorous genomewide analysis in all fields of genomics, and will greatly enhance the capacity to map and identify causal genes for various phenotypes.”
Following on the heels of the first nearly complete assembly of the hexaploid bread wheat genome, scientists from the University of California, Davis, the USDA Agricultural Research Service, Johns Hopkins University, and many other institutions recently published a high-quality genome assembly for one of wheat’s diploid ancestors. Both efforts incorporated SMRT Sequencing to improve contiguity of the assemblies. The new publication reveals that the ancestral plant’s genome has evolved more quickly than usual, driven largely by repeats.
The paper, “Genome sequence of the progenitor of the wheat D genome Aegilops tauschii,” comes from senior author Jan Dvořák; lead authors Ming-Cheng Luo, Yong Gu, Daniela Puiu, Hao Wang, and Sven Twardziok; and collaborators. “Aegilops tauschii is the diploid progenitor of the D genome of hexaploid wheat,” the scientists note. “The large size and highly repetitive nature of the Ae. tauschii genome has until now precluded the development of a reference-quality genome sequence.”
To tackle this difficult genome, the team used a number of genome analysis technologies, including SMRT Sequencing, BAC sequencing, optical mapping, and more. Scientists from Johns Hopkins contributed 35-fold PacBio coverage of the Ae. tauschii genome, which is 4.3 Gb in size and organized into seven chromosomes.
With a high-quality assembly in hand, the team turned to exploring unique features of the ancestral wheat genome. “Compared to other sequenced plant genomes … the Ae. tauschii genome contains unprecedented amounts of very similar repeated sequences,” the scientists report. Transposable elements, including the frequent long terminal repeat retrotransposons, accounted for nearly 85% of the sequence.
“Our genome comparisons reveal that the Ae. tauschii genome has a greater number of dispersed duplicated genes than other sequenced genomes and its chromosomes have been structurally evolving an order of magnitude faster than those of other grass genomes,” the team writes. “We propose that the vast amounts of very similar repeated sequences cause frequent errors in recombination and lead to gene duplications and structural chromosome changes that drive fast genome evolution.”
Scientists championed their cases, school children sifted through species, and thousands of members of the public from around the globe took to social media to weigh in. Now the results are in, and high-quality genome assemblies for 25 organisms integral to United Kingdom ecosystems can begin.
As mentioned last month, we teamed up with the Wellcome Trust Sanger Institute on a project to celebrate their twenty-fifth anniversary. Sanger scientists will use the Sequel System and complementary technologies to produce reference-grade assemblies for squirrels, scallops, and sharks, as well as balsam, blackberries, bats, butterflies, bees, and many others.
The final five of the 25 candidates were chosen by a public vote via the I’m a Scientist, Get Me Out of Here campaign. After five weeks of online engagement, more than 130 live chats between school children and scientists, and nearly 5,000 votes, the results were announced December 8. They include:
- Common Starfish
- Fen Raft Spider
- Lesser Spotted Catshark
- Asian Hornet
- Eurasian Otter
“The project could reveal why some brown trout migrate to the open ocean, whilst others don’t, or tell us more about the magneto receptors in robins’ eyes that allow them to ‘see’ the magnetic fields of the Earth,” the Institute states in their announcement. “It could also shed light on why Red Squirrels are vulnerable to the squirrel pox virus, yet Grey Squirrels can carry and spread the virus without becoming ill.”
Once completed, the assemblies will be made publicly available for future studies to understand the biodiversity of the UK and aid the conservation and understanding of these species.
Once again, we’d like to congratulate the Sanger team on a remarkable 25 years!
The ability to study the speciation of an animal in real-time is a dream come true for evolutionary and developmental biologists. A group of Japanese researchers has gotten that opportunity, thanks in part to SMRT Sequencing.
Scientists at the University of Tokyo were the first to create a reference genome for an inbred strain of the medaka fish (Oryzias latipes), genome size ~800 Mb, in 2007. The genome assembly was created using Sanger sequencing, but contained low-quality regions and 97,933 sequence gaps. So, the team started from scratch with long-read sequencing to generate genome assemblies with far less missing sequence.
In a paper published in Nature Communications, senior authors Hiroyuki Takeda and Shinichi Morishita report new assemblies generated via PacBio long-read sequencing from three geographically isolated medaka strains. These high-quality assemblies allowed them to dive deeper than ever before into the genetics of the fish, and to discover new insights into how previously difficult-to-detect centromeres and large-scale structural variants evolve and contribute to genome diversity during vertebrate speciation.
“Highly accurate long contigs have been useful in enumeration of structural variants (SVs), filling gaps such as centromeres, extending contigs to telomeres, and phasing haplotypes,” the authors write.
The team focused its attention on centromeres, which are difficult to sequence and assemble with short-read and even Sanger platforms. “Once speciation is completed, representative centromeric monomers are highly diversified among 282 species; however, centromere evolution during speciation and its relevance with speciation are unknown,” the authors note.
With this in mind, the team sequenced the genomes of three medaka inbred strains derived from different local subpopulations: HNI from northern Japan, Hd-rR from southern Japan (the strain sequenced for the original reference genome), and HSOK from east Korea.
Originally considered a single species since they can mate and produce healthy offspring under laboratory conditions, the strains have accumulated genetic mutations and phenotypic diversity over a long period of geographical separation. They are now thought to be in the middle of speciation, making them the perfect platform for analyzing this type of evolution, the authors report.
Combining PacBio data with centromere-specific DNA probes and fluorescence in situ hybridization experiments, the team reports obtaining “an unprecedented resource of centromeric repeats of length 20–345 kbp in vertebrates.”
They found that the position of centromeres tended to be preserved unless chromosomal rearrangement took place on a large scale. This happened to the medaka, which remained the same for millions of years, until fissions, fusions, and translocations shaped its genome.
The scientists further discovered that this evolution happened at a different pace among the three strains, depending on the shape and sequence of the centromeres. Centromeric monomers in acrocentric chromosomes evolved more slowly than those in non-acrocentric chromosomes, the team reports. Using AgIn software, the authors estimated methylation states of CpG sites from kinetic SMRT Sequencing information and found divergent methylation patterns, suggesting that centromeres accumulate epigenetic diversity as well as sequence diversity during speciation.
They observed that each local strain has independently experienced thousands of mid-sized (1-50 kbp) insertion events—not enough to cause reproductive isolation, but possibly enough to participate in the regulation of genes and contribute to phenotypic variations.
“These findings reveal the potential of non-acrocentric centromere evolution to contribute to speciation,” conclude the authors. “Further analysis of the mid-sized insertions associated with novel transcripts and increased transcription will provide important clues to the genomic basis for vertebrate speciation.”
Unraveling the role of the microbiome in human health and environmental samples is an emerging priority in scientific study. But despite the best advances in sequencing technology, identifying the bacteria, fungi, and other organisms present in complex samples remains a huge challenge.
Metagenomic shotgun sequencing can read chromosomes, plasmids, and bacteriophages, and comparison to reference genome sequences can be used to place them into putative taxa and species bins, but these methods fail to sufficiently distinguish between genomes that are very similar.
A team of scientists from the Icahn School of Medicine at Mount Sinai, Sema4, and other institutions has come up with a novel solution: a computational method that uses PacBio long-read sequencing of metagenomic DNA to identify methylated motifs and create an epigenetic barcode that enables more precise microbiome analysis.
The method takes advantage of the fact that bacteria and archaea add methyl groups to nucleotides in a highly sequence-specific manner, and that the methylated motifs often differ among species and strains.
The team took advantage of inter-pulse duration values that represent the time it takes a DNA polymerase to translocate from one nucleotide to the next during SMRT Sequencing. This measure can distinguish between methylated and non-methylated bases. They calculated methylation scores across motifs of several bacterial samples and murine fecal samples and created methylation profiles, which were used alongside sequence composition features to assemble contigs into species- and strain-level bins.
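As a rough illustration of the idea, the Python sketch below computes a per-motif methylation score for one contig as the mean difference between observed (native) and unmethylated-control inter-pulse durations at the motif's methylated position. This is a simplified sketch under stated assumptions, not the authors' published pipeline: the function names, the single-strand exact-match motif search, the use of a per-position control IPD track, and all toy values are hypothetical.

```python
from statistics import mean


def find_motif_sites(sequence, motif, methylated_offset):
    """Return 0-based positions of the methylated base for each motif hit.

    Assumption: exact string match on one strand only, no degenerate bases.
    """
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start + methylated_offset)
        start = sequence.find(motif, start + 1)
    return sites


def motif_methylation_score(sequence, motif, methylated_offset,
                            native_ipd, control_ipd):
    """Mean (native - control) IPD over all motif sites; None if no sites."""
    sites = find_motif_sites(sequence, motif, methylated_offset)
    if not sites:
        return None
    return mean(native_ipd[i] - control_ipd[i] for i in sites)


# Hypothetical toy contig with two GATC sites; the A (offset 1) is the
# base assumed to be methylated.
contig = "TTGATCAAGATCTT"
control = [1.0] * len(contig)   # assumed unmethylated baseline IPDs
native = [1.0] * len(contig)
native[3] = 3.0                 # slower polymerase at first GATC "A"
native[9] = 4.0                 # slower polymerase at second GATC "A"

score = motif_methylation_score(contig, "GATC", 1, native, control)
print(f"GATC methylation score: {score}")
```

Repeating this calculation over many motifs would give each contig a vector of scores, and contigs with similar vectors could then be grouped into the same bin, which is the intuition behind using methylation profiles as a barcode.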
In a paper published in Nature Biotechnology, senior author Gang Fang describes how the method was also able to link mobile genetic elements, including antibiotic resistance-encoding plasmids, to their host species in a real microbiome sample.
Although their sequence coverages and composition profiles often differ, plasmid and chromosomal DNA of the bacterial host are methylated by the same set of methyltransferases, resulting in matching methylation profiles, the authors note.
“The biomedical community has long needed a microbiome analysis method capable of resolving individual species and strains with high resolution,” Fang said in a statement.
The method could ultimately prove useful in both research and clinical settings, since it allows for linking mobile genetic elements to their bacterial hosts. This information makes it possible for scientists to more accurately predict virulence and antibiotic resistance of individual bacterial species and strains, among other important traits.
In a new publication, scientists from Anthony Nolan Research Institute and the UCL Cancer Institute present an in-depth analysis of the utility of SMRT Sequencing for Human Leukocyte Antigen (HLA) typing. They assessed more than 100 cell lines and found that PacBio long-read sequencing significantly improves the accuracy of HLA typing.
“Single molecule real-time (SMRT®) DNA sequencing of HLA genes at ultra-high resolution from 126 International HLA and Immunogenetics Workshop cell lines” comes from lead author Thomas Turner, senior author Steven Marsh, and collaborators. The scientists implemented SMRT Sequencing to perform high-resolution HLA typing for 126 B-lymphoblastoid cell lines, including a group of 107 cell lines established in 1987 that is now an essential resource for the community. The goal of the present study was to increase the resolution of the reference sequences in the IMGT/HLA database and improve standardization of HLA typing calls for these cell lines.
HLA genes — used to evaluate donor-recipient tissue match before organ transplant, as well as other immune-related traits — are among the most polymorphic in the genome. Characterizing them has been a challenge with short-read sequencing platforms, but recent efforts to perform full-length sequencing and phasing of the genes with PacBio long-read sequencing have generated impressive results. Indeed, the authors write, “Anthony Nolan’s Histocompatibility Laboratory now routinely uses SMRT sequencing for HLA typing.”
For this project, scientists carried out amplicon sequencing for full-length gene analysis of HLA class I genes and partial analysis of class II genes. In total, they sequenced 931 HLA alleles, with 96% yielding results that matched previously established HLA types for those cell lines. Of the few dozen discrepancies, Turner et al. discovered that 10 harbored novel alleles and 13 were different because of zygosity results, while many others included allele types not previously reported for those cell lines. Confirmation studies showed that these SMRT Sequencing results accurately resolved ambiguities and corrected errors in earlier HLA typing efforts. “We identified numerous discrepancies and novel intronic polymorphisms, extended several alleles to full genomic sequences, and confirmed the existence of some alleles identified by other researchers,” the team reports.
“The work presented here has further demonstrated the efficacy of SMRT sequencing to provide the highest resolution, unambiguous HLA typing data when full genes are sequenced,” the scientists conclude. “This knowledge ensures the continued usefulness of the reference cell line panel as a resource to the immunogenetics community in the age of next generation DNA sequencing.”
What can one koala tell us about an endemic disease that threatens the survival of its species? A great deal, it turns out.
While doing a deep dive into the genome of a wild female koala, a team of Australian scientists led by Matthew Hobbs and Andrew King of the Australian Museum Research Institute was able to unravel some of the complexity of the species-specific gammaretrovirus KoRV.
The results, published recently in Nature, paint a picture of a rapidly evolving and diversifying virus, with implications for the long-term survival of the koala, as well as our understanding of retroviral-host species interactions.
The study allowed the researchers to observe interspecies transmission, multimerization of sequences in the long terminal repeats, and recombination between different retroviruses. These processes have been reported for other retroviruses, but there they occurred millions of years ago rather than very recently, as in the koala.
KoRV is a retrovirus closely related to gibbon ape leukemia virus, and is thought to be the result of an interspecies transmission. Implicated in the pathogenesis of two major koala diseases, hematopoietic neoplasia and the endemic chlamydiosis, it is considered to be a significant threat to the survival of the species.
Several KoRV subtypes have been proposed. Presumed to be the original transmitted strain, KoRV-A is endogenous and widespread in northern Australian koalas, which are thought to be 100 percent infected. KoRV-B is a more recent, more virulent subtype believed to be the result of recombination. Additional variants — KoRV-C, D, E, F, G, H and I — have also recently been identified, but it has been difficult to examine the population of KoRV and KoRV-like insertions in any koala genome.
PacBio long-read sequencing technology finally made it possible. DNA from the koala’s spleen was sequenced to give an estimated overall coverage of 57.3-fold based on a genome size of 3.5 Gb. The authors used SMRT Sequencing due to its capacity to generate sequences of up to ~70 kb that carry full-length (8.4 kb) KoRV insertions and substantial flanking koala genome sequence. This provided a considerable advantage over short reads, which could not resolve the different KoRV insertion sites or types.
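The coverage figure is simple arithmetic: average coverage is the total number of sequenced bases divided by the genome size. The sketch below works backward from the reported 57.3-fold coverage and 3.5 Gb genome size to the implied sequencing yield. The function names are illustrative, and the yield is derived here rather than quoted from the paper.

```python
def total_yield_gb(coverage, genome_size_gb):
    """Total sequenced bases (in Gb) implied by an average coverage."""
    return coverage * genome_size_gb

def average_coverage(yield_gb, genome_size_gb):
    """Average fold coverage given total yield and genome size (both in Gb)."""
    return yield_gb / genome_size_gb

# Figures reported in the post: 57.3-fold coverage of a 3.5 Gb genome.
koala_yield_gb = total_yield_gb(57.3, 3.5)
print(round(koala_yield_gb, 2))  # 200.55
```

In other words, the reported coverage corresponds to roughly 200 Gb of sequence data.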
“Obtaining sequence data from elements (such as retroviruses) that are repeated throughout the genome cannot be done with short read sequencing technology, which is why we used long read PacBio sequencing in our study,” the authors add.
The team reported putative somatic integrations of five distinct forms of KoRV (KoRV-A, KoRV-B, KoRV-D, KoRV-E), as well as germline evidence of KoRV-A. They also found an endogenous recombinant element (recKoRV) in which most of the KoRV protein-coding region was replaced with an ancient, endogenous retroelement.
“This diverse pool of viral variants in the same animal highlights the range of strategies being used by this retrovirus as it invades, or comes to equilibrium with its new host,” the authors add.
“As KoRV-A, B and potentially other more pathogenic KoRV types sweep through koala populations, we might expect to see worsening effects of chlamydial disease. This highlights the importance of understanding the complex mix of KoRV types present in an individual animal.”
Read the full report and learn more about the project in this video presentation from a PAG 2017 Workshop by Rebecca Johnson, a co-author on the paper and the Director of the Australian Museum Research Institute. Johnson will also be presenting her work at the Advances in Genome Biology and Technology meeting in February 2018.
In a recent paper, scientists in Germany call for a genomic database of Klebsiella pneumoniae strains to accelerate the identification of strains and their drug-resistance status. To that end, they used SMRT Sequencing to generate high-quality assemblies for 16 isolates collected in German hospitals.
“Monitoring microevolution of OXA-48-producing Klebsiella pneumoniae ST147 in a hospital setting by SMRT sequencing” comes from lead authors Andreas Zautner and Boyke Bunk, senior authors Jorg Overmann and Wolfgang Bohne, and collaborators at University Medical Center and other institutes in Germany.
The urgency to characterize K. pneumoniae strains comes from the rapid rise of carbapenem-resistant Klebsiella: drug resistance, and increasingly multidrug resistance (MDR), makes these infections a major public health threat. “A continuous monitoring of [strain type] distribution and its association with resistance and virulence genes is essential for early detection of successful K. pneumoniae lineages,” the scientists report.
K. pneumoniae strains carry plasmids encoding different types of carbapenemase, which confers resistance to the carbapenem class of antibiotics. OXA-48 is currently the most common carbapenemase found in K. pneumoniae isolates in Germany, according to the authors; similar strains are commonly found in North Africa, the Middle East, and European countries along the Mediterranean. The team chose to focus on OXA-48 strains, selecting 16 isolates collected in 2013 and 2014 for whole genome SMRT Sequencing.
The technology choice was no accident. “A comprehensive K. pneumoniae database of closed genomes is necessary for a complete understanding of the genome plasticity of these organisms and can significantly improve the tracking of MDR isolates,” the scientists write. With SMRT Sequencing, they were able to generate closed genomes. In most cases they used a single SMRT Cell per strain, and “a consensus concordance of QV60 could be confirmed for all genomes,” they report.
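For readers unfamiliar with QV notation, the standard Phred relationship is QV = -10 × log10(P), where P is the per-base error probability. The sketch below converts QV60 to an error rate and to an expected number of errors per genome. The 5.5 Mb genome length is an approximate, assumed size for K. pneumoniae and is not a figure from the paper.

```python
def qv_to_error_probability(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)

def expected_errors(qv, genome_length_bp):
    """Expected number of erroneous bases in a consensus of the given length."""
    return qv_to_error_probability(qv) * genome_length_bp

p_error = qv_to_error_probability(60)        # one error per million bases
errors = expected_errors(60, 5_500_000)      # assumed ~5.5 Mb genome
print(f"{p_error:.0e} per base, about {errors:.1f} errors per genome")
```

At QV60 the consensus is 99.9999% concordant, so an assembly of this assumed size would be expected to contain only a handful of erroneous bases.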
Based on the 16 genome assemblies, the scientists determined that half of the isolates shared the same type, ST147, and differed by no more than 25 SNPs throughout the core genome. They identified several plasmids, including a novel linear plasmid prophage of Klebsiella oxytoca. “The comparative whole-genome analysis revealed several rearrangements of mobile genetic elements and losses of chromosomal and plasmidic regions in the ST147 isolates,” they write.
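A statement such as "differed by no more than 25 SNPs" refers to the largest pairwise distance among isolates. As a minimal sketch, assuming the core genomes are already aligned to equal length, the code below counts mismatched positions for every pair and reports the maximum. The toy sequences and isolate names are invented, and real pipelines would work from variant calls rather than raw strings.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"  # ignore gaps and Ns
    )

def max_pairwise_snps(alignment):
    """Largest SNP distance between any two isolates in the alignment."""
    return max(
        snp_distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    )

# Toy core-genome alignment for three hypothetical isolates.
toy = {"iso1": "ACGTACGTAC", "iso2": "ACGTACGTAT", "iso3": "ACGAACGTAT"}
print(max_pairwise_snps(toy))  # 2
```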
“Single molecule real-time sequencing allowed monitoring of the genetic and epigenetic microevolution of MDR OXA-48-producing K. pneumoniae,” the team concludes, noting that the approach was amenable to spotting individual SNPs, as well as complex rearrangements.
We’re pleased to announce the winner of this year’s ‘Open Your Eyes to Isoform Diversity’ SMRT Grant, which was launched during the American Association for Cancer Research annual meeting. The grant program, co-sponsored by PacBio and GENEWIZ, received many compelling entries, and it was a challenge choosing just one winner.
Congratulations to Andrew Ludlow, a new faculty member at the University of Michigan, who impressed reviewers with his proposal to investigate the splicing of transcripts regulated by the oncogene NOVA1. Ludlow notes that in lung cancer cells, NOVA1 acts as a splicing enhancer to produce full-length hTERT and promote telomerase activity, which is essential for cancer cell survival. He proposes to explore whether NOVA1 also regulates other critical hallmarks of cancer by studying changes in the transcriptome of a lung cancer cell line upon NOVA1 knockdown. Since NOVA1 regulates alternative splicing, the unambiguous resolution of isoforms enabled by the Iso-Seq method will be key to developing a complete picture of how NOVA1 alters the functionality of key players in biochemical pathways involved in cancer development.
Thank you to all of the applicants who participated in the ‘Open Your Eyes to Isoform Diversity’ grant program.
For another chance to win, check out our latest SMRT Grant Program with GENEWIZ, focused on discovering structural variation.
They are colonizers and killers, growing as large as 2,400 acres, leaving devastation in their wake. Armillaria fungi are the cause of root rot disease in forests, fields, parks, and vineyards in more than 500 host plant species across the world. But despite this huge impact on agriculture, the pathogenicity of Armillaria species has been poorly understood.
A new international study led by Hungarian researchers, published in Nature Ecology & Evolution, reveals novel insights into how the fungus spreads and kills. Lead author György Sipos of the University of Sopron, senior author László Nagy of the Hungarian Academy of Sciences, and an international team of collaborators used PacBio long-read sequencing to generate the haploid genomes of four Armillaria species: A. ostoyae, A. cepistipes, A. gallica, and A. solidipes. They also compared them to 22 related saprotrophic, hemibiotrophic, and mycorrhizal fungi.
In the paper, the team reports that Armillaria genomes are “unusually large” for fungi, containing as much as 85 Mb and nearly 26,000 genes, while members of the Physalacriaceae family used for comparison had genomes smaller than 36 Mb with fewer than 14,000 genes. The scientists note that the Armillaria genomes seem to have “drawn upon ancestral genetic toolkits for wood-decay, morphogenesis and complex multicellularity.”
Based on gene mapping and phylogenomic analysis, the authors estimate that the genus diverged from its closest relatives 21 million years ago. While other organisms tended to lose genes in that time, Armillaria species showed a net genome expansion. “The transposable element (TE) content of Armillaria genomes shows a modest expansion relative to other Agaricales and an even distribution along the scaffolds, suggesting that their genome expansion is not driven by transposon proliferation, as observed in other plant pathogens,” the scientists report.
The team also investigated rhizomorphs, root-like materials through which Armillaria spread. “Rhizomorphs are some of the most unique structures of Armillaria spp. that enable them to become the largest terrestrial organism on Earth,” the scientists write. “They express a wide array of genes involved in secondary metabolism, defense, [plant cell wall] degradation and to a lesser extent pathogenesis, which indicate active nutrient uptake and adaptations to a soil-borne lifestyle.”
In healthy forest systems, some fungi are useful for capturing carbon and culling the weakest trees. But Armillaria are not so choosy, and stresses like drought can hasten the destruction of entire forests. The largest expanse of the fungus, as the New York Times points out, occupies an area in Oregon “the size of three Central Parks and may weigh as much as 5,000 African elephants.”
The authors hope their study will contribute to combating future losses. With more genome resources and follow-up studies, they write, this information “could soon bring the development of efficient strategies for containing the spread and damage of Armillaria root-rot disease in various forest stands within reach.”
Three recent publications report results from transcriptome studies of plants often used for medicinal purposes, all powered by SMRT Sequencing and the Iso-Seq method. The papers on ginseng, Huangqi, and tea collectively show the importance of sequencing full-length isoforms for the most accurate and comprehensive gene expression analysis; they also demonstrate the usefulness of characterizing gene models for complex species in the absence of a reference genome assembly.
In “Isoform Sequencing Provides a More Comprehensive View of the Panax ginseng Transcriptome,” scientists from the National Institute of Horticultural and Herbal Science, Chungbuk National University, and other institutions in Korea describe efforts to characterize the functional genomics of P. ginseng. They studied four tissues (flower, leaf, stem, and root) from the ginseng plant, producing more than 135,000 assembled transcripts, averaging 3.2 kb in length. “We successfully identified unique full-length genes involved in triterpenoid saponin synthesis and plant hormonal signaling pathways, including auxin and cytokinin,” the team reports. “Transposable elements (TEs) were also identified, suggesting transcriptional activity of TEs in P. ginseng.” They also took a deep dive into 88 genes to assess alternative splicing. The scientists believe this work will facilitate the discovery of novel genes in this plant and its relatives, and suggest their results will be important for crop breeding programs.
Separately, scientists at the University of Adelaide in Australia and Shanxi University of Traditional Chinese Medicine in China published “Long read reference genome-free reconstruction of a full-length transcriptome from Astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis.” A. membranaceus is also known as Huangqi and is a commonly used herb in Chinese medicine for cancers, diabetes, nephritis, and other diseases. The team studied transcriptomes of the plant’s leaf and root, identifying nearly 28,000 full-length, unique transcripts in leaf tissues and more than 22,000 in root tissues. “Compared with previous studies that used short read sequencing, our reconstructed transcripts are longer, and are more likely to be full-length and include numerous transcript variants,” the scientists report. Their analysis highlighted complex patterns of alternative splicing and enabled the detection of long-noncoding RNAs as well as characterization of biosynthesis genes. “Our study provides a practical pipeline to characterise the full-length transcriptome for species without a reference genome and a useful genomic resource for exploring the biosynthesis of active compounds in Astragalus membranaceus,” they conclude.
Finally, “Transcriptome Profiling Using Single-Molecule Direct RNA Sequencing Approach for In-depth Understanding of Genes in Secondary Metabolism Pathways of Camellia sinensis” comes from scientists at Anhui Agricultural University in China. They were investigating the biosynthesis of metabolites such as flavonoids and caffeine in this tea plant, performing Iso-Seq analysis of eight different tissues. The approach led to the identification of 94 full-length transcripts, plus four alternative splicing events, associated with the biosynthesis of the compounds of interest. “The longer reads improved the quality and accuracy of transcripts generated from [a previous] short-read assembly,” the scientists report. The new transcripts “provide a more accurate depiction of gene transcription and will greatly improve C. sinensis genome annotation in the future.”
We’re pleased to be teaming up with the Wellcome Trust Sanger Institute on a project to celebrate their twenty-fifth anniversary: generating high-quality genome assemblies for 25 species that are integral to the ecosystems found in the United Kingdom.
For this work, Sanger scientists will use the Sequel System and complementary technologies to produce reference-grade assemblies. Twenty organisms have already been selected, and the last five will be chosen by a public vote reminiscent of our own SMRT Grant program, which earlier this year saw dingo beat out bombardier beetle, sea slug, temple pitviper, and pink pigeon for the coveted sequencing prize. Have a look at the Sanger finalists!
From dormouse to dragonfly, hogweed to hornet, the 25 species represent the considerable natural biodiversity of the UK. Organisms run the gamut from endangered to iconic, thriving to dangerous.
In a statement announcing this celebration, Sanger’s Associate Director Julia Wilson said, “Through sequencing these 25 genomes, scientists will gain a better understanding of UK species, how they arrived here, their evolution and how different species are adapting to a changing environment. The results could reveal hidden truths in these species, and will enable the scientific community to understand how our world is constantly changing and evolving around us. We want to celebrate the 25th anniversary of the Sanger Institute in a special ‘Sanger’ way, and I am excited to see how the 25 genomes project unfolds.”
Voting for the final five species runs through December 8th. We’d like to congratulate the Sanger team on a remarkable 25 years, and we look forward to seeing the results of this latest project!
Scientists in Singapore, Hong Kong, and Malaysia recently reported the high-quality draft genome assembly of Durio zibethinus, a type of durian fruit commonly eaten in Southeast Asia. The team used SMRT Sequencing and chromosome mapping techniques to produce the assembly, which will be an important tool for agricultural monitoring.
“The draft genome of tropical fruit durian (Durio zibethinus)” was published in Nature Genetics by lead authors Bin Tean Teh, Kevin Lim, Chern Han Yong, Cedric Chuan Young Ng, senior author Patrick Tan, and collaborators at the Duke-NUS Medical School and other institutions. They chose to study durian, a tropical fruit best known for its strong sulfuric smell, because it represents an economically important food crop in the region. “In 2016 alone, durian imports into China accounted for about $600 million, as compared to $200 million for oranges, one of China’s other main fruit imports,” the authors note. Until this project, however, there had been little genomic research into the plant or even its close relatives.
The scientists used PacBio sequencing to 153-fold coverage and applied both FALCON and FALCON-unzip to generate a haplotig-merged assembly. The 738 Mb genome was then enhanced using Chicago and Hi-C methods, increasing scaffold N50 lengths to 22.7 Mb. “The final reference assembly comprised chromosome-scale pseudomolecules, with 30 pseudomolecules greater than 10 Mb in length,” the team reports.
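Scaffold N50 is the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The short sketch below computes it from a list of scaffold lengths. The example lengths are made up and do not represent the durian assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")

# Made-up scaffold lengths in Mb: total is 290, so half is 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 after scaffolding therefore means that more of the genome sits in long, contiguous pieces.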
A comparison to gene families in other plants led the scientists to conclude that durian is most closely related to cotton and cacao, and probably shares an ancestral whole genome duplication (WGD) event with cotton. That finding means that “other subfamilies within Malvaceae … are also likely to have the cotton-specific WGD” and that the duplication event “may also have been involved in driving the evolution of unique durian traits.”
Speaking of unique durian traits, the team’s gene expression analysis of the plant revealed upregulated sulfur-related pathways as well as “durian-specific gene expansions … associated with production of volatile sulfur compounds.” Both findings offer new insight into the fruit’s odor and even suggest the biological reason for it: “Certain plants whose primary dispersal vectors are primates with more advanced olfactory systems show a shift in odor at ripening,” the authors write. “Durian—by emanating an extremely pungent odor at ripening—appears to have the characteristic of a plant whose main dispersal vectors are odor-enticed primates rather than visually enticed animals.”
The team concludes that the availability of this genome resource will have important agricultural implications. “As an example, rapid commercialization of durian has led to the proliferation of cultivars with a wide discrepancy in prices and little way to verify the authenticity of the fruit products at scale,” they write. “A high-quality genome assembly may aid in identifying cultivar-specific sequences, including SNPs related to important cultivar-specific traits (such as taste, texture, and odor), and allow molecular barcoding of different durian cultivars for rapid quality control.” They also hope the work will fuel studies of other durian varieties to characterize the plant’s natural biodiversity.
Scientists at Huazhong Agricultural University in China and collaborating institutions recently published results of an Iso-Seq analysis of allotetraploid cotton. The team’s findings are expected to be particularly useful for functional genomics, driving advances for cotton breeders as well as research biologists.
“A global survey of alternative splicing in allopolyploid cotton: landscape, complexity and regulation” was published in New Phytologist by lead authors Maojun Wang and Pengcheng Wang, senior author Xianlong Zhang, and collaborators. Existing genome assemblies for polyploid cotton were not “released with a well-annotated transcript isoform set,” the scientists write, “and so the extent and differences in [alternative splicing (AS)] of homoeologous gene transcripts remain poorly understood.”
To remedy the situation, the team used SMRT Sequencing to analyze RNA in 12 plant samples across six tissues, including root, leaf, and petal. They focused on the allotetraploid Gossypium barbadense because “allotetraploid cottons contribute to the vast majority of fibre yield every year, and their recently published genome sequences of allotetraploid cottons are of interest to both breeders and genome biologists,” they explain. As part of this project, the scientists developed a new Iso-Seq data analysis pipeline geared toward complex genomes. “This includes methods for quality control of raw data, classification of transcripts, clustering and transcriptome analysis,” they note. The tool is available for public use with step-by-step instructions.
Data analysis revealed nearly 177,000 unique, full-length transcripts of almost 45,000 gene models. Aligning to the reference genome and comparing with its annotation allowed the team to extend genes at both 5’ and 3’ ends and expand the number of transcripts linked to each gene. The previous annotation had just 20% of genes associated with more than one transcript, while the SMRT Sequencing annotation boosted that to 57%.
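The 20% and 57% figures describe the share of genes linked to more than one transcript. As a minimal sketch, assuming the annotation is available as a simple transcript-to-gene mapping, that share can be computed as follows. The identifiers are invented, and this is not the authors' pipeline.

```python
from collections import Counter

def multi_isoform_fraction(transcript_to_gene):
    """Fraction of genes that have more than one annotated transcript."""
    per_gene = Counter(transcript_to_gene.values())
    multi = sum(1 for n in per_gene.values() if n > 1)
    return multi / len(per_gene)

# Toy annotation: seven transcripts assigned to four hypothetical genes.
annotation = {
    "t1": "g1", "t2": "g1",
    "t3": "g2",
    "t4": "g3", "t5": "g3", "t6": "g3",
    "t7": "g4",
}
print(multi_isoform_fraction(annotation))  # 0.5
```

Using the rounded totals above, about 177,000 transcripts across about 45,000 gene models works out to an average of nearly four transcripts per gene.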
“These data led us to identify 15,102 fibre-specific AS events and estimate that c. 51.4% of homoeologous genes produce divergent isoforms in each subgenome,” the scientists report. Among other key findings: thanks to alternative splicing, the same gene can be regulated differently by various microRNAs.
“This study provides a rich resource of transcript isoforms for the cotton community, evolutionary biologists and provides a useful reference for other species,” the scientists conclude. “These results will facilitate future functional genomics studies and enhance our understanding of AS in polyploid species.”