This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Pop quiz: Which animal accounts for around 20% of all living mammals, harbors (yet survives) some of the world’s deadliest diseases, lives proportionately longer than humans given its body size, and helps make tequila possible?
From the tiniest bumblebee bat (Craseonycteris thonglongyai) to the large (1 kg) golden-capped fruit bat (Acerodon jubatus), the diversity and rare adaptations in bats have both fascinated and terrified people for centuries. Now, an international consortium of bat biologists, computational scientists, conservation organizations, and genome technologists has set out to decode the genomes of all 1,300 species of bats using SMRT Sequencing and other technologies.
The aim of the Bat1K initiative, as set forth by Emma Teeling of University College Dublin, Sonja Vernes of the Max Planck Institute, and 146 others in this paper in the Annual Review of Animal Biosciences, is to “catalog the unique genetic diversity present in all living bats to better understand the molecular basis of their unique adaptations; uncover their evolutionary history; link genotype with phenotype; and ultimately better understand, promote, and conserve bats.”
The large sequencing project will be accomplished in three phases, starting with representatives of each of the 21 bat families, followed by representatives of each of the 220 bat genera, and then the remaining 1,288 species. It will greatly expand upon the 14 bat genome assemblies currently available from the National Center for Biotechnology Information (NCBI) database, which are of varying quality and completeness.
“One primary goal of Bat1K is to standardize assembly strategies to provide assemblies of uniform optimal quality for the bat genomics community through combining multiple sequencing and scaffolding technologies,” the authors write. “We believe it is important not just to generate genome-level data, but to produce high-quality genome sequences that maximize the usefulness and accessibility of the data for all research fields.”
The bat clade exhibits a wide range of chromosomal variation. High-quality, chromosome-level genome assemblies across the group will allow researchers to investigate things like evolutionary trajectories of autosomal and sex chromosomes from nucleotide, syntenic, and phylogenomic perspectives.
The team is also hoping to resolve “some of the most passionate debates in science” centered around the evolutionary history of bats, which has been difficult to piece together due to an impoverished fossil record.
The information they uncover could benefit not only the research community, but the world at large. The authors argue that studying bats will enable us to address some of the most important challenges facing humanity into the next century including improving the well-being of a large and rapidly aging human population, preventing the spread of emergent infectious diseases, maintaining agricultural productivity, and restoring natural ecosystems worldwide.
Bats are suspected reservoirs for some of the deadliest viral diseases, including Ebola, SARS (severe acute respiratory syndrome), rabies, and MERS (Middle East respiratory syndrome coronavirus). But they appear to be asymptomatic and survive these infections. Figuring out why could increase our understanding of immune function and help prevent viral spillovers into humans.
Bats also exhibit extraordinary longevity—they can live up to 10 times longer than expected given their small body size and high metabolic rate. Only 19 mammal species are known to live proportionately longer than humans given their body size, and 18 of these are bats.
“Bats show few signs of senescence and low to negligible rates of cancer, suggesting they have also evolved unique mechanisms to extend their health spans, rendering them excellent models to study extended mammalian longevity and ageing,” the team writes.
By identifying bats’ cellular repair mechanisms, researchers could also gain insight into inflammatory disorders associated with autoimmune diseases, which are among the fastest growing causes of disease worldwide.
“The ability to modulate inappropriate inflammation in response to stressors without impairing immune function could improve the lives of millions,” the authors write.
Studying the genetics of echolocation, vocal learning, and sensory perception in bats could shed light on human blindness, deafness, and speech disorders, they add. And characterizing bat wing development could improve our understanding of how changes in limb developmental building blocks can lead to human limb malformations.
In regard to the ecosystem, bats perform key services. They pollinate crop species in the tropics (including agave, making possible the distillation of tequila) and disperse seeds across long distances, maintaining plant genetic diversity and aiding the regeneration of forests after clearing. They are able to breach ocean barriers, making them indispensable to isolated island ecosystems. They also feed on crop pests throughout their range; without bats, it is estimated that the United States would spend more than $3 billion a year on pesticides alone, the authors report.
“Bat1K will develop a genomic ark that can be used to benchmark the genomic health of different bat species to uncover populations in need of immediate conservation efforts,” the authors write. “Prioritization of bat genomes is not just desirable but indispensable to confront the many challenges to human well-being, ecosystem function, and biodiversity conservation we now face.”
Catch one of the Bat1K project leaders, Sonja Vernes, as a keynote speaker at the 2018 SMRT Leiden Conference, to be held in the Netherlands June 12-14. The meeting includes two back-to-back events: the SMRT Scientific Symposium and the SMRT Informatics Developers Meeting.
We’re told to avoid sugar and refined carbohydrates if we want our teeth to remain strong and cavity-free. But what is the role of microbiota in our oral health?
Cavities, or caries, actually occur as the result of bacterial infection that leads to sustained decalcification of tooth enamel and the layer beneath it, the dentin. Left unchecked, the decay can reach the tooth’s inner layer, with its soft pulp and sensitive nerve fibers, and, in some cases, can cause serious complications such as pyogenic osteomyelitis and the life-threatening bacterial endocarditis.
In addition to diet and host factors, the occurrence and development of dental caries seems to be closely related to the imbalance of the oral microbiota. With this in mind, researchers at Zhejiang University in Hangzhou, China, wanted to create a profile of oral microbiota in early childhood caries, and they turned to PacBio SMRT Sequencing to do so.
As detailed in a paper published in Frontiers in Microbiology, lead author Hui Chen, first author Yuan Wang, and colleagues identified 876 species from 13 known bacterial phyla and 110 genera in saliva samples collected from 41 Chinese preschoolers, aged 3–5 years old (21 with severe early childhood caries, and 20 who were caries-free).
A shift in the oral core microbiota was observed in the two groups, allowing the researchers to identify both protective and destructive bacteria.
“Our findings indicate that dental caries have a microbial component, which might have potential therapeutic implications,” the authors write.
At the species level, 38 species, including Streptococcus spp., Prevotella spp., and Lactobacillus spp., showed higher abundance in the caries group compared to the caries-free group. This suggests these bacteria may be risk factors for dental caries in children, the authors state.
The researchers also collected samples from the same children six months later. New cavities were developing in 5 children who were initially caries-free. Analyzing their microbiota, the researchers found that 6 species of bacteria that were abundant in the caries-free children, including Abiotrophia spp. and Neisseria spp., were much less abundant in these cases. Those bacteria were also less abundant in the initial caries group, leading the researchers to associate the strains with a healthy oral microbial ecosystem.
The authors say they chose single-molecule real-time sequencing because of its richness and resolution. Previous studies have explored the relationship between microorganisms and the development of caries; however, most of the cariogenic bacteria were only identified at the genus level, they noted.
“Species-level and even strain-level resolution is thought to be important for caries prognosis,” the authors state. “PacBio outperformed other sequencers… in terms of the length of reads, and it reconstructed the greatest portion of the 16S rRNA genome when sequencing the oral microbiota.”
At the HudsonAlpha Institute for Biotechnology, scientists are building on advances in agricultural research to power a clinical pediatric program. For this work, they’re using the Sequel System to perform whole-genome sequencing on trios of children with developmental disabilities and their parents.
HudsonAlpha researchers have been using SMRT Sequencing to resolve challenging plant genomes, deploying a Sequel System and a PacBio RS II for these complex projects. The success of that program led the institute to add a second Sequel System for clinical use.
The organization is part of the NIH-funded Clinical Sequencing Exploratory Research Program, with faculty investigator Greg Cooper leading an effort to apply whole genome sequencing to better understand the genetic basis of intellectual and developmental disabilities in children and to provide diagnostic information to affected families. More than 500 children and their parents have been enrolled in the study.
In a statement announcing this work, Cooper said, “By applying whole genome PacBio Sequencing in this study we hope to more sensitively identify all sizes of genetic variants, thereby increasing our solve rate for previously undiagnosed children. In many cases, an accurate clinical diagnosis can improve our ability to manage the child’s condition. We also anticipate that we will make novel discoveries through this work that may benefit many families beyond those directly tested here.”
The group’s efforts to diagnose children using short-read sequencing technology have achieved a success rate of about 30 percent, but it is widely known that these platforms are unable to detect certain types of variation that contribute to disease. Structural variants such as repeat expansions and copy number variations are larger and more complex than short-read sequencers can resolve, and likely represent some of the cases that have gone undiagnosed. With PacBio long-read sequencing, scientists may be able to produce diagnostic answers for cases that have proven intractable with other technologies.
“We believe projects like HudsonAlpha’s CSER program to help solve undiagnosed genetic disease in children are among the most important and rewarding uses for our technology,” stated Kevin Corcoran, our Senior Vice President for Market Development. “We look forward to seeing how PacBio sequencing can both improve their diagnostic success rate as well as support new discoveries.”
Revered around the world, rice is a staple food for nearly half of the world’s population. But as that population grows, rice breeders are faced with the challenge of producing crops that are high-yielding, disease-resistant and nutritious, while at the same time being more sustainable.
The International Oryza Map Alignment Project (OMAP) was initiated in 2003 to develop a set of high-quality genomic resources for the wild relatives of rice that could be used as a resource to discover and utilize novel genes, traits and/or genomic regions for crop improvement and basic research.
Members of the consortium recently released new reference assemblies for six wild Oryza species (O. nivara, O. rufipogon, O. barthii, O. glumaepatula, O. meridionalis and O. punctata), two domesticates (O. sativa vg. indica (IR 8) and O. sativa vg. aus (N 22)) and the closely related outgroup species L. perrieri.
In a paper published in Nature Genetics, “Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza,” senior author Rod A. Wing, of the Arizona Genomics Institute at the University of Arizona, first author Joshua C. Stein, of Cold Spring Harbor Laboratory, and colleagues from 17 other institutions describe what they found when analyzing the new assemblies and comparing them to four previously published genomes.
Among the major findings were the identification of several disease resistance genes and haplotypes, which could support the breeding of new varieties for natural resistance to growing pathogen threats such as blast (Magnaporthe oryzae).
“Our sequencing of seven wild relatives of crop species opens a treasure trove of novel resistance haplotypes and loci to sustain this strategy,” the authors write.
“The practical utility of our resources is directly demonstrated by our identification of a strong candidate for the long-sought Pi-ta2 locus, which in combination with Pi-ta provides broad-specificity resistance to M. oryzae,” they add.
The study is also the first to contain a complete long-read assembly of IR 8 ‘Miracle Rice’, which relieved famine and drove the Green Revolution in Asia 50 years ago.
And it should prove valuable for the study of molecular evolution. As the authors note, the new dataset represents “a genome-wide vista of the results of ~15 million years of both natural and artificial selection on a single genus.”
Over this time period, the Oryzeae have maintained a base chromosome number of 12, despite their global distribution and bursts of transposable element diversification that, in some cases, led to doubling of genome sizes, the study found.
The reference genomes span the species tree, and were used to resolve several areas of the Oryza phylogeny. Their genome-based age estimates imply a “remarkably rapid diversification rate” (~0.50 net new species/million years), placing it on par with many rapidly diversifying taxa in island and continental hotspots, the authors state.
“Our phylogenomic work illustrates both the challenges of inferring species phylogenies in closely related plant taxa—incomplete lineage sorting, hybridization and introgression—and the power of whole-genome sequences to untangle the resulting phylogenetic discordance,” they write.
The amount and richness of data provided by long-read sequencing led to “a much more nuanced view… that reflects the mosaic history of different parts of the genome,” the authors add.
A publication in the journal Molecular Plant demonstrates the use of SMRT Sequencing to characterize the activity of transposable elements in Magnaporthe oryzae, the destructive fungus responsible for rice blast disease. This information will help scientists better understand the pathogen’s biology and potentially find new ways to reduce its impact on an important food source.
Lead authors Jiandong Bao, Meilian Chen, Zhenhui Zhong, Wei Tang, senior author Zonghua Wang, and collaborators at Fujian Agriculture and Forestry University and Minjiang University report their findings in “PacBio Sequencing Reveals Transposable Element as a Key Contributor to Genomic Plasticity and Virulence Variation in Magnaporthe oryzae.”
They embarked on the study because “the sustainable cultivation of rice, which serves as staple food crop for more than half of the world’s population, is under serious threat due the huge yield losses inflicted by the rice blast disease,” they write. Until this project, however, some 50 previous short-read genome assemblies were not of sufficient quality to support the kinds of in-depth investigations required to understand the pathogen’s genetic mechanisms or variation across species. These assemblies “are highly fragmented and lack most of the lineage-specific (LS) regions which are more plastic than the core genome and enriched with repeats and effector proteins,” the scientists explain.
To build a better assembly, the team applied PacBio long-read sequencing to the challenge. They produced high-quality, nearly complete genome representations for two M. oryzae isolates. The resulting assemblies were far more contiguous than previous ones, with contig N50s increased to 3.28 Mb and 4.13 Mb, compared to 180 kb and 156 kb respectively for short-read assemblies. That led to a “>95% reduction in genome fragmentation,” the scientists report, and “approximately 98% of the PacBio assembled contigs were longer than 100 kb.” Alignment to the reference genome filled about 70% of sequence gaps and “confirmed that PacBio assemblies have sufficient genome coverage and superior integrity,” the team adds.
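For readers less familiar with the contig N50 statistic quoted above, here is a minimal sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in kb (hypothetical, chosen only to show the effect).
fragmented = [200, 180, 150, 120, 100, 90, 80, 80]
contiguous = [4000, 3000, 2000, 500, 300, 200]

print(n50(fragmented))  # 150
print(n50(contiguous))  # 3000
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is a convenient shorthand for contiguity. It says nothing on its own about correctness, which is why the authors also checked the assemblies against the reference genome.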
Importantly, the PacBio assemblies were about 10% larger than the short-read assemblies. Analysis of this “showed that the increased size of PacBio assembled genome was not accompanied by a corresponding increase in the number of new genes, but was as a result of significant increase in the recovery of repeat sequence,” the scientists write. That new content included many transposable elements, with some entirely novel elements detected. The scientists also analyzed the effects of transposable elements and determined that they “play a key role in regulating genomic plasticity, promote chromosome rearrangement and presence/absence polymorphism of [secreted protein] genes,” the team writes.
This study offers strong validation of the importance of transposable elements in pathogen virulence and demonstrates the utility of SMRT Sequencing for achieving high-quality assemblies to fully represent these elements.
In an exciting new Cell paper, scientists report identification of an intronic structural variant that causes a neurodegenerative Mendelian disorder that primarily affects people on the island of Panay in the Philippines. The team used a number of approaches, including SMRT Sequencing and the Iso-Seq method, to solve the medical mystery.
“Dissecting the Causal Mechanism of X-Linked Dystonia-Parkinsonism by Integrating Genome and Transcriptome Assembly” comes from lead authors Tatsiana Aneichyk, William Hendriks, Rachita Yadav, David Shin, and Dadi Gao; senior authors Cristopher Bragg and Michael Talkowski; and many collaborators at Massachusetts General Hospital, the Broad Institute, and other organizations.
The team targeted X-linked dystonia-parkinsonism (XDP), “an adult-onset neurodegenerative disease that has challenged conventional gene discovery for several decades.” Endemic to the island of Panay, the progressive disease was previously associated with several genetic variants, but none were deemed definitively causative. Scientists attribute that in part to a lack of solid annotation for this genomic region.
“We investigated XDP as an exemplar of an unsolved Mendelian disorder arising from a founder haplotype in an isolate population,” the team writes. “We hypothesized that the genetic diversity of XDP has not been captured by previous approaches and that unbiased assembly of the genome and transcriptome spanning the XDP haplotype could reveal additional sequences or aberrant transcripts unique to probands.” While most Mendelian analyses to date have used exomes or whole genome sequencing on short-read platforms, the disease-causing variation in the case of XDP is difficult to detect by these methods. The team therefore applied a bevy of sequencing and analysis tools (including SMRT Sequencing, hybridization capture sequencing and scaffolding technologies) to study a large cohort of about 800 individuals, most of them affected males or carriers.
“Our results identified previously unknown genomic variants and assembled transcripts that were shared among XDP probands, but not observed in controls, including aberrant splicing and partial retention of intronic sequence proximal to the disease-specific SVA [SINE-VNTR-Alu retrotransposon] insertion in TAF1,” the scientists report. SMRT Sequencing of BAC clones from a proband generated a 200 kb region spanning TAF1, assembling the full SVA sequence.
TAF1 is a general transcription factor encoded on the human X chromosome and is expressed in all tissue types, but in the case of XDP, a portion of the mRNA transcript is spliced in a non-functional manner within the intron containing the SVA. The team followed up on this observation with CRISPR/Cas9 editing to remove the SVA sequence in cell models derived from patient samples. Removal of the SVA by gene editing “rescued this XDP-specific transcriptional signature and normalized TAF1 expression,” proving that this mobile element really is the causal agent, the authors write.
“These data suggest that XDP may join a growing list of human diseases involving defective RNA splicing, [intron retention], and transcriptional alterations driven by transposable elements,” the team concludes. “These studies also illustrate the potential for layered genomic analyses to provide a roadmap for unsolved Mendelian disorders that is capable of simultaneously capturing coding and noncoding regulatory variation and interpreting their functional consequences in human disease.”
Genomic data standards will be essential for continuing the growth of genomics and ensuring its smooth transition into the clinic, according to a new Bio-IT World article written by PacBio scientist Aaron Wenger. The piece nicely sums up recent efforts from the Genome in a Bottle Consortium, the Genome Reference Consortium, and the Global Alliance for Genomics and Health to paint a picture of the state of genomic data standard development today.
“The more we learn about the human genome, the more needs we identify for data standards,” Wenger reports. “For example, early efforts focused on ensuring that single nucleotide variant (SNV) calls could be tested for accuracy; today we know that structural variants, which are responsible for the vast majority of base pair differences between any two people, are just as critical to call with precision.”
While the consortia and other collaborative programs working to fine-tune data standards are doing excellent work, Wenger notes that it is important that the initiatives avoid multiple, possibly competing, standards. “It will be essential for these large consortia to collaborate with each other to ensure that appropriate data standards are available for various needs (such as supporting both SNVs and structural variants) and that the specific use for each is clearly defined,” he writes.
He concludes with a call to all genomic scientists to get involved in these types of efforts. Data standards “will help the genomics community cross an important threshold, from the realm of pioneering tinkerers to a robust, reliable, and highly accurate science that can be readily applied both in research and in the clinic,” Wenger adds.
Genetic knowledge is powerful when it comes to breeding. The ability to trace desirable traits to the gene level can help create plants and animals that are adapted to existing and emerging challenges, such as temperature tolerance, productivity, or disease resistance.
By crossing two breeds of cattle, Angus (Bos taurus taurus) and Brahman (Bos taurus indicus), from opposite ends of the species spectrum, breeders can benefit from the Angus’s high productivity in cool environments and the Brahman’s tolerance for harsh, hot climates and the diseases and parasites found there.
Genetically and phenotypically, the two subspecies are very different. Their offspring are as well, as John Williams of the University of Adelaide explained in his recent talk at PAG. There are even differences depending on how the breeds are crossed (i.e. Angus bull and Brahman cow, or Brahman bull and Angus cow). Fetal weight at mid-gestation, for instance, varies markedly among purebreds and crosses, and between crosses.
Interested in exploring these differences, Williams and colleagues embarked on two approaches to assembling this heterozygous genome.
The first was a multi-technology methodology involving PacBio long-read sequencing, assembled with FALCON-Unzip and scaffolded with Hi-C data, to examine the genome of an F1 cross-breed (Angus x Brahman). The Iso-Seq method was then used to explore the cattle’s transcriptome. It enabled the team to examine entire transcripts, as well as isolate 30,000 isoforms from 12,000 genes.
Although they are still sifting through the data, Williams said the team is “starting to be able to differentiate between Angus and Brahman specific transcripts.”
“Initial results show that Iso-Seq data can be haplotyped and is highly concordant with genome phasing results, revealing possible allelic-specific isoform expression,” he added.
Mapped back to the assembly, the Iso-Seq data also confirmed that the F1 cattle reference genome is of good quality.
Among the genes they explored, 10 were heavily differentially expressed between males and females. The team wanted to drill down deeper, to determine which parent of origin the differences come from, and to create better assemblies of sex-specific genes.
So, a subgroup led by Adam Phillippy, Sergey Koren, and Arang Rhie of the National Human Genome Research Institute (NHGRI) in Bethesda, Maryland, created a new process that took advantage of access to the cattle’s parents.
Trio Binning: Two Genomes From One Individual
The “trio binning” process, also presented at PAG, enabled them to generate two high-quality (maternal and paternal) genomes from the single F1 cross-breed. It uses short reads from two parental genomes to partition SMRT Sequencing long reads from an offspring into haplotype-specific sets prior to assembly. Each haplotype is then assembled independently using a new module of the Canu assembler the NHGRI team created — TrioCanu — resulting in a complete diploid reconstruction.
As described in this preprint, the method requires moderate coverage of short sequencing reads (e.g. 30-fold Illumina) from two parental genomes to identify short, k-length subsequences (k-mers) that are specific to each parent. These k-mers are presumed to be specific to the corresponding haplotypes of the offspring. Next, long reads are collected from an offspring of the parents to sufficiently cover both haplotypes (e.g. 80-fold PacBio, 40-fold per haplotype). Long reads from the offspring are then binned into paternal and maternal groups based on the presence of the haplotype-specific k-mers and assembled separately.
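The binning step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the TrioCanu implementation: the function names (`kmers`, `parent_specific_kmers`, `bin_read`), the tiny k of 4, and the sequences are all hypothetical, and the sketch ignores reverse complements, sequencing errors, and the normalization a real tool would need.

```python
def kmers(seq, k):
    """Return the set of all k-length subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets from parental short reads."""
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign an offspring long read to the parent whose specific k-mers it carries more of."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"  # no informative k-mers, or a tie


# Hypothetical parental short reads that differ at a single base.
K = 4
pat_only, mat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], K)

# Hypothetical offspring long reads, binned before assembly.
offspring_reads = ["CGTACGTA", "CGTTCGTA", "ACGTAA"]
bins = [bin_read(r, pat_only, mat_only, K) for r in offspring_reads]
print(bins)  # ['paternal', 'maternal', 'unassigned']
```

Each bin would then be handed to the assembler separately. Reads that carry no parent-specific k-mers come from regions where the two haplotypes are identical at this k, so a real pipeline has to decide what to do with them; the sketch simply labels them unassigned.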
In the case of the cattle, the Angus and Brahman haplotypes aligned to one another with 99.35% identity and contained 25,245 haplotype-specific structural variants and 124 inversion breakpoints.
Phillippy et al. note that trios have long been used in genomics to infer inheritance, including for the HapMap and the 1000 Genomes projects, as well as by trio-sga to simplify heterozygous diploid genome assembly. But reliance on short-read sequencing limited the haplotype-specific contigs (haplotigs) to an average size of a few kilobases.
“In contrast, our long-read method enables the assembly of multi-megabase haplotigs and complete parental haplotypes,” the authors write.
Long-read trio binning is also advantageous because it requires fewer resources than inbreeding, simplifies assembly graphs, and can accurately reconstruct structurally heterozygous alleles that can be important factors in adaptation and immunity, Phillippy states.
“Accurate representation of haplotypes is essential for studies of intraspecific variation, chromosome evolution, and allele-specific expression,” the authors add.
Its applications could also extend to human genomics and to other areas of agricultural genomics, including polyploid plant genomes.
“Reference genome projects have historically selected inbred individuals to minimize heterozygosity and simplify assembly,” the authors write. “We challenge this dogma and present a new approach designed specifically for heterozygous genomes.”
A new preprint from scientists at Uppsala University’s SciLifeLab reports the de novo genome sequencing and assembly of two Swedish individuals using PacBio SMRT Sequencing. By comparing the Swedish genomes to the human reference (GRCh38), the team found a substantial amount of novel sequence which is not present in the reference – along with over 17,000 structural variants. Further comparison of the Swedish genomes to other population-specific reference genome assemblies – including a Korean and a Chinese genome – identified novel sequences that appear to be population-specific as well as several megabases that seem to be more universal in the human genome.
“De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data” comes from lead author Adam Ameur, senior author Ulf Gyllensten, and collaborators. They aimed to produce higher-quality genome assemblies than are possible with short-read sequencing technologies. “PacBio’s single-molecule real-time (SMRT) sequencing technology has proven to be an excellent method for de novo genome assembly,” they write. “The human de novo assemblies available based on long-read data … indicate that each personal genome contains a significant amount of ‘dark matter’ of structural variation that is not detected by short-read WGS.”
The scientists generated about 78-fold SMRT Sequencing coverage of each genome (one male and one female), followed by optical mapping for scaffolding. The PacBio-only assemblies were highly contiguous: the authors report that each one “contained about 3,000 primary contigs and an additional 4,000 alternative contigs originating from regions with high heterozygosity.” For primary contigs, N50 values were 9.5 Mb and 8.5 Mb.
They then compared the assemblies to each other, to GRCh38, and to other de novo assemblies recently produced with SMRT Sequencing. More sequence matched between the two Swedish genomes than with GRCh38, “suggesting that the [human] reference does not contain all sequences present in these Swedish individuals,” the scientists report, citing about 10 Mb absent from the reference. Of the novel sequences discovered, about 6 Mb aligned to a Chinese genome assembly, indicating that much of the data missing from GRCh38 is not specific to the Swedish population. Novel sequences had the typical hallmarks that make detection difficult for short-read sequencers: they “are highly repetitive, have elevated GC-content and are primarily located in centromeric or telomeric regions,” according to the preprint.
“Inclusion of these novel sequences into the GRCh38 reference radically improves the alignment and variant calling of whole-genome sequencing data at several genomic loci,” the scientists add. By re-analyzing 200 samples from a short-read-based Swedish population study, the team found more than 75,000 putative novel SNVs in each person and removed 10,000 SNV calls per person that had been false positives.
“The benefits of an improved reference are likely to be even stronger for other, non-European, population groups that were poorly represented in the original assembly of GRCh38. Despite all efforts to refine the human genome since its original release in 2001, our results indicate that substantial improvements could still be made … by de novo assembly of representative human genomes from different populations,” the scientists conclude.
A new publication in the journal of the American Society of Gene & Cell Therapy demonstrates the novel use of SMRT Sequencing for improving the safety and quality of a type of gene therapy delivered through a viral vector. The work established new quality control standards and revealed previously undetected risks for gene therapy delivery.
“Adeno-Associated Virus Genome Population Sequencing Achieves Full Vector Genome Resolution and Reveals Human-Vector Chimeras,” published in Molecular Therapy — Methods & Clinical Development, comes from lead author Phillip Tai, senior author Guangping Gao, and collaborators from the University of Massachusetts Medical School and other institutions. The team aimed to characterize the adeno-associated virus (AAV) vectors that have gained traction recently as delivery vehicles for gene therapy. Previously, there had been no reliable, comprehensive method for assessing the integrity of those AAV vectors prior to introducing them to patients.
The team used PacBio long-read sequencing to analyze a population of AAV vectors, providing the first high-resolution view of the DNA sequences they would deliver to a patient. They found more problems than expected: some vectors contained less than half of the DNA sequence they should have, while genetic errors called chimeras were discovered in other vectors. In a patient, these vectors would have reduced the likelihood of successful gene therapy outcomes.
With this approach, known as AAV-genome population sequencing or AAV-GPseq, “we can comprehensively profile packaged genomes as a single intact molecule and directly assess vector integrity without extensive preparation,” the scientists report. This detailed view allowed the team to spot reverse-packaged genomes, chimeras with inverted terminal repeat-containing sequences, and truncated genomes. “These discoveries redefine quality control standards for viral vector preparations and highlight the degree of foreign products in [recombinant AAV]-based therapeutic vectors,” the team writes.
The full AAV-GPseq method is available and “can be easily adapted for research-grade and clinical vector manufacturing QC pipelines,” the scientists add.
A new preprint from scientists at the University of California, Davis, demonstrates the usefulness of SMRT Sequencing and the Iso-Seq method to enable comprehensive transcriptome analysis even in the absence of a reference genome assembly. The team conducted this study on the grape used to make Cabernet Sauvignon.
Lead author Andrea Minio, senior author Dario Cantu, and collaborators used SMRT Sequencing to generate full-length transcripts of Vitis vinifera cv. Cabernet Sauvignon, analyzing RNA collected from four replicates at each of four important stages of ripening. Previously, this team used PacBio sequencing to generate a highly contiguous genome assembly for the wine grape. In this new effort, they focused on characterizing the networks associated with metabolism and berry development.
Cantu and his team are used to studying plants that lack reference-grade genome assemblies. They adopted the Iso-Seq method for this new project as a means of evaluating it for use on other targets without references in the future. “Iso-Seq is an ideal technology for reconstructing a transcriptome without a reference sequence and for resolving isoforms,” they report in the preprint, noting that full-length transcripts capture splice variants and noncoding RNAs. Their wine grape analysis yielded more than 170,000 transcripts associated with 13,402 genes. “Full-length transcripts refined approximately one third of the gene models predicted using several ab initio and evidence-based methods,” the scientists write. “The Iso-Seq information also helped identify 563 additional genes, 4,803 new alternative transcripts, and the 5’ and 3’ UTRs in the majority of predicted genes.”
The results indicated significant changes in gene expression depending on developmental stage. Just a quarter of loci were represented at all four stages tested; a third of loci were detected only at specific stages, “confirming the importance of collecting different stages of development to capture the complexity of the berry transcriptome,” the authors note.
“This study demonstrates that Iso-Seq data can be used to compile a comprehensive reference transcriptome that represents most genes expressed in a tissue undergoing extensive transcriptional reprogramming,” the scientists conclude. “The pipeline described here can be of even greater value for projects aiming to reconstruct the gene space in plant species with complex and large genomes that have not been resolved yet.”
Corals are critical to sustaining sea life in many parts of the world, contributing to an elaborate ecosystem that lives in and around their mineralized calcium carbonate skeletons. In addition to hosting photosynthetic endosymbionts in exchange for energy, corals harbor a diverse microbial community. What role does this microbial metagenome play in the health of the coral reef, especially during thermal challenges induced by climate change?
Alexander Shumaker of Rutgers University will get a chance to investigate this question, thanks to long-read sequencing provided by PacBio and a Certified Service Provider, the University of Maryland’s Genomics Resource Center (GRC) at the Institute for Genome Sciences (IGS).
Shumaker won the 2017 Microbial SMRT Grant; his proposal was selected from many entries in an extremely close competition launched during the American Society for Microbiology annual meeting last summer.
Shumaker, a member of the Debashish Bhattacharya Lab, and collaborators in Hawaii and South Korea have previously used PacBio SMRT Sequencing to generate a high-quality genome assembly from Montipora capitata (rice coral), identifying the major genes involved in the coral stress response, their expression during development, and the evolutionary history of this lineage.
The lab is currently using RNA-seq methods to investigate further the extent of resilience and adaptation in M. capitata under the major stress factors associated with climate change. Shumaker now wants to turn his attention to the microbiome.
“The potential for interactions between the host animal and its microbial community, and between the diverse members of this community to contribute to holobiont health is poorly understood,” he writes in his proposal.
Shumaker will be working with samples from neighboring sites in a reef in the Hawaiian Archipelago collected over the course of six months, during which there was a natural temperature-induced coral bleaching event. In coral bleaching, high temperatures lead corals to expel the symbiotic algae that live within them. The corals can recover if the algae take up residence again before the corals succumb to disease or stress. Using shotgun metagenome sequencing, he hopes to characterize the symbiotic microbial community of M. capitata and understand whether differences in the coral microbiome may have been a factor in the recovery of colonies that survived.
“This platform will provide greater taxonomic resolution (due to longer reads) of the M. capitata microbiome and allow us to reconstruct genomes and assign functions to the microbial community resident in this species,” he writes.
Congratulations to the Bhattacharya Lab, and we look forward to the results!
Check out additional coverage of the award-winning project on the Rutgers University site.
Today we’re pleased to announce the release of a new version of Sequel Software (V5.1) and a new polymerase. Combined, these upgrades increase throughput and overall performance for key SMRT Sequencing applications such as de novo assembly, structural variant detection, targeted sequencing, and RNA sequencing using the Iso-Seq method. Orders for the new products can be submitted today.
With this release, the Sequel System can achieve up to 10 Gb per SMRT Cell for de novo genome assembly, effectively doubling the throughput when using ultra-long inserts (>40 kb). For targeted and RNA sequencing, customers can achieve up to 20 Gb per SMRT Cell.
For human whole genome sequencing (WGS) studies, the new improvements support sensitive detection of structural variants with as little as 5- to 10-fold coverage per individual. As a result, customers can now complete low-cost WGS studies in thousands of individuals using fewer SMRT Cells. Sequel System users can obtain 10-fold coverage using as few as 4 SMRT Cells, and population genetics projects can be conducted with 5-fold coverage using as few as 2 SMRT Cells. Check out our structural variation project calculator to inform your study design.
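As a rough illustration of the arithmetic behind these SMRT Cell counts, here is a minimal sketch that estimates the number of cells from genome size, target coverage, and yield per cell. The function name is hypothetical. The 3.1 Gb human genome size is an assumed round figure, and the 10 Gb yield is the best-case per-cell number quoted above for de novo assembly; this is not the logic of the project calculator itself.

```python
import math


def smrt_cells_needed(genome_size_gb, fold_coverage, yield_per_cell_gb):
    """Estimate how many SMRT Cells are needed to reach a target coverage.

    Required bases = genome size x fold coverage; the cell count is that
    total divided by the per-cell yield, rounded up to a whole cell.
    """
    if min(genome_size_gb, fold_coverage, yield_per_cell_gb) <= 0:
        raise ValueError("all inputs must be positive")
    required_gb = genome_size_gb * fold_coverage
    return math.ceil(required_gb / yield_per_cell_gb)


HUMAN_GENOME_GB = 3.1    # assumption: approximate human genome size
YIELD_PER_CELL_GB = 10   # assumption: best-case yield quoted in the post

for coverage in (10, 5):
    cells = smrt_cells_needed(HUMAN_GENOME_GB, coverage, YIELD_PER_CELL_GB)
    print(f"{coverage}-fold coverage: {cells} SMRT Cells")
```

Under these assumptions the estimate works out to 4 cells for 10-fold coverage and 2 cells for 5-fold coverage, in line with the figures quoted in the post. Real runs yield less than the best case, so a study design should budget with a conservative per-cell yield.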
For long amplicons (>3 kb), the new polymerase increases the number of high-quality sequences per SMRT Cell, reducing costs for HLA sequencing and other targeted applications. Also, the analytical workflow for multiplexed samples has been simplified through the software update.
Read length data shown above are from an E. coli 35 kb size-selected library prepared using the Express kit and run on a Sequel System with 2.1 Chemistry and 5.1 Sequel System Software.
Dr. Nezih Cereb, CEO and co-founder of HistoGenetics, was one of several early access users for this release. “By introducing SMRT Sequencing for routine HLA testing, HistoGenetics set the new gold standard. To date, we have run 475,000 samples on PacBio systems in our high-volume HLA typing lab. With these new Sequel System improvements, we are achieving increased throughput as well as saving time.”
Our SVP of Market Development, Kevin Corcoran, said: “We remain focused on enhancing key applications and, with increased throughput, empowering users to perform large-scale studies at a price per sample that meets their research or commercial objectives.”
From wild animals to perfect pets, dogs have undergone some interesting changes during their centuries-long domestication. Intent on unraveling some of the developmental secrets of the process, a team of scientists from the University of New South Wales in Sydney, Australia, is doing deep dives into the genomes of a range of canine cousins along the evolutionary chain.
A desert dingo named Sandy has already provided some insight into the process after its genome was sequenced as part of the 2017 Plant and Animal SMRT Grant.
Study leader Bill Ballard explained in this presentation at PAG 2018 that pure dingoes are intermediates between wild wolves and domestic dogs, with a range of domestication traits. One example is the amylase (AMY2B) locus: the duplication common in domestic canines, which is associated with adaptation to starch in the diet, is absent in the dingo.
Being able to study structural variation was key, so Ballard was delighted to use SMRT Sequencing to generate a 60-fold coverage long-read contig assembly. Working with program sponsors the Arizona Genomics Institute and Computomics (in Tübingen, Germany) and adding 10X Genomics Chromium linked-reads and Bionano Irys maps, his team was able to create a full 2.46 Gb assembly, with the longest scaffold reaching 123.2 Mb (spanning the entire chromosome 1 of the domestic dog genome).
His team is just beginning to delve into the data they generated from sequencing Sandy’s genome, but they have already noted differences in chromosomal loci. To fully annotate the dingo genome, they are also sequencing the transcriptome of three tissues and the epigenome.
The UNSW team will then compare the dingo genome with those of the grey wolf and the German Shepherd, which they also plan to generate with long reads. This will help shed further light on the domestication process.
Blood, saliva and hip x-rays have been collected from more than 400 German Shepherds. The team hopes one dog in particular, Kira, will provide additional insight into a debilitating dog condition, hip dysplasia. The seven-year-old female has good hips, and her full DNA sequence could help identify why some dogs get the condition while others do not.
She has become the mascot of the project, and the poster dog for their Hip2Fit crowdfunding campaign. They are aiming to collect $50,000 AUS to cover the costs of PacBio sequencing and assembly and Bionano scaffolding.
“This work is critical to help ease the pain associated with this condition and to ensure that service dogs are able to help their humans for as long as possible,” Ballard notes on the fundraising page.
Identification of DNA variations that cause hip dysplasia in German Shepherds will also help future studies in other breeds that have hip-associated problems, including golden retrievers, St Bernards, labradors and rottweilers, he adds. Like the dingo project, it will also advance our overall knowledge of canine genetics and development.
The last day of February each year is designated as Rare Disease Day, a unique opportunity to recognize people who sometimes seem to be forgotten by the mainstream medical community. Once again PacBio is an official sponsor of the day, which will be marked with awareness-raising events in 80 countries around the world. It’s a beautiful way to remember the hundreds of millions of people affected by a rare disease, as well as the caretakers, researchers, and clinicians who work so hard to make their lives better.
The thing about rare diseases is that, while each individual disease might affect a vanishingly small fraction of the population, collectively they affect an estimated 350 million people around the world. That’s a lot of reasons to find new ways to solve rare disease. There are more than 7,000 rare diseases known, with more being discovered all the time. You can get a glimpse of the impact on Twitter today by searching #ShowYourRare, or listen to this podcast from Howard Jacob for an insider’s look at working in this community.
The rare disease community is especially important to us because of our strong belief that SMRT Sequencing will be a powerful tool for resolving so many phenotypes that remain undiagnosed. This was recently validated through the newly funded SOLVE-RD research program, which plans to use long-read sequencing to look for the genetic cause of rare diseases in 500 genomes that remain undiagnosed despite previous analysis with other genomic technologies. Identifying the genetic cause of a rare disease is an important first step toward understanding the disease mechanism – which can ultimately lead to new therapy development and treatment options.
Toward this end, scientists have made great strides in discovering the genetic cause of many rare diseases through advances in sequencing methods, including whole exome sequencing. These methods are now routinely used to look for coding changes in genes, typically caused by SNPs, but there is plenty of evidence to suggest that the next wave of discoveries will come from looking beyond these small coding variants for larger structural variants. These variant types include repeat expansions, deletions, duplications, insertions such as mobile elements, and other structural variants (>50 bp in length) that are frequently missed by short-read sequencers. You can check out our informative short video tutorial on structural variation.
Our users have proven again and again that SMRT Sequencing generates a comprehensive view of all variant types and sizes across the human genome, detecting structural variation that has clear clinical implications. A study published this week in Cell provides a nice example of how a non-coding structural variant is causative for a rare Mendelian disorder, X-Linked Dystonia-Parkinsonism. This follows a similar study where a structural variant, in this case a mobile element insertion in the dystrophin gene, was found to be the cause of Becker Muscular Dystrophy in an individual from Portugal. These studies reinforce the need to comprehensively identify structural variants in the human genome to increase the solve-rate for rare disease.
In another study, published in 2017, Euan Ashley’s team at Stanford University described a novel method using low-fold, long-read PacBio sequencing of the whole human genome for an individual with a previously undiagnosed disease. By calling structural variants across the whole genome, Ashley and his team found six novel variants in disease-relevant genes. One of those genes was linked to Carney syndrome, which was later validated as the correct diagnosis for the individual. We love this story because it offers a glimpse of how much we might learn about rare disease simply by taking advantage of highly accurate, long-read sequence data.
To see more examples of how SMRT Sequencing users have already made inroads in rare disease research, check out our blog post from Rare Disease Day 2016. And don’t miss out on the opportunity to participate in today’s festivities! Check out these global efforts and US-based initiatives to get involved.
For more information on sequencing structural variants for disease gene discovery, you can check out our upcoming Nature webcast.
February 21, 2018
Congratulations to Winston Timp’s team on the publication of their Iso-Seq analysis of hummingbird! The paper is now available at GigaScience.
April 4, 2017
A new preprint offers an enticing look at transcriptome results from analysis of a hummingbird using SMRT Sequencing. In this study, scientists found new clues to explain unique attributes of the bird’s metabolism. The work was made possible through full-length isoform sequencing, which allowed deep, assembly-free analysis even though no reference genome was available.
“Single molecule, full-length transcript sequencing provides insight into the extreme metabolism of ruby-throated hummingbird Archilochus colubris” is now available on BioRxiv. From Rachael Workman, Alexander Myrka, Elizabeth Tseng, William Wong, Kenneth Welch, and Winston Timp, the paper describes a project designed to better understand how hummingbirds switch metabolic gears to focus on sugars or lipids as needed. “This metabolic flexibility is remarkable both in that the birds can switch between exclusive use of each fuel type within minutes,” they write, “and in that de novo lipogenesis from dietary sugar precursors is the principle way in which fat stores are built, sometimes at exceptionally high rates, such as during the few days prior to a migratory flight.”
The team used the Iso-Seq method with long-read PacBio data to generate full-length isoform sequences, focusing on the liver of Archilochus colubris. According to the paper, this represents “the first high-coverage transcriptome of any single avian tissue.” They also aligned transcripts to Calypte anna, a recently completed hummingbird assembly that also made use of SMRT Sequencing.
Workman et al. report that the use of long-read PacBio data allowed for more accurate views of isoforms and alternative splicing, even without a reference genome. “Using full-length transcript data, we found alignment unnecessary to generate clear pictures of the gene isoforms,” they note. “The long reads negate the need for transcript assembly, a precarious analysis in the absence of a genome.” Nearly half of the reads in the final analysis covered full-length genes, including the 5’ and 3’ ends as well as the polyA tail.
The team used the COGENT pipeline to assign transcripts to gene families and focus on unique isoforms. “COGENT is specifically designed for transcriptome assembly in the absence of a reference genome, allowing for isoforms of the same gene to be distinctly identified from different gene families,” the scientists write. Their analysis generated a highly diverse set of isoforms, which the authors believe “represents a nearly complete transcriptome of the hummingbird liver.”
With that dataset, the scientists found genes unique to hummingbird. “These genes showed a specific enrichment for pathways involved in lipid metabolism — suggesting that the hummingbird has evolved variants of these genes to achieve its high levels of metabolic efficiency,” they report.
The scientists note that follow-up functional assays will be an important next step in understanding and verifying the function of many genes of interest.
We can’t resist a good reference genome, so the pre-AGBT workshop entitled “Updating Reference Assemblies: New Technologies, New Sources of Diversity” was right up our alley. Hosted by the McDonnell Genome Institute, a member of the Genome Reference Consortium, the event offered conference attendees useful updates on efforts to expand the diversity of human reference genome sequences by incorporating samples from multiple continents of origin (the Americas, Africa, and Asia in addition to Europe).
NCBI’s Valerie Schneider spoke about opportunities and challenges in mining assemblies other than the current GRCh38 build. There are more human genome assemblies than ever, she said, noting that this is providing new insight into where variants are most commonly found — and also helps focus efforts to represent additional diversity. She also covered recent improvements to the GRCh38 assembly, plus a list of remaining technical challenges, while reporting that 65 new human genomes have been submitted to GenBank since GRCh38 was published. Most of those are based on PacBio data, and Schneider spoke about how those assemblies are used to help understand alternate loci and genetic variants in GRCh38. Going forward, she indicated that assemblies from people of African descent are still needed, offering a major opportunity for improvement.
Tina Graves Lindsay from the McDonnell Genome Institute continued the diversity theme, showing how her team relies on a strategy of 60-fold coverage with PacBio long reads paired with scaffolding technologies to produce reference-grade assemblies. By sequencing genomes from underrepresented ethnicities, including Gambian and Yoruban assemblies she shared, her group has successfully resolved conflicts in GRCh38.
Ed Green from the University of California, Santa Cruz, spoke about updating reference genomes with proximity ligation techniques such as Hi-C and Chicago. The approaches are analogous to mate-pair data, he said, before discussing data from 12 diploid human genomes. In one example, proximity ligation revealed alignment errors in NA19240, a reference just submitted to GenBank that had sections of chromosome 4 incorrectly placed on chromosome 1, among other problems.
We’d like to thank the workshop organizers for a great event!
The Department of Energy has its eyes on an unassuming solution to our bioenergy needs: Aspergillus. The fungal genus contains hundreds of species, which include powerful pathogens, industrial cell factories, and prolific producers of bioactive secondary metabolites.
The DOE’s Joint Genome Institute (JGI) has embarked on an ambitious plan to sequence, annotate and analyze the genomes of 300 Aspergillus fungi, and the first results are in.
In a study published in the Proceedings of the National Academy of Sciences, “Linking secondary metabolites to gene clusters through genome sequencing of six diverse Aspergillus species,” a team led by researchers at the JGI, in partnership with the DOE’s Joint BioEnergy Institute (JBEI) and the Technical University of Denmark (DTU), describes how it applied SMRT Sequencing to four diverse Aspergillus species (A. campestris, A. novofumigatus, A. ochraceoroseus, and A. steynii), producing very high-quality genome assemblies that can serve as reference strains for future comparative genomics analyses.
Two additional strains (A. taichungensis and A. candidus) were also sequenced and a comparative analysis involving these and other Aspergillus genomes was then conducted, allowing the team to identify biosynthetic gene clusters for secondary metabolites (SMs) of interest.
“One of the things we found to be interesting here was the diversity of the species we looked at; we picked four that were distantly related,” says study senior author Mikael R. Andersen, Professor at DTU. “With that diversity comes also chemical diversity, so we were able to find candidate genes for some very diverse types of compounds.”
Using a new analysis method developed by first author Inge Kjaerboelling, the team looked for genes found in all producer species and was able to “elegantly pinpoint the genes,” Andersen adds.
Among the traits they traced were allergens, virulence, and pathogenicity. Aspergillus fungi are also known to contain more than 250 carbohydrate-active enzymes (CAZymes), which break down plant cell walls. Knowledge of how this works could help the DOE as it pursues sustainable alternative fuels using bioenergy feedstock crops.
The fungal species’ secondary metabolites are also of interest to DOE researchers, as these small molecules have the potential to act as biofuel and chemical intermediates. Determining the structures of purified secondary metabolites is often relatively straightforward, but connecting these molecules to their biosynthetic pathways can be quite challenging, says study co-author Scott Baker, a fungal researcher at the Environmental Molecular Sciences Laboratory, a DOE Office of Science User Facility located at the Pacific Northwest National Laboratory.
“We show that using comparative genomics can efficiently lead to reasonable predictions of gene clusters involved in biosynthetic pathways,” Baker says.
The authors hope that by characterizing the identity and roles of secondary metabolites, and the genes necessary for their generation, they will discover potential tools for improving the ability to process recalcitrant biomass into precursors for biofuels and bioproducts.
A new review in Nucleic Acids Research offers a sweeping look at clinical uses for SMRT Sequencing, concluding:
“The myth that SMRT sequencing is too error prone to be diagnostically useful is being expunged and replaced by evidence that it offers advantages over short-read sequencers.”
The authors continued, “Just as second-generation platforms stepped beyond Sanger sequencing and enabled a revolution in genomics medicine, third-generation single molecule sequencing platforms will likely be the next genetic diagnostic revolution.”
“Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics,” written by Simon Ardui, Joris Vermeesch, and Matthew Hestand at KU Leuven and Adam Ameur at Uppsala University, offers a great overview of how SMRT Sequencing is being used in clinically relevant applications ranging from cancer to reproductive medicine and more. The paper notes that SMRT Sequencing offers tremendous benefits because it resolves many problems with short-read platforms: “limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles.” In addition, SMRT Sequencing has “higher consensus accuracies and can detect epigenetic modifications from native DNA,” Ardui et al. write.
“SMRT sequencing is opening up new diagnostic avenues, such as the ability to determine tandem repeat lengths, interruptions, and even epigenetics in a single test at base pair resolution,” the scientists report. “Long read sequencing is already considered the gold standard for some applications, such as for HLA genotyping for tissue transplants.”
The review walks through many of those applications, offering prominent examples for each. Resolving tandem repeats, for example, is already important for Fragile X syndrome, spinocerebellar ataxia, and other repeat expansion disorders. “Replacing Southern Blots with faster and more direct SMRT sequencing will greatly enhance … repeat disorder diagnostics,” the scientists note. They also cover examples such as distinguishing pseudogenes, needed for CYP2D6 analysis for drug metabolism studies; identifying fusion genes to guide therapy selection for cancer patients; and infectious disease analysis; among many others.
Looking ahead, the review cites data indicating that whole transcriptome and whole genome sequencing on the PacBio system will soon see regular clinical utility. Regarding the Iso-Seq method, “as costs drop and throughput increases, unbiased PacBio expression and isoform detection will become routine in the near future,” the scientists write. They also note that “SMRT sequencing is greatly expanding the utility of WGS, permitting a factor greater in assembly completeness … even nearing reference genome contig sizes and including diploid aware assemblies.”
A project that sparked widespread interest and a successful science crowdsourcing campaign has inspired an international collaboration that produced two high-quality reference genomes, as well as a draft genome of a related beetle. And the results have shed light on the evolution of bioluminescence.
We’ve been following the progress of Team Firefly since the team of scientists from MIT, University of Rochester, Brigham Young University, Indiana University, Cornell University, and Tufts University narrowly lost our 2016 SMRT Grant competition. The project to sequence the genome of the Big Dipper Firefly, Photinus pyralis, was ultimately crowdfunded through the Experiment site and our Genome Galaxy Initiative.
In its latest update, the team announced it had joined forces with collaborators around the globe to bring two more genomes into the fold: the Japanese “Heike” firefly, Aquatica lateralis, and the bioluminescent click-beetle, or “cucubano”, Ignelater luminosus.
The re-named Team Bioluminescent Beetles offered a sneak peek into its results by publishing a pre-print on the bioRxiv service, “Firefly genomes illuminate the origin and evolution of bioluminescence.” The scientists have also made the genome data downloadable at http://www.fireflybase.org.
“One of most intriguing findings of our results so far is that we think that fireflies and click-beetles actually evolved their very similar bioluminescent systems independently, making beetle bioluminescence a possible new example of parallel evolution,” writes team member Tim Fallon, of the Whitehead Institute for Biomedical Research at MIT.
The data also provide insight into the evolution of other traits, including chemical defenses and the viral and microbial holobiome associated with the unique lifestyle of bioluminescent beetles.
Until now, scientists have been in the dark about the genes behind firefly bioluminescence, even though firefly luciferase genes are widely used in agricultural and biomedical research. The team hopes its findings could help improve these engineered bioluminescent systems and also aid species conservation efforts.
Interested in pitching your own genome sequencing project? We are partnering with our certified service provider GENEWIZ to offer a ‘Sequence the Tree of Life’ SMRT Grant program. Submit your proposal for a chance to win sequencing on the Sequel System by March 25.