This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
The ability to study the speciation of an animal in real time is a dream come true for evolutionary and developmental biologists. A group of Japanese researchers has gotten that opportunity, thanks in part to SMRT Sequencing.
Scientists at the University of Tokyo were the first to create a reference genome for an inbred strain of the medaka fish (Oryzias latipes), genome size ~800 Mb, in 2007. The genome assembly was created using Sanger sequencing, but contained low-quality regions and 97,933 sequence gaps. So, the team started from scratch with long-read sequencing to generate genome assemblies with far less missing sequence.
In a paper published in Nature Communications, senior authors Hiroyuki Takeda and Shinichi Morishita report new assemblies generated via PacBio long-read sequencing from three geographically isolated medaka strains. These high-quality assemblies allowed them to dive deeper than ever before into the genetics of the fish, and to discover new insights into how previously difficult-to-detect centromeres and large-scale structural variants evolve and contribute to genome diversity during vertebrate speciation.
“Highly accurate long contigs have been useful in enumeration of structural variants (SVs), filling gaps such as centromeres, extending contigs to telomeres, and phasing haplotypes,” the authors write.
The team focused its attention on centromeres, which are difficult to sequence and assemble with short-read and even Sanger platforms. “Once speciation is completed, representative centromeric monomers are highly diversified among 282 species; however, centromere evolution during speciation and its relevance with speciation are unknown,” the authors note.
With this in mind, the team sequenced the genomes of three medaka inbred strains derived from different local subpopulations: HNI from northern Japan, Hd-rR from southern Japan (the strain sequenced for the original reference genome), and HSOK from east Korea.
Originally considered a single species since they can mate and produce healthy offspring under laboratory conditions, the strains have accumulated genetic mutations and phenotypic diversity over a long period of geographical separation. They are now thought to be in the middle of speciation, making them the perfect platform for analyzing this type of evolution, the authors report.
Combining PacBio data with centromere-specific DNA probes and fluorescence in situ hybridization experiments, the team reports obtaining “an unprecedented resource of centromeric repeats of length 20–345 kbp in vertebrates.”
They found that the position of centromeres tended to be preserved unless chromosomal rearrangement took place on a large scale. That was the case for the medaka, whose genome remained the same for millions of years until fissions, fusions, and translocations reshaped it.
The scientists further discovered that this evolution happened at a different pace among the three strains, depending on the shape and sequence of the centromeres. Centromeric monomers in acrocentric chromosomes evolved more slowly than those in non-acrocentric chromosomes, the team reports. Using AgIn software, the authors estimated methylation states of CpG sites from kinetic SMRT Sequencing information and found divergent methylation patterns, suggesting that centromeres accumulate epigenetic diversity as well as sequence diversity during speciation.
They observed that each local strain has independently experienced thousands of mid-sized (1-50 kbp) insertion events—not enough to cause reproductive isolation, but possibly enough to participate in the regulation of genes and contribute to phenotypic variations.
“These findings reveal the potential of non-acrocentric centromere evolution to contribute to speciation,” conclude the authors. “Further analysis of the mid-sized insertions associated with novel transcripts and increased transcription will provide important clues to the genomic basis for vertebrate speciation.”
Unraveling the role of the microbiome in human health and environmental samples is an emerging priority in scientific study. But despite the best advances in sequencing technology, identifying the bacteria, fungi, and other organisms present in complex samples remains a huge challenge.
Metagenomic shotgun sequencing can read chromosomes, plasmids, and bacteriophages, and comparison to reference genome sequences can be used to place them into putative taxa and species bins, but these methods fail to sufficiently distinguish between genomes that are very similar.
A team of scientists from the Icahn School of Medicine at Mount Sinai, Sema4, and other institutions has come up with a novel solution: a computational method that uses PacBio long-read sequencing of metagenomic DNA to identify methylated motifs and create an epigenetic barcode that enables more precise microbiome analysis.
The process takes advantage of the fact that bacteria and archaea add methyl groups to nucleotides in a highly sequence-specific manner, and that the methylated motifs often differ among species and strains.
The team took advantage of inter-pulse duration values that represent the time it takes a DNA polymerase to translocate from one nucleotide to the next during SMRT Sequencing. This measure can distinguish between methylated and non-methylated bases. They calculated methylation scores across motifs of several bacterial samples and murine fecal samples and created methylation profiles, which were used alongside sequence composition features to assemble contigs into species- and strain-level bins.
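As a rough illustration of the idea (not the authors' implementation), a motif-level methylation score can be sketched as the difference between the average inter-pulse duration observed at a motif's sites in native DNA and the average expected for unmethylated DNA. The function name, the input layout, and all numbers below are hypothetical.

```python
from statistics import mean


def motif_methylation_score(native_ipds, control_ipds):
    """Mean native IPD minus mean control IPD across the sites of one motif.

    Hypothetical simplification: both arguments are lists of inter-pulse
    durations (arbitrary units). A clearly positive score suggests the
    polymerase slowed down at the motif, as expected for methylated bases.
    """
    if not native_ipds or not control_ipds:
        raise ValueError("need at least one IPD value per group")
    return mean(native_ipds) - mean(control_ipds)


# Made-up values for illustration only.
methylated_like = motif_methylation_score([2.0, 2.5, 3.0], [1.0, 1.0, 1.0])
unmethylated_like = motif_methylation_score([1.0, 1.25, 0.75], [1.0, 1.0, 1.0])
```

Collecting one such score per motif for each contig would give a methylation profile, which is the kind of feature the post describes being used alongside sequence composition for binning.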
In a paper published in Nature Biotechnology, senior author Gang Fang describes how the method was also able to link mobile genetic elements, including antibiotic resistance-encoding plasmids, to their host species in a real microbiome sample.
Although their sequence coverages and composition profiles often differ, plasmid and chromosomal DNA of the bacterial host are methylated by the same set of methyltransferases, resulting in matching methylation profiles, the authors note.
“The biomedical community has long needed a microbiome analysis method capable of resolving individual species and strains with high resolution,” Fang said in a statement.
The method could ultimately prove useful in both research and clinical settings, since it allows for linking mobile genetic elements to their bacterial hosts. This information makes it possible for scientists to more accurately predict virulence and antibiotic resistance of individual bacterial species and strains, among other important traits.
In a new publication, scientists from Anthony Nolan Research Institute and the UCL Cancer Institute present an in-depth analysis of the utility of SMRT Sequencing for Human Leukocyte Antigen (HLA) typing. They assessed more than 100 cell lines and found that PacBio long-read sequencing significantly improves the accuracy of HLA typing.
“Single molecule real-time (SMRT®) DNA sequencing of HLA genes at ultra-high resolution from 126 International HLA and Immunogenetics Workshop cell lines” comes from lead author Thomas Turner, senior author Steven Marsh, and collaborators. The scientists implemented SMRT Sequencing to perform high-resolution HLA typing for 126 B-lymphoblastoid cell lines, including a group of 107 cell lines established in 1987 that is now an essential resource for the community. The goal of the present study was to increase the resolution of the reference sequences in the IMGT/HLA database and improve standardization of HLA typing calls for these cell lines.
HLA genes — used to evaluate donor-recipient tissue match before organ transplant, as well as other immune-related traits — are among the most polymorphic in the genome. Characterizing them has been a challenge with short-read sequencing platforms, but recent efforts to perform full-length sequencing and phasing of the genes with PacBio long-read sequencing have generated impressive results. Indeed, the authors write, “Anthony Nolan’s Histocompatibility Laboratory now routinely uses SMRT sequencing for HLA typing.”
For this project, scientists carried out amplicon sequencing for full-length gene analysis of HLA class I genes and partial analysis of class II genes. In total, they sequenced 931 HLA alleles, with 96% yielding results that matched previously established HLA types for those cell lines. Of the few dozen discrepancies, Turner et al. discovered that 10 harbored novel alleles and 13 were different because of zygosity results, while many others included allele types not previously reported for those cell lines. Confirmation studies showed that these SMRT Sequencing results accurately resolved ambiguities and corrected errors in earlier HLA typing efforts. “We identified numerous discrepancies and novel intronic polymorphisms, extended several alleles to full genomic sequences, and confirmed the existence of some alleles identified by other researchers,” the team reports.
“The work presented here has further demonstrated the efficacy of SMRT sequencing to provide the highest resolution, unambiguous HLA typing data when full genes are sequenced,” the scientists conclude. “This knowledge ensures the continued usefulness of the reference cell line panel as a resource to the immunogenetics community in the age of next generation DNA sequencing.”
What can one koala tell us about an endemic that threatens the survival of its species? A great deal, it turns out.
While doing a deep dive into the genome of a wild female koala, a team of Australian scientists led by Matthew Hobbs and Andrew King of the Australian Museum Research Institute were able to unravel some of the complexity of the species-specific gammaretrovirus KoRV.
The results, published recently in Nature, paint a picture of a rapidly evolving and diversifying virus, with implications for the long-term survival of the koala, as well as our understanding of retroviral-host species interactions.
The study allowed the researchers to see interspecies transmission, multimerization of sequences in the long terminal repeats, and recombination between different retroviruses. These processes have been reported for other retroviruses, but there they occurred millions of years ago rather than very recently, as in the koala.
KoRV is a retrovirus closely related to gibbon ape leukemia virus, and is thought to be the result of an interspecies transmission. Implicated in the pathogenesis of two major koala diseases, hematopoietic neoplasia and the endemic chlamydiosis, it is considered to be a significant threat to the survival of the species.
Several KoRV subtypes have been proposed. Presumed to be the original transmitted strain, KoRV-A is endogenous and widespread in northern Australian koalas, which are thought to be 100 percent infected. KoRV-B is a more recent, more virulent subtype believed to be the result of recombination. Additional variants — KoRV-C, D, E, F, G, H and I — have also recently been identified, but it has been difficult to examine the population of KoRV and KoRV-like insertions in any koala genome.
PacBio long-read sequencing technology finally made it possible. DNA from the koala’s spleen was sequenced to give an estimated overall coverage of 57.3-fold based on a genome size of 3.5 Gb. The authors used SMRT Sequencing due to its capacity to generate sequences of up to ~70 kb that carry full-length (8.4 kb) KoRV insertions and substantial flanking koala genome sequence. This provided a considerable advantage over short reads, which could not resolve the different KoRV insertion sites or types.
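The coverage figure above is simple arithmetic: fold coverage is the total number of sequenced bases divided by the genome size. The sketch below works backward from the reported 57.3-fold and 3.5 Gb to the implied sequencing yield, and adds a small check of whether a read is long enough to span an 8.4 kb insertion. The function names and the 1 kb flank requirement are assumptions made for illustration, not values from the paper.

```python
GENOME_SIZE_BP = 3.5e9      # genome size quoted in the post
REPORTED_COVERAGE = 57.3    # fold coverage quoted in the post


def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Total bases implied by the two reported numbers (about 2.0e11, or ~200 Gb).
implied_bases = REPORTED_COVERAGE * GENOME_SIZE_BP


def spans_insertion(read_len, insertion_len=8_400, flank=1_000):
    """True if a read could cover the insertion plus `flank` bases each side.

    The 1,000-base flank is an illustrative assumption.
    """
    return read_len >= insertion_len + 2 * flank


long_read_ok = spans_insertion(70_000)   # a ~70 kb read
short_read_ok = spans_insertion(150)     # a typical short read length
```

Under these assumptions only the long read can anchor a full-length insertion in unique flanking sequence, which is the advantage the authors describe.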
“Obtaining sequence data from elements (such as retroviruses) that are repeated throughout the genome cannot be done with short read sequencing technology, which is why we used long read PacBio sequencing in our study,” the authors add.
The team reported putative somatic integrations of five distinct forms of KoRV (KoRV-A, KoRV-B, KoRV-D, KoRV-E), as well as germline evidence of KoRV-A. They also found an endogenous recombinant element (recKoRV) in which most of the KoRV protein-coding region was replaced with an ancient, endogenous retroelement.
“This diverse pool of viral variants in the same animal highlights the range of strategies being used by this retrovirus as it invades, or comes to equilibrium with its new host,” the authors add.
“As KoRV-A, B and potentially other more pathogenic KoRV types sweep through koala populations, we might expect to see worsening effects of chlamydial disease. This highlights the importance of understanding the complex mix of KoRV types present in an individual animal.”
Read the full report and learn more about the project in this video presentation from a PAG 2017 Workshop by Rebecca Johnson, a co-author on the paper and the Director of the Australian Museum Research Institute. Johnson will also be presenting her work at the Advances in Genome Biology and Technology meeting in February 2018.
In a recent paper, scientists in Germany call for a genomic database of Klebsiella pneumoniae strains to accelerate strain identification as well as drug-resistance status. To that end, they used SMRT Sequencing to generate high-quality assemblies for 16 isolates collected in German hospitals.
“Monitoring microevolution of OXA-48-producing Klebsiella pneumoniae ST147 in a hospital setting by SMRT sequencing” comes from lead authors Andreas Zautner and Boyke Bunk, senior authors Jörg Overmann and Wolfgang Bohne, and collaborators at University Medical Center and other institutes in Germany.
The urgency to characterize K. pneumoniae strains comes from the rapid rise of carbapenem-resistant Klebsiella: drug resistance, and increasingly multidrug resistance (MDR), is a major public health threat with these infections. “A continuous monitoring of [strain type] distribution and its association with resistance and virulence genes is essential for early detection of successful K. pneumoniae lineages,” the scientists report.
K. pneumoniae strains carry plasmids encoding different types of carbapenemase, which confers resistance to the carbapenem class of antibiotics. OXA-48 is currently the most common carbapenemase found in K. pneumoniae isolates in Germany, according to the authors; similar strains are commonly found in North Africa, the Middle East, and European countries along the Mediterranean. The team chose to focus on OXA-48 strains, selecting 16 isolates collected in 2013 and 2014 for whole genome SMRT Sequencing.
The technology choice was no accident. “A comprehensive K. pneumoniae database of closed genomes is necessary for a complete understanding of the genome plasticity of these organisms and can significantly improve the tracking of MDR isolates,” the scientists write. With SMRT Sequencing, they were able to generate closed genomes. In most cases they used a single SMRT Cell per strain, and “a consensus concordance of QV60 could be confirmed for all genomes,” they report.
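For readers unfamiliar with quality values, the conversion below assumes QV is used in the standard Phred sense, where QV = -10 · log10(error probability). Under that assumption, QV60 corresponds to roughly one error per million bases. The function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


qv60_error = qv_to_error_rate(60)          # about 1e-6
qv60_accuracy = 1 - qv60_error             # about 99.9999% concordance
```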
Based on the 16 genome assemblies, the scientists determined that half of the isolates shared the same type, ST147, and differed by no more than 25 SNPs throughout the core genome. They identified several plasmids, including a novel linear plasmid prophage of Klebsiella oxytoca. “The comparative whole-genome analysis revealed several rearrangements of mobile genetic elements and losses of chromosomal and plasmidic regions in the ST147 isolates,” they write.
“Single molecule real-time sequencing allowed monitoring of the genetic and epigenetic microevolution of MDR OXA-48-producing K. pneumoniae,” the team concludes, noting that the approach was amenable to spotting individual SNPs, as well as complex rearrangements.
We’re pleased to announce the winner of this year’s ‘Open Your Eyes to Isoform Diversity’ SMRT Grant, which was launched during the American Association for Cancer Research annual meeting. The grant program, co-sponsored by PacBio and GENEWIZ, received many compelling entries, and it was a challenge choosing just one winner.
Congratulations to Andrew Ludlow, a new faculty member at the University of Michigan, who impressed reviewers with his proposal to investigate the splicing of transcripts regulated by the oncogene NOVA1. Ludlow notes that in lung cancer cells, NOVA1 acts as a splicing enhancer to produce full-length hTERT and promote telomerase activity, which is essential for cancer cell survival. He proposes to explore whether NOVA1 also regulates other critical hallmarks of cancer by studying changes in the transcriptome of a lung cancer cell line upon NOVA1 knockdown. Since NOVA1 regulates alternative splicing, the unambiguous resolution of isoforms enabled by the Iso-Seq method will be key to developing a complete picture of how NOVA1 alters the functionality of key players in biochemical pathways involved in cancer development.
Thank you to all of the applicants who participated in the ‘Open Your Eyes to Isoform Diversity’ grant program.
For another chance to win, check out our latest SMRT Grant Program with GENEWIZ, focused on discovering structural variation.
They are colonizers and killers, growing as large as 2,400 acres, leaving devastation in their wake. Armillaria fungi are the cause of root rot disease in forests, fields, parks, and vineyards in more than 500 host plant species across the world. But despite this huge impact on agriculture, the pathogenicity of Armillaria species has been poorly understood.
A new international study led by Hungarian researchers, published in Nature Ecology & Evolution, reveals novel insights into how the fungus spreads and kills. Lead author György Sipos of the University of Sopron, senior author László Nagy of the Hungarian Academy of Sciences, and an international team of collaborators used PacBio long-read sequencing to generate the haploid genomes of four Armillaria species: A. ostoyae, A. cepistipes, A. gallica, and A. solidipes. They also compared them to 22 related saprotrophic, hemibiotrophic, and mycorrhizal fungi.
In the paper, the team reports that Armillaria genomes are “unusually large” for fungi, containing as much as 85 Mb and nearly 26,000 genes, while members of the Physalacriaceae family used for comparison had genomes smaller than 36 Mb with fewer than 14,000 genes. The scientists note that the Armillaria genomes seem to have “drawn upon ancestral genetic toolkits for wood-decay, morphogenesis and complex multicellularity.”
Based on gene mapping and phylogenomic analysis, the authors estimate that the genus diverged from its closest relatives 21 million years ago. While other organisms tended to lose genes in that time, Armillaria species showed a net genome expansion. “The transposable element (TE) content of Armillaria genomes shows a modest expansion relative to other Agaricales and an even distribution along the scaffolds, suggesting that their genome expansion is not driven by transposon proliferation, as observed in other plant pathogens,” the scientists report.
The team also investigated rhizomorphs, root-like materials through which Armillaria spread. “Rhizomorphs are some of the most unique structures of Armillaria spp. that enable them to become the largest terrestrial organism on Earth,” the scientists write. “They express a wide array of genes involved in secondary metabolism, defense, [plant cell wall] degradation and to a lesser extent pathogenesis, which indicate active nutrient uptake and adaptations to a soil-borne lifestyle.”
In healthy forest systems, some fungi are useful for capturing carbon and culling the weakest trees. But Armillaria are not so choosy, and stresses like drought can hasten the destruction of entire forests. The largest expanse of the fungus, as the New York Times points out, occupies an area in Oregon “the size of three Central Parks and may weigh as much as 5,000 African elephants.”
The authors hope their study will contribute to combating future losses. With more genome resources and follow-up studies, they write, this information “could soon bring the development of efficient strategies for containing the spread and damage of Armillaria root-rot disease in various forest stands within reach.”
Three recent publications report results from transcriptome studies of plants often used for medicinal purposes, all powered by SMRT Sequencing and the Iso-Seq method. The papers on ginseng, Huangqi, and tea collectively show the importance of sequencing full-length isoforms for the most accurate and comprehensive gene expression analysis; they also demonstrate the usefulness of characterizing gene models for complex species in the absence of a reference genome assembly.
In “Isoform Sequencing Provides a More Comprehensive View of the Panax ginseng Transcriptome,” scientists from the National Institute of Horticultural and Herbal Science, Chungbuk National University, and other institutions in Korea describe efforts to characterize the functional genomics of P. ginseng. They studied four tissues (flower, leaf, stem, and root) from the ginseng plant, producing more than 135,000 assembled transcripts, averaging 3.2 kb in length. “We successfully identified unique full-length genes involved in triterpenoid saponin synthesis and plant hormonal signaling pathways, including auxin and cytokinin,” the team reports. “Transposable elements (TEs) were also identified, suggesting transcriptional activity of TEs in P. ginseng.” They also took a deep dive into 88 genes to assess alternative splicing. The scientists believe this work will facilitate the discovery of novel genes in this plant and its relatives, and suggest their results will be important for crop breeding programs.
Separately, scientists at the University of Adelaide in Australia and Shanxi University of Traditional Chinese Medicine in China published “Long read reference genome-free reconstruction of a full-length transcriptome from Astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis.” A. membranaceus is also known as Huangqi and is a commonly used herb in Chinese medicine for cancers, diabetes, nephritis, and other diseases. The team studied transcriptomes of the plant’s leaf and root, identifying nearly 28,000 full-length, unique transcripts in leaf tissues and more than 22,000 in root tissues. “Compared with previous studies that used short read sequencing, our reconstructed transcripts are longer, and are more likely to be full-length and include numerous transcript variants,” the scientists report. Their analysis highlighted complex patterns of alternative splicing and enabled the detection of long-noncoding RNAs as well as characterization of biosynthesis genes. “Our study provides a practical pipeline to characterise the full-length transcriptome for species without a reference genome and a useful genomic resource for exploring the biosynthesis of active compounds in Astragalus membranaceus,” they conclude.
Finally, “Transcriptome Profiling Using Single-Molecule Direct RNA Sequencing Approach for In-depth Understanding of Genes in Secondary Metabolism Pathways of Camellia sinensis” comes from scientists at Anhui Agricultural University in China. They were investigating the biosynthesis of metabolites such as flavonoids and caffeine in this tea plant, performing Iso-Seq analysis of eight different tissues. The approach led to the identification of 94 full-length transcripts, plus four alternative splicing events, associated with the biosynthesis of the compounds of interest. “The longer reads improved the quality and accuracy of transcripts generated from [a previous] short-read assembly,” the scientists report. The new transcripts “provide a more accurate depiction of gene transcription and will greatly improve C. sinensis genome annotation in the future.”
We’re pleased to be teaming up with the Wellcome Trust Sanger Institute on a project to celebrate their twenty-fifth anniversary: generating high-quality genome assemblies for 25 species that are integral to the ecosystems found in the United Kingdom.
For this work, Sanger scientists will use the Sequel System and complementary technologies to produce reference-grade assemblies. Twenty organisms have already been selected, and the last five will be chosen by a public vote reminiscent of our own SMRT Grant program, which earlier this year saw dingo beat out bombardier beetle, sea slug, temple pitviper, and pink pigeon for the coveted sequencing prize. Have a look at the Sanger finalists!
From dormouse to dragonfly, hogweed to hornet, the 25 species represent the considerable natural biodiversity of the UK. Organisms run the gamut from endangered to iconic, thriving to dangerous.
In a statement announcing this celebration, Sanger’s Associate Director Julia Wilson said, “Through sequencing these 25 genomes, scientists will gain a better understanding of UK species, how they arrived here, their evolution and how different species are adapting to a changing environment. The results could reveal hidden truths in these species, and will enable the scientific community to understand how our world is constantly changing and evolving around us. We want to celebrate the 25th anniversary of the Sanger Institute in a special ‘Sanger’ way, and I am excited to see how the 25 genomes project unfolds.”
Voting for the final five species runs through December 8th. We’d like to congratulate the Sanger team on a remarkable 25 years, and we look forward to seeing the results of this latest project!
Scientists in Singapore, Hong Kong, and Malaysia recently reported the high-quality draft genome assembly of Durio zibethinus, a type of durian fruit commonly eaten in southeast Asia. The team used SMRT Sequencing and chromosome mapping techniques to produce the assembly, which will be an important tool for agricultural monitoring.
“The draft genome of tropical fruit durian (Durio zibethinus)” was published in Nature Genetics from lead authors Bin Tean Teh, Kevin Lim, Chern Han Yong, Cedric Chuan Young Ng, senior author Patrick Tan, and collaborators at the Duke-NUS Medical School and other institutions. They chose to study durian, a tropical fruit best known for its strong sulfuric smell, because it represents an economically important food crop in the region. “In 2016 alone, durian imports into China accounted for about $600 million, as compared to $200 million for oranges, one of China’s other main fruit imports,” the authors note. Until this project, however, there had been little genomic research into the plant or even its close relatives.
The scientists used PacBio sequencing to 153-fold coverage and applied both FALCON and FALCON-unzip to generate a haplotig-merged assembly. The 738 Mb genome was then enhanced using Chicago and Hi-C methods, increasing scaffold N50 lengths to 22.7 Mb. “The final reference assembly comprised chromosome-scale pseudomolecules, with 30 pseudomolecules greater than 10 Mb in length,” the team reports.
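Scaffold N50 is the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation follows; the scaffold lengths are toy values, not the durian assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total is 32, and the two longest scaffolds (8 + 8) reach half.
toy_scaffolds = [8, 8, 4, 3, 3, 2, 2, 2]
toy_n50 = n50(toy_scaffolds)
```

Joining scaffolds with long-range data such as Chicago and Hi-C raises N50 because fewer, longer pieces are needed to reach half of the assembly.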
A comparison to gene families in other plants led the scientists to conclude that durian is most closely related to cotton and cacao, and probably shares an ancestral whole genome duplication (WGD) event with cotton. That finding means that “other subfamilies within Malvaceae … are also likely to have the cotton-specific WGD” and that the duplication event “may also have been involved in driving the evolution of unique durian traits.”
Speaking of unique durian traits, the team’s gene expression analysis of the plant revealed upregulated sulfur-related pathways as well as “durian-specific gene expansions … associated with production of volatile sulfur compounds.” Both findings offer new insight into the fruit’s odor and even suggest the biological reason for it: “Certain plants whose primary dispersal vectors are primates with more advanced olfactory systems show a shift in odor at ripening,” the authors write. “Durian—by emanating an extremely pungent odor at ripening—appears to have the characteristic of a plant whose main dispersal vectors are odor-enticed primates rather than visually enticed animals.”
The team concludes that the availability of this genome resource will have important agricultural implications. “As an example, rapid commercialization of durian has led to the proliferation of cultivars with a wide discrepancy in prices and little way to verify the authenticity of the fruit products at scale,” they write. “A high-quality genome assembly may aid in identifying cultivar-specific sequences, including SNPs related to important cultivar-specific traits (such as taste, texture, and odor), and allow molecular barcoding of different durian cultivars for rapid quality control.” They also hope the work will fuel studies of other durian varieties to characterize the plant’s natural biodiversity.
Scientists at Huazhong Agricultural University in China and collaborating institutions recently published results of an Iso-Seq analysis of allotetraploid cotton. The team’s findings are expected to be particularly useful for functional genomics, driving advances for cotton breeders as well as research biologists.
“A global survey of alternative splicing in allopolyploid cotton: landscape, complexity and regulation” was published in New Phytologist by lead authors Maojun Wang and Pengcheng Wang, senior author Xianlong Zhang, and collaborators. Existing genome assemblies for polyploid cotton were not “released with a well-annotated transcript isoform set,” the scientists write, “and so the extent and differences in [alternative splicing (AS)] of homoeologous gene transcripts remain poorly understood.”
To remedy the situation, the team used SMRT Sequencing to analyze RNA in 12 plant samples across six tissues, including root, leaf, and petal, among others. They focused on the allotetraploid Gossypium barbadense because “allotetraploid cottons contribute to the vast majority of fibre yield every year, and their recently published genome sequences of allotetraploid cottons are of interest to both breeders and genome biologists,” they explain. As part of this project, the scientists developed a new Iso-Seq data analysis pipeline geared toward complex genomes. “This includes methods for quality control of raw data, classification of transcripts, clustering and transcriptome analysis,” they note. The tool is available for public use with step-by-step instructions.
Data analysis revealed nearly 177,000 unique, full-length transcripts of almost 45,000 gene models. Aligning to the reference genome and comparing with its annotation allowed the team to extend genes at both 5’ and 3’ ends and expand the number of transcripts linked to each gene. The previous annotation had just 20% of genes associated with more than one transcript, while the SMRT Sequencing annotation boosted that to 57%.
“These data led us to identify 15,102 fibre-specific AS events and estimate that c. 51.4% of homoeologous genes produce divergent isoforms in each subgenome,” the scientists report. Among other key findings: thanks to alternative splicing, the same gene can be regulated differently by various microRNAs.
“This study provides a rich resource of transcript isoforms for the cotton community, evolutionary biologists and provides a useful reference for other species,” the scientists conclude. “These results will facilitate future functional genomics studies and enhance our understanding of AS in polyploid species.”
New at ASHG this year were the CoLaboratories, educational theaters featuring 30-minute talks on a variety of clinical, laboratory, and data analysis topics. We were honored to present in the inaugural CoLabs today, with PacBio scientists offering tips on CRISPR/Cas9 enrichment as well as long-read whole genome sequencing for structural variant discovery.
Tyson Clark, director of applications development, gave a talk titled “Targeted Enrichment without Amplification and SMRT Sequencing of Repeat-Expansion Disease Causative Genomic Regions.”
He demonstrated a novel technique using the CRISPR/Cas9 system to target specific genes or elements in the human genome. Having an amplification-free method for this is particularly important for challenging targets, such as the many repeat expansions known to cause disease.
Combining this strategy with SMRT Sequencing allows for highly accurate, uniform, and complete sequencing of complex genomic regions that have proven intractable with short-read technologies. Clark reported successfully targeting and sequencing loci involved in several repeat expansion disorders — including HTT, FMR1, and ATXN10, among others — even when those regions had extreme GC content.
This work was also presented in an ASHG poster by Clark earlier in the week.
Staff scientist Aaron Wenger delivered a talk entitled “PacBio Long-Read WGS for Structural Variant Discovery.” As he noted, structural variant analysis has become increasingly important as the community has realized that SNPs explain only a fraction of the DNA sequence differences between any two individuals, while structural variants account for the majority of those differences. These variants have been more difficult to detect, though, because of their size and complexity. Unlike short-read platforms, SMRT Sequencing can accurately characterize the vast majority of structural variants in a human genome, even at low coverage.
Wenger offered a look at updated structural variant calling protocols for the Sequel System, including how to tune parameters such as sequence coverage, and shared quick case studies.
Many thanks to all the ASHG attendees who joined the CoLabs to hear these talks!
The PacBio team hosted a luncheon workshop at ASHG yesterday titled “Population and Clinical Genetics Studies Using Long-read SMRT Sequencing.” Thanks to all the conference attendees who took time out of a very busy meeting to join us! If you couldn’t attend, we summarized the highlights below and will share recordings of the presentations soon.
Long-read Sequencing – for Detecting Clinically Relevant Structural Variation
Han Brunner, Head of Clinical Genetics at Radboud University Medical Center, kicked off the event with a talk about using SMRT Sequencing to detect clinically relevant structural variation. Introducing himself as a consumer of sequencing data, rather than a technology expert, he described a collaboration with PacBio to uncover structural variants associated with intellectual disability. Patients with this symptom often never receive a diagnosis, and while that situation has improved with exome and whole genome sequencing, it still isn’t fully addressed. In an assessment of 100 patients with severe disability, Brunner said, 38 of them still had no diagnosis even with WGS.
To overcome the challenge, his team turned to the Sequel System. In preliminary results from sequencing five trios, his team found 21 Mb of sequence revealed with SMRT Sequencing that had gone unresolved with short-read sequencing. They also found as many as 25,000 structural variants per genome, two-thirds of which were invisible to short-read technology. Brunner noted that the ability to phase data with PacBio sequencing provides “very useful information.” He also predicted that this approach could be implemented in the clinic within a year, based on the time it took to move exome sequencing toward patient care.
Expansion Sequence Variations Underlie Distinct Disease Phenotypes in SCA10
Next up was Karen McFarland, Research Assistant Professor at the University of Florida, who spoke about recent efforts to resolve repeat expansion regions associated with spinocerebellar ataxia type 10 (SCA10), a progressive neurodegenerative disorder. SCA10 is associated with a repeat expansion in the ATXN10 gene, ranging from nine repeats in unaffected individuals to as many as 4,500 in affected people. That’s as much as 22.5 kb of sequence, which has precluded thorough characterization of the region in the past. With SMRT Sequencing, though, McFarland and her team have not only successfully sequenced the region, but have also been able to accurately detect interruptions in the sea of ATTCT repeats that appear to have clinical consequence. Recently, they used the Cas9 enzyme to perform target capture of the region directly from genomic DNA of family members with different ataxia-related phenotypes.
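The 22.5 kb figure follows directly from the repeat unit: 4,500 copies of the 5-base ATTCT motif. Here is a minimal Python sketch of that arithmetic, plus a naive way to flag interruptions in a repeat tract. The function names and the toy sequence (including the ATTCC unit) are hypothetical illustrations, not McFarland's method or real patient data, and the scan assumes an in-frame tract with no insertions or deletions.

```python
MOTIF = "ATTCT"  # pentanucleotide repeat unit named in the talk


def expansion_length_bp(repeat_count, motif=MOTIF):
    """Length in bp of a pure tract with `repeat_count` copies of `motif`."""
    return repeat_count * len(motif)


def find_interruptions(tract, motif=MOTIF):
    """Return (offset, unit) for every motif-sized window that differs from `motif`.

    Naive sketch: walks the tract in steps of len(motif), so it only works
    when the tract starts in frame and contains no indels.
    """
    step = len(motif)
    return [
        (i, tract[i:i + step])
        for i in range(0, len(tract), step)
        if tract[i:i + step] != motif
    ]


print(expansion_length_bp(4500))  # 22500 bp, i.e. 22.5 kb

# Toy tract: three clean repeats, one made-up interrupting unit, two clean repeats.
toy_tract = "ATTCT" * 3 + "ATTCC" + "ATTCT" * 2
print(find_interruptions(toy_tract))  # [(15, 'ATTCC')]
```

Real reads would need alignment-aware handling of sequencing errors and indels, which this sketch deliberately leaves out.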
Multi-platform Discovery of Haplotype-resolved Structural Variation in Human Genomes
Charles Lee, Scientific Director of The Jackson Laboratory for Genomic Medicine, gave the final user talk. He focused on a study from the Human Genome Structural Variation Consortium to assess many different technologies for discovering structural variants. The analysis of Yoruban, Han Chinese, and Puerto Rican trios showed that no single tool is currently capable of capturing the full range of human structural variants, which can be larger than a megabase. SMRT Sequencing, though, was found to dramatically increase the number of variants that could be detected and contributed to a seven-fold overall increase in structural variation discovery, he said. Among the variants he finds particularly interesting to pursue are inversions and mobile element insertions.
PacBio Applications Updates and Future Roadmap
Our CSO Jonas Korlach spoke as well, offering updates on structural variation, targeted sequencing and Iso-Seq analysis, as well as a look at future SMRT Sequencing developments. On the structural variation front, he noted that tools have evolved enough now that PacBio users can access a full SV detection workflow — including sequencing, read mapping, variant calling, and visualization — in the new SV Calling Software. Learn more about it or try our new project calculator on our new structural variation web page. For amplification-free target enrichment using the Cas9 protocol mentioned by McFarland, Korlach said that development is currently underway to adapt it for the Sequel System. If you’re attending ASHG, you can find out more at the CoLab session on Friday or check out the amp-free targeted sequencing web page.
Looking ahead, he told attendees that by the end of the year there will be a new accelerated protocol for library prep, a doubling of sequencing yield, and updated analysis tools in SMRT Link 5.1. Another 2-fold improvement is expected in 2018, and with the introduction of a new SMRT Cell with 8 million ZMWs (8-fold increase) in early 2019, a combined 30-fold capacity boost compared to current throughput will be achievable. These advances will help support the rising number of population studies that require an ever-increasing number of high-quality human genome assemblies.
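As a quick sanity check on the roadmap numbers, the three gains can be multiplied together. This small Python sketch assumes the improvements are independent and compound multiplicatively, which is our reading of the talk, not a stated formula; the dictionary labels are our own.

```python
from math import prod

# Fold-improvements quoted in the roadmap; the labels are ours.
gains = {
    "yield doubling by end of year": 2,
    "further improvement in 2018": 2,
    "SMRT Cell with 8 million ZMWs": 8,
}

combined_fold = prod(gains.values())
print(combined_fold)  # 32
```

The product, 32, is in line with the roughly 30-fold combined boost quoted above.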
We’ll be reporting on more from ASHG in the coming days, so stick with us!
The annual meeting of the American Society of Human Genetics kicked off with a splash yesterday in Orlando, Fla. The PacBio team was thrilled that the opening talks in the presidential address and plenary session included a significant focus on increasing diversity in genetic studies to better characterize underrepresented populations. Nancy Cox, ASHG president, highlighted a number of excellent efforts to address this but noted, “Compared with what we need, what we’ve done so far is really just a drop in the bucket.” As regular blog readers know, we work closely with groups around the world to build population-specific reference genomes and improve structural variant calling for all groups, so her message really resonated with us.
The other headline event for us yesterday was a joint workshop from the Genome Reference Consortium and Genome in a Bottle consortium, called “Getting the Most from the Reference Assembly and Reference Materials.” In the GRC portion of the event, talks from NCBI’s Valerie Schneider and McDonnell Genome Institute’s Tina Graves-Lindsay were particularly interesting for our team. Schneider offered a brief overview of GRC projects and a deeper dive into accomplishments and challenges in improving the human reference genome. As she noted, the latest build is GRCh38, which is a combination of sequences from some 70 people; she and her team continue to patch the assembly, correcting mistakes and adding new representations of genetic variation without changing existing coordinates. Graves-Lindsay spoke about the need to generate more high-quality human genome sequences for better characterization of people from all ancestries, and reported on efforts to do just that. The GRC has now used SMRT Sequencing, among other tools, to produce data for 10 diploid and two haploid genomes as a means of adding allelic diversity to the reference. Two chromosome-level assemblies reflect just how complete and contiguous these resources are. They plan to produce FALCON-unzip assemblies for all samples to increase quality and usefulness even further.
In its section of the workshop, GIAB covered reference materials, variant benchmarking, and more. Among the speakers, Baylor’s Fritz Sedlazeck and NIST’s Justin Zook gave talks that were especially meaningful for us. Sedlazeck, who has developed several important algorithms for working with long-read sequencing data and calling structural variants, encouraged the community to support large population studies focused on structural variation. He also announced a new project that will entail whole genome sequencing for 100 individuals using SMRT Sequencing and linked-read sequencing, with the goal of detecting and phasing structural variants. In his talk, Zook spoke about bringing the principles of metrology to genomics, and said that GIAB benchmarks are already widely used for clinical validation. His team is now focused on particularly challenging genetic regions and variants and recently released a structural variant call set.
The excitement continues today with our workshop, Population and Clinical Genetics Studies Using Long-read SMRT Sequencing. Please join us on the lower level of Hilton Orlando, Orange Ballroom D (connected to the Orange County Convention Center) at 12:30 p.m. for lunch and learning with Han Brunner, M.D., Charles Lee, Ph.D., FACMG, Karen McFarland, Ph.D., and our own Chief Scientific Officer Jonas Korlach, Ph.D.
Stay tuned for more updates as we report back daily on the great science being presented at ASHG.
The Human Genome Structural Variation Consortium, a successor to the 1000 Genomes Project Consortium, recently released a preprint describing an in-depth study of structural variant (SV) detection in human genomes. The scientists found that PacBio long-read sequencing and complementary technologies dramatically improve sensitivity for these important genomic elements when compared to standard short-read sequencing.
“Multi-platform discovery of haplotype-resolved structural variation in human genomes” comes from lead authors Mark Chaisson, Ashley Sanders, and Xuefang Zhao; along with corresponding authors Charles Lee, Evan Eichler, and Jan Korbel; and many other consortium members. The study involved extensive sequencing of three family trios — Han Chinese, Puerto Rican, and Yoruban — for comprehensive discovery of structural variants. “The Han Chinese and Yoruban Nigerian families were representative of low and high genetic diversity genomes, respectively, while the Puerto Rican family was chosen to represent an example of population admixture,” the scientists write.
To date, attempts to identify all structural variants in a human genome using short-read technology have been unsuccessful. These variants are biologically and clinically relevant, so it is imperative that the community find better ways of resolving them. As the authors stated, “The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association.” For this project, they add, “we integrated a suite of cutting-edge genomic technologies that, when used collectively, allow structural variants to be assessed in a near-complete, haplotype-aware manner in diploid genomes.” Tools included short-read sequencing, SMRT Sequencing, optical mapping, synthetic long reads, and single-cell/single-strand sequencing. The authors also applied multiple analysis algorithms for each type of data, further improving sensitivity.
The team identified in each genome more than 800,000 indel variants smaller than 50 bp and nearly 32,000 structural variants 50 bp or larger. That is “a sevenfold increase in structural variation compared to previous reports, including from the 1000 Genomes Project.” The authors also report the identification of “156 inversions per genome—most of which previously escaped detection—as well as large unbalanced chromosomal rearrangements.”
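The split between indels and structural variants here is purely a size cutoff at 50 bp. A minimal Python sketch of that binning is below; the function name and the toy list of length changes are hypothetical, and only the 50 bp threshold comes from the preprint as quoted.

```python
from collections import Counter

SV_MIN_BP = 50  # size cutoff quoted above: 50 bp or larger counts as an SV


def classify_variant(length_change_bp):
    """Bin a variant by the absolute size of its length change (insertions > 0, deletions < 0)."""
    size = abs(length_change_bp)
    if size == 0:
        raise ValueError("a length change of 0 bp is neither an indel nor an SV here")
    return "SV" if size >= SV_MIN_BP else "indel"


# Toy length changes in bp, invented for illustration.
toy_calls = [1, -3, 49, 50, -320, 6000]
counts = Counter(classify_variant(c) for c in toy_calls)
print(counts["indel"], counts["SV"])  # 3 3
```

Balanced events such as inversions do not change sequence length, so a real pipeline would classify them from the span of the event instead; this sketch does not cover them.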
An evaluation of each technology’s contribution showed that PacBio sequencing delivers a threefold increase in sensitivity for structural variants compared to Illumina sequencing, likely resulting “from better access to intermediate-sized SVs (50 bp to 500 bp) and improved sequence resolution of insertions across the SV size spectrum,” the team notes. “The long-read sequence data provided us with an unprecedented view of genetic variation in the human genome,” the scientists add. “Using ~15 kbp reads at an average of 40-fold sequence coverage per child, we have been able to span areas of the genome that were previously opaque and discover three to fourfold more structural variation when compared to short-read sequencing platforms.” They estimate that 77% of insertions are routinely missed by variant-calling algorithms based on short-read data.
“This study represents the most comprehensive assessment of structural variation in human genomes to date,” the authors write. “We predict that a move forward to full-spectrum SV detection using an integrated approach demonstrated in this study will increase the diagnostic yield in patients with genetic disease, SV-mediated mutation, and repeat expansions.”
This research will be showcased next week at the American Society of Human Genetics (ASHG) annual meeting in Orlando. Charles Lee will present a talk entitled “Multi-platform Discovery of Haplotype-resolved Structural Variation in Human Genomes” at the PacBio workshop on Wednesday, October 18th at 12:30 pm. In the afternoon, Xuefang Zhao will present poster #1501, “Comprehensive Discovery of Genomic Variation from the Integration of Multiple Sequencing and Discovering Technologies.” Check out the complete list of presentations at ASHG 2017 featuring SMRT Sequencing.
Good news for cloud-loving PacBio users: genomics analysis service Bluebee has now implemented HGAP4 for de novo assembly of SMRT Sequencing data. It creates a fully automated, end-to-end assembly pipeline for genomes of all sizes from unmapped BAM files. The pipeline is available now to all Bluebee users.
This new analysis service expands the use of SMRT Sequencing data beyond the realm of bioinformatics experts. Bluebee’s platform is designed for ease of use and alleviates the pressure of setting up compute clusters or other on-site tech solutions. Bluebee will also handle updating the pipeline as new versions are released, so users can just focus on their genome results.
The Bluebee de novo Assembly Pipeline for PacBio Sequencing is the exact implementation of the PacBio HGAP4 pipeline as provided by SMRT Link (v5.0.1), covering pre-assembly, assembly, and polishing steps. Users can adjust tool parameters if they wish to customize the pipeline.
Bluebee offers a private cloud service for genomics analysis that serves all major European countries and US cities, plus Canada and Asia Pacific. Based in the Netherlands, the company has implemented strict, multi-layered security controls and its platform is compliant with all applicable security and regulatory standards.
Attending the annual American Society of Human Genetics (ASHG) meeting next week? For a demonstration of the pipeline, stop by and visit during our expert hours:
PacBio booth #722 – Thursday, 2:30 – 3:30 PM
Bluebee booth #522 – Friday, 12:30 – 1:30 PM
Check out our full range of ASHG activities – we look forward to seeing you in Orlando!
A new preprint from scientists at the University of Guelph in Canada and the University of Pennsylvania reports the evaluation of SMRT Sequencing with the Sequel System as a replacement for Sanger platforms for amplicon sequencing. They found that long-read PacBio sequencing was highly accurate, exceeded Sanger coverage metrics, and reduced costs by 40-fold.
“A Sequel to Sanger: Amplicon Sequencing That Scales” comes from lead author Paul Hebert, senior author Evgeny Zakharov, and collaborators. The team embarked on this project in the hopes of finding a suitable amplicon sequencing alternative to costly Sanger technology. Short-read sequencers have not succeeded for this application because “the short read lengths and high error rates of most platforms constrain their utility for amplicon sequencing,” they note. “While recent studies have established that Illumina and Ion Torrent platforms can analyze 1 kb amplicons with good accuracy, their need to concatenate short reads creates risks to data quality linked to the recovery of chimeras and pseudogenes.” In addition, cost improvement of these platforms compared to Sanger is just three- to four-fold.
They turned to the Sequel System and circular consensus sequencing (CCS) of amplicons that were indexed with all combinations of 100 distinct forward and 100 distinct reverse primer barcodes. CCS covers the same amplicon several times in a single read to ensure high accuracy; reads from molecules with the same barcode pair are then combined by consensus calling. For a rigorous evaluation, the scientists simultaneously analyzed barcoded amplicons of the mitochondrial cytochrome c oxidase I gene from 10,000 separate DNA extracts, representing more than 5,000 Arthropoda species, in a single SMRT Cell. The PacBio system was thus tested against a range of historically difficult sequencing challenges, from homopolymers to varied GC content, evenness of coverage across isolates, and more. SMRT Sequencing results were compared to those from Sanger technology.
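The 10,000-sample capacity comes from pairing every forward barcode with every reverse barcode: 100 x 100 combinations. The Python sketch below shows that bookkeeping under stated assumptions. The barcode names (F001, R001), extract IDs, and toy reads are hypothetical placeholders, and the grouping step is a simplified stand-in for demultiplexing, not the authors' pipeline.

```python
from collections import defaultdict
from itertools import product

# Hypothetical barcode names; the study used 100 forward and 100 reverse primer barcodes.
forward_barcodes = [f"F{i:03d}" for i in range(1, 101)]
reverse_barcodes = [f"R{i:03d}" for i in range(1, 101)]

# Assign each (forward, reverse) pair to one DNA extract.
pair_to_extract = {
    pair: f"extract_{n:05d}"
    for n, pair in enumerate(product(forward_barcodes, reverse_barcodes), start=1)
}
print(len(pair_to_extract))  # 10000


def group_by_extract(reads):
    """Group (forward_barcode, reverse_barcode, sequence) tuples by their extract."""
    groups = defaultdict(list)
    for fwd, rev, seq in reads:
        groups[pair_to_extract[(fwd, rev)]].append(seq)
    return dict(groups)


# Toy reads, invented for illustration.
toy_reads = [("F001", "R001", "ACGT"), ("F001", "R002", "TTGA"), ("F001", "R001", "ACGT")]
grouped = group_by_extract(toy_reads)
print(sorted(grouped))  # ['extract_00001', 'extract_00002']
```

In practice the barcodes must first be identified within each read, with some tolerance for sequencing errors, before a lookup like this is possible.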
The study found that the Sequel System delivered excellent accuracy and that the technique was robust. “Across this range of templates, SMRT sequencing showed no points of failure,” the scientists report. “SMRT sequences also had a major advantage over their Sanger counterparts as they regularly provided complete coverage for the target amplicon.” Unidirectional Sanger reads, for example, were frequently truncated and bidirectional reads varied noticeably in length, generally reflecting homopolymer runs.
While this project focused on shorter amplicons, the team notes that Sanger technology has known limitations for templates longer than 1 kb because of the need to analyze overlapping amplicons. Even in CCS mode, SMRT Sequencing reads are long enough that multi-kilobase templates can easily be covered several times.
The team reports that sequencing capacity makes the Sequel System particularly attractive for this application. “Because it can characterize amplicon pools from 10,000 DNA extracts in a single run, the SEQUEL reduces costs 40-fold from Sanger analysis,” they write. “Exploitation of this capacity is aided by the fact that data processing is simple.” Unlike Sanger data, which calls for visual inspection of results, or short-read data, “SMRT sequences can be processed with an automated pipeline,” they add.
Today is International Ataxia Awareness Day (IAAD), and we’re proud to be participating in this worthy cause. Ataxia is a group of rare, degenerative neurological diseases with a number of different presentations; many involve muscle tremors, loss of motor skills, and difficulty walking. As many as 150,000 people in the United States have some form of ataxia.
Because there are so many different types of ataxia, one of the most important early steps for those affected is getting an accurate diagnosis. There are several hereditary ataxias, and genetic testing is increasingly useful for pinpointing the exact type affecting a patient.
An ongoing challenge for accurate genetic testing, however, is that some ataxias are associated with repeat expansions, which make these genetic regions difficult to characterize with short-read or even Sanger sequencing technologies. Recently, scientists have made significant leaps forward by applying SMRT Sequencing. Work from Tetsuo Ashizawa of Houston Methodist Research Institute and his colleagues Birgitt Schuele of the Parkinson’s Institute and Karen McFarland of the University of Florida has been particularly impressive. In 2015, they published results from sequencing the complete repeat expansion, ranging from ~5.3 kb to 7.0 kb, in several patients, finding novel interruption motifs that may help explain a specific phenotype. Just this month, they reported using a Cas9 method to snip out the region of interest and long-read PacBio sequencing to characterize it, again speculating that the presence or absence of repeat interruptions has an impact on pathology. “Single molecule sequencing paired with SMRT/Cas9 capture approach allowed us to characterize the genetic composition of the complete repeat expansion which revealed a novel phenotype-genotype correlation for Parkinson’s disease and ATXN10,” Ashizawa and his collaborators wrote.
We congratulate Ashizawa and all the other scientists working so hard to improve outcomes for patients with ataxia. Like any rare disease community, these patients lack resources available to more common disease groups, such as robust patient/caregiver organizations and extensive research funding. On this special day, we ask that you help raise awareness for the people who live with ataxia, and the scientists focused on finding new treatments.
To participate in #IAAD17 and learn more about the symptoms, treatment, diagnosis and different types of ataxia, please visit the excellent website of the National Ataxia Foundation.
We’re pleased to release a new data set, along with an allele-phasing software workflow on GitHub, for those interested in exploring SMRT Sequencing data from an Alzheimer’s disease candidate gene study. Our team collaborated with Integrated DNA Technologies (IDT) to design a 35-gene panel targeting candidate Alzheimer’s disease genes identified as potential genetic risk loci across many GWAS and linkage studies. Long-read PacBio sequencing was applied to brain and skeletal muscle tissue from two individuals diagnosed with Alzheimer’s disease, and a wide range of variants was detected, from SNPs and indels to larger structural variations up to several kilobases in size. Additionally, alleles were successfully phased, which provides a more comprehensive understanding of the biological significance of the variants present in the samples. In one example, the BIN1 gene was phased into two phase blocks across a 62,641 bp region.
The samples were sequenced using the Sequel System (Sequel Chemistry 1.2) and analyzed with our newly updated Phasing Consensus Analysis for Targeted Sequencing Data GitHub repository. Data sets and related files are available on our PacBio DevNet. Captures of 7 kb genomic fragments for brain and skeletal muscle tissues were each sequenced on a single SMRT Cell, yielding roughly 8 GB of mappable data to the human reference genome.
For more about this data collection, don’t miss the upcoming webinar, “Characterizing Alzheimer’s disease candidate genes and transcripts with targeted, long-read, single-molecule sequencing,” hosted by IDT on Wednesday, September 27th. We will dive deep into the project and illustrate how coupling genomic and transcriptomic captures with xGen® Lockdown® probes enables informative results and insights beyond SNPs.
Register now to attend at 7:00 am PDT/10:00 am EDT or at 11:00 am PDT/2:00 pm EDT.
A new publication from scientists at The Rockefeller University and PacBio presents reference-grade, phased diploid genome assemblies for two important avian models for vocal learning, Anna’s hummingbird and zebra finch. Results are expected to help establish genome quality standards for the G10K and B10K sequencing projects, in addition to providing a better foundation for neuroscience studies.
Published in GigaScience, “De Novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads” comes from lead author Jonas Korlach, senior author Erich Jarvis, and collaborators. The team undertook this project to improve the quality of genome assemblies available for these birds, demonstrating that key genes of interest were completely represented in single contigs. Existing assemblies produced with Sanger or short-read sequencing were incomplete and highly fragmented, precluding the comprehensive scientific view required for a deeper understanding of vocal learning.
By incorporating SMRT Sequencing, the team not only raised the bar for assembly quality but also phased the genomes using FALCON-Unzip, a diploid assembly tool. The new zebra finch assembly represented “a 108-fold reduction in the number of contigs and a 150-fold improvement in contiguity compared to the current Sanger-based reference,” the authors write. For hummingbird, the PacBio assembly led to “a 116-fold reduction in the number of contigs and a 201-fold improvement in contiguity over the reference.” Both assemblies had contig N50s greater than 5 Mb. “These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references,” the scientists report, “including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult to sequence regions, complex repeat structure errors, and allelic differences between the two haplotypes.”
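For readers less familiar with the metric, contig N50 is the contig length at which contigs of that size or longer contain at least half of the assembled bases. Here is a minimal Python sketch of the standard definition. The function name and the toy contig lengths are our own, and this is not the evaluation code used in the paper.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths.

    Sort contigs from longest to shortest and return the length of the
    contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assemblies of the same 32-unit genome, invented for illustration.
fragmented = [2, 2, 2, 3, 3, 4, 8, 8]
contiguous = [12, 20]
print(contig_n50(fragmented), contig_n50(contiguous))  # 8 20
```

Fewer, longer contigs push N50 up, which is why a large drop in contig count and a large gain in contiguity tend to go hand in hand.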
The team assessed gene content of the assemblies with CEGMA and BUSCO comparisons. In both cases, the number of complete or nearly complete genes increased. They also used RNA-seq to evaluate the reference genomes, finding that the PacBio long-read assemblies increased “total transcript read mappings compared to the Sanger-based reference … suggesting more genic regions available for read alignments,” they write.
Finally, the scientists conducted in-depth interrogations of four genes particularly important for vocal learning. EGR1, for instance, has gaps in previous zebra finch and hummingbird reference genomes. In both SMRT Sequencing assemblies, though, the gene was fully resolved and spanned in a complete contig. There were similar improvements for DUSP1, FOXP2, and SLIT1.
“We found that the long-read diploid assemblies resulted in major improvements in genome completeness and contiguity, and completely resolved the problems in all of our genes of interest,” the scientists report. “We now, for the first time, have complete and accurate assembled genes of interest that can be pursued further without the need to individually and arduously clone, sequence, and correct the assemblies one gene at a time.”
For more, check out our recent release of Iso-Seq data for hummingbird and zebra finch.