This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Scientists from the University of Hong Kong recently reported results of a head-to-head comparison of long-read and short-read platforms for sequencing and assembly of a bacterial genome. They determined that only SMRT Sequencing was capable of generating highly accurate, complete assemblies. “Completing bacterial genomes should no longer be regarded as a luxury, but rather as a cost-effective necessity,” the team reports.
“PacBio But Not Illumina Technology Can Achieve Fast, Accurate and Complete Closure of the High GC, Complex Burkholderia pseudomallei Two-Chromosome Genome” was published in Frontiers in Microbiology by lead author Jade Teng, senior author Patrick Woo, and collaborators. For this project, scientists compared performance of the PacBio RS II Sequencing System with the Illumina HiSeq 1500. Their target was Burkholderia pseudomallei, which has at least 68% GC content as well as “highly repetitive regions and substantial genomic diversity,” the authors report.
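For readers less familiar with the metric, GC content is simply the share of G and C bases among the called bases of a sequence. The snippet below is an illustrative sketch only and is not taken from the paper; the function name `gc_fraction` and the toy sequence are assumptions made for this example.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases among the A/C/G/T bases of a sequence.

    Ambiguous bases such as N are ignored (an assumption made for this sketch).
    """
    seq = sequence.upper()
    gc_count = sum(1 for base in seq if base in "GC")
    called_count = sum(1 for base in seq if base in "ACGT")
    if called_count == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc_count / called_count


# Toy example: 4 of the 6 called bases are G or C.
toy_sequence = "GGCCAT"
print(f"GC content: {gc_fraction(toy_sequence):.1%}")
```

On a real genome the same calculation would run over every contig, and a value near the 68% cited for B. pseudomallei would mark the organism as GC-rich.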
After sequencing, the team attempted both hybrid and single-source assemblies. Working with Illumina data alone “resulted in a draft genome with more than 200 contigs,” they note, pointing out that the platform’s reliance on PCR amplification is inherently problematic for GC-rich genomes. Three different short-read assemblers were not able to improve results. The hybrid assembly of both sequencers’ data was also “not successful,” producing 74 contigs, the team reports.
Assembling only PacBio data, which was generated from a single SMRT Cell, led to a very different result. The approach “achieved complete closure of this two-chromosome B. pseudomallei genome without additional costly bench work and further sequencing, demonstrating its utility in the complete sequencing of bacterial genomes, particularly those that are well-known to be difficult-to-sequence,” the scientists write. The chromosome contigs of the assembly aligned to the organism’s reference genome with better than 99.9% accuracy. Importantly, the assembly accurately characterized “the number of CDSs and their distributions in each subsystem, four ribosomal operons, the highest number of core and virulence proteins (coverage of query protein sequence and amino acid identity ≥80%), and MLST gene loci,” the team adds.
The Illumina assembly, on the other hand, was unable to resolve these elements. “Extraordinarily high coverage of Illumina reads were observed in several collapsed repeat regions, including regions containing varying copies of mobile element proteins and ribosomal operon,” Teng et al. report. “We reasoned that Illumina sequencing was not able to resolve these repeat regions as their sequence reads were not long enough to span different kinds of repeats with unique flanking sequences.”
The scientists also included an assessment of project cost. “To completely sequence a bacterial genome using Sanger sequencing or the second generation sequencing platforms, the main bulk of the cost, labor and time is spent in the gap-filling phase,” they write. “It has been estimated that when using these second generation sequencing platforms, around 95% of the money and time are spent in completing the last 1% of the bacterial genome.” But the calculation is very different for SMRT Sequencing. “Although the cost per base is more expensive for the PacBio RS II platform compared to short-read sequencing technology, no additional manual work after de novo assembly is required,” the team concludes, “and the benefit of obtaining an accurate number of individual replicons and an intact assembly of repetitive regions and mobile genetic elements justify the initial cost.”
Earlier this year, scientists from Korea reported results from a transcriptome study of Pacific abalone. In this paper, the team used SMRT Sequencing to demonstrate that alternative splicing and gene expression have sex-specific signatures in these organisms.
“Alternative Splicing Profile and Sex-Preferential Gene Expression in the Female and Male Pacific Abalone Haliotis discus hannai” comes from lead authors Mi Ae Kim and Jae-Sung Rhee, senior author Young Chang Sohn, and collaborators. They focused on abalone, a marine gastropod, because of its importance to Korean aquaculture: the species they studied is estimated to represent about 10,000 metric tons of production each year.
As H. discus hannai has grown in economic importance, new genomics resources have become available, including a genetic linkage map and some RNA-seq data. This latest project was designed to glean more information about abalone to provide a clearer view of its biological function. Scientists chose to focus on sex-specific transcriptomes, using the Iso-Seq method to study gene content in male and female members of the species. They analyzed several tissue types (including gonads, muscle, gills, and more) and defined 15,110 protein-coding genes in females and 12,145 in males. Of those, 519 genes in females and 391 genes in males produced alternatively spliced transcripts.
To validate these findings, the team investigated expression profiles for six genes known to be sex-preferential in related organisms. Two of the three female-specific genes and all three male-specific genes were highly expressed in their respective samples. “Taken together, these studies strongly suggest the intactness of the sex-specific isoform DB of the Pacific abalone,” the authors write.
“The information obtained in this study represents the first significant contribution to sex-specific genomic resources, as well as isoform information,” the scientists conclude. “These data will provide an essential genomic reference that could be used for further diverse genetics- and physiology-based research using abalones.”
If you weren’t at the 36th International Society for Animal Genetics Conference in Dublin, you missed more than a chance to drink Guinness and practice an Irish brogue. The PacBio team had a great time at ISAG, learning about the latest in animal science and updating attendees on the advantages of SMRT Sequencing for generating high-quality genome assemblies and annotations.
The conference drew more than 750 scientists from around the world, and we were truly impressed by the quality of research they presented in talks and posters. Long-read PacBio sequencing is already making a difference for scientists in this community, many of whom are focused on improved breeding programs. Genome assemblies powered by SMRT Sequencing were presented for many economically important species, including chicken, sheep, goat, pig, cattle, horse, camel, and Atlantic herring. There were also several presentations featuring PacBio long-read sequencing data for immune region haplotypes such as the leukocyte receptor complex and the major histocompatibility complex.
We hosted a morning seminar that demonstrated how SMRT Sequencing provides comprehensive views of animal genomes and transcriptomes. John Williams from the University of Adelaide presented a preliminary assembly of the water buffalo genome generated with Sequel System data. A member of the International Buffalo Genome Consortium, Williams described limitations in contiguity and completion for previous sequencing efforts that used short-read data for this important livestock animal. Seeking a reference-grade assembly, the scientists turned to PacBio long reads, using FALCON-Unzip to phase more than half of the diploid genome. Though the assembly is not yet polished, Williams reported a stellar contig N50 of 18.7 Mb. The other seminar speaker was our own Emily Hatas, who discussed the chicken genome annotation generated by Richard Kuo at the Roslin Institute. In that project, scientists used the Iso-Seq method with SMRT Sequencing to identify 64,000 transcripts, including more than 17,000 long non-coding RNAs that had not been previously annotated.
As sponsors of the event, we also had fun encouraging attendees to snap creative photos with the toy animals we gave away at our booth. Check out the variety of clever snapshots!
We were delighted to be back at the University of Maryland this summer for our annual East Coast User Group Meeting. The day-long event, preceded by half-day workshops on sample prep and bioinformatics, exceeded our expectations. From the packed session hall to the terrific science and great discussions, the UGM facilitated the exchange of best practices and new suggestions for optimizing SMRT Sequencing performance for a variety of applications. Below is a recap of the day’s highlights, with several of the presentations available to download.
PacBio scientist Aaron Wenger presented the Structural Variant Calling application that is included in the SMRT Link v5.0 software release. The application utilizes the read aligner NGM-LR, and features both a command-line tool called pbsv and a web interface. Noting that most of the genetic difference between any two people lies in structural variation, he showed that short-read sequencers cannot detect the vast majority of these important variants. Wenger demonstrated that even low-coverage SMRT Sequencing can be used to discover structural variants; in an experiment, 10-fold coverage revealed almost 100% of homozygous variants and nearly 90% of heterozygous variants in a human individual.
Michael Schatz from Johns Hopkins University gave a talk entitled “In Pursuit of Perfect Genome Sequencing” in which he walked through three key metrics for evaluating genome quality: correctness (basepair accuracy), completeness (no gaps in the sequence), and contiguity (sequence ordered as on the physical chromosomes). Schatz compared the leading sequencing technologies available today, and explained that PacBio SMRT Sequencing is the most capable technology for all three metrics.
Continuing the human genome theme, Ricardo Mouro Pinto from Massachusetts General Hospital spoke about using SMRT Sequencing to quantify CAG repeat instability in Huntington’s disease. Caused by a CAG repeat expansion, Huntington’s occurs when a person’s genome harbors 40 or more copies of the repeat. Pinto noted that typically, the longer the repeat, the younger the person is at disease onset. The Huntington’s locus is difficult to enrich because it is resistant to PCR amplification. By using Cas9 digestion to perform non-amplification-based target enrichment followed by PacBio sequencing, Pinto’s team was able to capture wild type and disease alleles with no amplification bias. He noted that results are preliminary, and he hopes to expand the number of samples studied to get a better handle on CAG instability.
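To make the idea of counting repeat copies concrete, here is a deliberately simplified sketch that finds the longest uninterrupted run of CAG units in a read. It is not the method Pinto's team used; the function name `longest_cag_run`, the toy read, and the handling of interruptions are assumptions, and only the threshold of 40 copies comes from the talk summary above.

```python
import re

# Threshold taken from the talk summary: 40 or more CAG copies.
EXPANSION_THRESHOLD = 40


def longest_cag_run(read: str) -> int:
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", read.upper())
    # Each matched run is a whole number of 3-base units.
    return max((len(run) // 3 for run in runs), default=0)


# Toy read: flanking sequence around 42 CAG units (hypothetical data).
toy_read = "TTGACC" + "CAG" * 42 + "CCGCCA"
copies = longest_cag_run(toy_read)
print(copies, "copies; expanded:", copies >= EXPANSION_THRESHOLD)
```

A single long read that spans the whole repeat plus its flanking sequence is what makes a direct count like this possible. Real analyses would also need to handle sequencing errors and interrupted repeats, which this sketch ignores.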
Representing the plant community, Hamid Ashrafi and Hamed Bostan from North Carolina State University tag-teamed a presentation on the blueberry genome and transcriptome. The fruit plant naturally occurs in diploid, tetraploid, and hexaploid forms. The scientists generated a high-quality diploid assembly using SMRT Sequencing and noted that long reads were essential to get through the highly repetitive genome. Next, they used Iso-Seq to study several types of tissue from diploid, tetraploid, and hexaploid blueberry plants, finding many transcripts missed by short-read sequence data. Using both genome and transcriptome approaches was particularly important, Ashrafi noted, because SNPs explain only a small portion of natural variation for this plant, and he believes that alternative splicing and structural variants likely contribute a much larger proportion of variation. The team is still analyzing results but said that switching to long reads was “a dream come true.”
On the microbial front, Jethro Johnson from The Jackson Laboratory for Genomic Medicine gave a talk on full-length 16S rRNA sequencing, which is useful for taxa identification. By genotyping or using short-read data, Johnson said, so much of the information in the variable regions of 16S is missed that it often is impossible to accurately classify organisms. So, Johnson turned to SMRT Sequencing and circular consensus sequencing (CCS), which generates highly accurate long reads. Johnson applied CCS for a mock bacterial community of 36 species and found that SMRT Sequencing offered accurate results for identification. In studies of fecal samples, PacBio sequencing was able to provide a unique identification in cases where short-read sequencing generated ambiguous results. The team is now expanding SMRT Sequencing results to include internal transcribed spacer regions.
In a separate presentation, Phillip Tai from the University of Massachusetts Medical School highlighted the use of long-read sequencing for genome population sequencing of adeno-associated viruses. These harmless viruses have gained new interest recently as a vector for gene therapies, so Tai’s lab is interested in analyzing large groups of them to filter out any that would not be ideal vectors. By applying SMRT Sequencing to recombinant AAVs, they generate complete resolution of the vector genome, including the difficult-to-sequence inverted terminal repeats. This accomplishment could have tremendous value in the gene therapy field, he said.
The meeting also included some new tools and protocols from the community. New England Biolabs’ Bo Yan presented SMRT-cappable-seq, a method for characterizing operons across an entire bacterial genome. It involves capping the 5’ end of bacterial primary transcripts and using SMRT Sequencing to produce full-length transcripts. Yan said the protocol increases library prep efficiency and accurately defines and links the transcription start site and transcription termination site (something short reads cannot do). A validation project in E. coli revealed 840 novel operons, extending 40% of annotated operons in RegulonDB. In another talk, Manuel Tardaguila from the University of Florida discussed SQANTI, a new tool to perform quality control for long-read transcripts. The pipeline performs classification, curation, and quantification of transcripts to filter out any artifacts and ensure that scientists analyze only the highest-quality results. SQANTI incorporates PacBio data, a reference genome, and other resources to conduct its rigorous evaluation.
We’d like to thank our hosts for the meeting, the Genomics Resource Center, Institute for Genome Sciences at the University of Maryland, as well as our partners: Advanced Analytical Technologies, Diagenode, and Sage Science. And, of course, thanks to all the scientists who took time out of their busy schedules to make this event a success!
A paper from scientists at the National Marrow Donor Program, Center for International Blood and Marrow Transplant Research, Fred Hutchinson Cancer Research Center, and other institutions reports the use of SMRT Sequencing to characterize the challenging killer cell immunoglobulin-like receptor (KIR) region in eight human genomes. By sequencing full-length fosmids, they found previously unreported haplotype structures.
“Revealing Complete Complex KIR Haplotypes Phased By Long-Read Sequencing Technology” comes from lead author David Roe, senior author Martin Maiers, and collaborators. They targeted the KIR region — which has implications in autoimmune disease, transplantation, infections, and more — because it has historically been very challenging to sequence. Containing as many as 16 genes and pseudogenes, the highly homologous KIR haplotypes are shaped by tandem duplications, deletions, and frequent recombination. “These characteristics of homology, repetitiveness, and structural diversity have made the region difficult to haplotype,” the scientists note. “A sequencing approach that precisely captures the complexity of KIR haplotypes for functional annotation is desirable.”
To that end, they incorporated SMRT Sequencing, which produces reads long enough to span fosmids. “Using this method, we have for the first time comprehensively sequenced and phased sixteen KIR haplotypes from eight individuals without imputation,” the authors report. “Sixteen haplotypes from eight individuals were completely and [unambiguously] sequenced except for two haplotypes whose KIR3DL3 genes were not captured in the fosmid and a small gap in one of the haplotypes, located in a repetitive insertion spanning over 100,000 bp.” Haplotypes were as short as 69 kb and as long as 269 kb, and included four novel structures. The team also uncovered a new gene fusion as well as previously unreported structural variants.
One of the most important elements for resolving this difficult region was eliminating the need to shotgun shear the fosmids prior to sequencing. Because the longest SMRT Sequencing reads exceed the length of a fosmid insert, “it is therefore possible to span an entire fosmid insert with a single continuous read,” the scientists write. “Bypassing the shearing … helped improve the phasing accuracy of the individual fosmid sequences and the high-quality sequences of complete fosmids easily tiled into full haplotypes.”
This workflow made it possible to phase centromeric and telomeric regions, among other accomplishments. “Such completely de novo assembled sequences not only provide the ability to discover and annotate KIR gene alleles at the highest resolution, but also provide value as references, evolutionary informers, and source material for imputation,” the team writes.
We enjoyed attending the annual meeting of the Society for Molecular Biology and Evolution in Austin earlier this month. Some 1,500 people attend SMBE, which this year offered cutting-edge sessions on evolutionary genomics, microbiome dynamics, epigenetics, and much more.
There were several posters and presentations featuring SMRT Sequencing data, most focused on using highly accurate, long-read data to generate or improve reference genomes. The high-quality assemblies we saw are enabling evolutionary biologists to investigate genomic structure and function in a variety of organisms as well as characterize structural variants and other complex polymorphisms that are difficult to detect with conventional technologies.
If you couldn’t attend SMBE, we encourage you to check out these resources for a glimpse of how this community is making use of SMRT Sequencing:
This preprint from Mahul Chakraborty et al. reports a new, reference-grade assembly of an African D. melanogaster strain.
This poster from Fabrizio Ghiselli et al. describes a study of a unique bivalve with two mitochondrial genomes.
This poster from Patrick Reilly et al. shows results from several high-quality de novo assemblies produced with SMRT Sequencing.
This paper from Tracey Ruhlman et al. was the basis for an SMBE poster about characterizing repeat structures in the plastid genome of the flowering plant, Monsonia emarginata.
This poster from Sarah Kingan and Aaron Wenger from PacBio presents structural variation data generated with SMRT Sequencing.
A news article in Science magazine nicely captures the improvements in the quality of genome assemblies made possible by long-read sequencing and other methods. “New technologies boost genome quality” was written by Elizabeth Pennisi and includes interviews with a number of leading scientists.
One of those is Erich Jarvis, the neuroscientist at Rockefeller University who is best known in the genomics community for his work on vocal learning with songbirds and has been instrumental in both the G10K and B10K programs. As he told Pennisi, “The genome quality makes a huge difference in the type of science we can do.” For sequencing through repetitive or other challenging regions, he added, “the long read is always more accurate.”
“He and many other genomics experts are launching a quiet revolution aimed at building better genomes, one made possible by newer sequencing technologies, novel methods for locating sequences on chromosomes, and improved software for piecing DNA together,” Pennisi reports. “In the past 6 months, these approaches have led to a flood of high-quality animal and plant genomes in preprints and published papers.”
The USDA’s Tim Smith is also included, with a look at the impressive goat genome assembly he and his team recently produced. For that effort, he used SMRT Sequencing and complementary approaches, such as Hi-C and optical mapping, for optimal quality and contiguity. The finished assembly, Pennisi notes, “consists of chromosome-length pieces of DNA and only has 492 gaps, a 500-fold improvement over the first goat genome, done in late 2012.”
The article features a chart showing improvements in assembly metrics for three recently published genomes: hummingbird, goat, and maize. We have followed each of those assemblies closely, but there’s something about seeing all the data in one quick chart to really underscore how significant these advances have been — and in such a short period of time. If you have a moment, it’s definitely worth a look.
In a new bioRxiv preprint, scientists from Johns Hopkins present a major step forward in accuracy and completeness for the wheat genome. Their new assembly, generated largely from PacBio data, demonstrates the importance of using long, highly accurate reads for resolving extremely complex, repetitive genomes.
“The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum,” comes from lead author Aleksey Zimin, senior author Steven Salzberg, and collaborators. In launching this project, the team aimed to overcome a longstanding challenge for the wheat research community. “Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall size of more than 15 billion bases,” they write. “Multiple past attempts to assemble the genome have failed.”
The first publication for this species, which came out in 2012, only assembled one-third of the genome. In 2014, a short-read assembly managed to capture two-thirds of the genome in a highly fragmented assembly, while a subsequent short-read-based effort delivered more sequence but in millions of contigs.
For this project, scientists adopted SMRT Sequencing to produce reads long enough to span repetitive elements that are a hallmark of the plant’s genome. The result was phenomenal: “Ours is the first assembly that contains essentially the entire length of the genome, with more than 15.3 billion bases, and its contiguity is more than ten times better than the partial assemblies published in the past,” the authors report.
The team took two approaches to creating the assembly: a hybrid Illumina-PacBio version, and an all-PacBio version. Ultimately, they merged both to create the final assembly, which has a contig N50 of 232.6 kb; the longest contig is 4.5 Mb. The PacBio-only version, produced with the FALCON assembler, relied on 36-fold genome coverage to generate an assembly of 12.94 Gb with a contig N50 of 215.3 kb. “The key factor in producing a true draft assembly for this exceptionally repetitive genome was the use of very long reads, averaging just under 10,000 bp each, which were required to span the long, ubiquitous repeats in the wheat genome,” the scientists note.
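Contig N50, the contiguity metric quoted above, is the contig length at which contigs of that size or longer contain at least half of the assembled bases. The sketch below shows the calculation on made-up contig lengths; the function name `contig_n50` and the toy numbers are assumptions and are not drawn from the wheat assembly.

```python
def contig_n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest and their lengths summed;
    N50 is the length of the contig at which the running total first
    reaches at least half of the total assembly size.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy assembly of 32 bases: the two longest contigs (8 + 8) already cover half.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50:", contig_n50(toy_lengths))
```

Because the metric is weighted by length, a handful of very long contigs can raise N50 substantially, which is why long reads that span repeats tend to improve it.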
Evaluating the assembly’s quality was a tall order given the state of previous assemblies. The scientists compared it to an assembly for a diploid ancestor of bread wheat and found that 99.8% of the smaller genome aligned to the new assembly, offering “strong support for its accuracy” as well as its completeness, they write.
One of the most interesting findings of this effort was the delineation of that ancestral plant’s contributions to the bread wheat genome (known as the wheat D genome). “By aligning this assembly to the draft genome of Aegilops tauschii, the progenitor of the wheat D genome, we were able to cleanly separate the D genome component from the A and B genomes of hexaploid wheat, which is reported here for the first time,” the team explains.
Ultimately, the scientists believe the new wheat assembly offers a significant boost to the wheat community, which has never had the benefit of a well-annotated, high-quality genome for crop improvement efforts. “This represents by far the most complete and contiguous assembly of the wheat genome to date,” the scientists write, “providing a strong foundation for future genetic studies of this important food crop.”
In an effort to improve precision medicine in Chinese populations, Novogene announced plans to build a database of structural variants in 1,000 Chinese individuals using PacBio SMRT Sequencing. Databases that catalog SNVs and small indels have proven invaluable for precision medicine, serving as population controls for rare disease research and providing a list of variants for genetic association studies. Yet most of the base pairs that differ between two human genomes are in structural variants, which are not adequately represented in current databases. Furthermore, current databases do not represent the genetic background of all ethnic populations, particularly the Chinese, who comprise one-fifth of the world’s population.
Novogene will perform the sequencing using a fleet of up to 10 PacBio Sequel Systems, which can produce reads with an average length of 10,000 to 18,000 bp. Long reads are better able to map to repetitive regions of the genome and fully span large variants. Previous studies have shown that long-read sequencing has five times higher sensitivity for discovery of structural variants than short-read sequencing approaches [1]. Structural variants are already known to cause many human diseases, including Carney Complex, Potocki-Lupski Syndrome, ALS, and Smith-Magenis syndrome. Thus, complete measurement of structural variation is required for precision medicine.
Novogene’s structural variant database will include a variety of disease types across the population cohort. In a statement, Novogene Founder and CEO Ruiqiang Li said, “This more revealing and informative database should greatly improve our understanding of disease mechanisms and contribute to the development of novel diagnostic and therapeutic approaches.”
1. Huddleston, J. et al. (2016) Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Research (5), 677-685.
In a Nature Genetics paper, scientists used SMRT Sequencing to detect and compare structural variations in several yeast strains in order to understand evolutionary genome dynamics. They found different rates of evolution among domesticated and wild strains, and suggest that “the influence of human activities” could explain this.
“Contrasting evolutionary genome dynamics between domesticated and wild yeasts” comes from lead author Jia-Xing Yue, senior author Gianni Liti, and collaborators at the Université Côte d’Azur, the Wellcome Trust Sanger Institute, and other institutes. Choosing long reads to facilitate accurate detection of structural variants, they used PacBio sequencing and generated “end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus,” the scientists report. “The raw PacBio de novo assemblies of both nuclear and mitochondrial genomes showed compelling completeness and accuracy, with most chromosomes assembled into single contigs, and highly complex regions accurately assembled.”
According to the team, the final 12 assemblies provided “unprecedented resolution” for analyzing subtelomeric regions, which yielded a detailed look at evolutionary genome dynamics. “In chromosomal cores, S. paradoxus shows faster accumulation of balanced rearrangements (inversions, reciprocal translocations and transpositions), whereas S. cerevisiae accumulates unbalanced rearrangements (novel insertions, deletions and duplications) more rapidly,” the scientists write. “In subtelomeres, both species show extensive interchromosomal reshuffling, with a higher tempo in S. cerevisiae.” The accelerated evolution in baker’s yeast is likely to be at least partly a function of human activity and the human-associated environments to which the organisms have been exposed, they add.
The authors note that this study is an indicator of the utility of SMRT Sequencing for population genomics. Ultimately, they say, their results offer an intriguing new explanation for “why S. cerevisiae, but not its wild relative, is one of our most biotechnologically important organisms.”
The Global Ant Genomics Alliance (GAGA) recently announced that it has adopted SMRT Sequencing as its technology of choice for generating high-quality genome assemblies. The alliance, made up of more than 50 scientists at dozens of institutions around the world, aims to sequence 200 ant species to provide a comprehensive look at genomic diversity across ant genera and to provide the scientific community with a foundation of data to enable decades’ worth of research.
GAGA teamed up with genomic service provider Novogene, which agreed to purchase 10 Sequel Systems earlier this year. In the announcement, GAGA noted that SMRT Sequencing enables “high quality genome assemblies with very few gaps. It is likely to become the dominant technology for de novo reference genome sequencing.”
GAGA also stated in the announcement that choosing SMRT Sequencing “should make sure that ant genomes generated under GAGA will meet future quality standards of journals and repositories.” This is particularly useful in working toward GAGA’s goals of generating a genome-based phylogeny for ants, studying the adaptive significance of physical caste phenotypes, and understanding how ants have adapted to different methods of resource acquisition.
For more information on the alliance, check out GAGA’s website. We look forward to seeing lots of reference-grade de novo genome assemblies coming from this group soon!
A new publication in Clinical Cancer Research from scientists at the Mayo Clinic, University of Minnesota, and other institutions presents results from a study to evaluate androgen receptor (AR) isoforms as biomarkers for chemotherapy resistance in prostate cancer patients. The team used the Iso-Seq method with SMRT Sequencing to better characterize the structures of AR variants, discovering that the exon structure of this prostate cancer driver had previously been misreported due to the limitations of short-read sequencing.
“Androgen receptor variant AR-V9 is co-expressed with AR-V7 in prostate cancer metastases and predicts abiraterone resistance” comes from lead authors Manish Kohli and Yeung Ho, senior author Scott Dehm, and collaborators. The team aimed to expand on previous discoveries about androgen receptor transcription factors that confer resistance to targeted therapies in cases of prostate cancer. Androgen receptor variant AR-V7 was already known to promote resistance, but scientists wanted to see if other elements contributed to this effect.
For the project, researchers combined data from short-read sequencing and SMRT Sequencing, using long reads to capture full-length transcripts. They discovered a significant error in previous studies that highlighted AR-V7. Specifically, AR-V9 includes a cryptic exon that had been thought to be unique to AR-V7. “This work re-annotates AR-V9 mRNA structure, and finds that the role of AR-V9 in therapeutic resistance has been obscured by extensive overlap in mRNA sequence with AR-V7,” the scientists write. “The finding that high AR-V9 mRNA expression in metastases was predictive of primary resistance to the androgen synthesis inhibitor abiraterone indicates that monitoring and inhibition of AR-V9 may be needed to overcome therapeutic resistance.”
The problem with short-range information of any type is that it precludes direct observation of the full transcript, the scientists note. Generating information about small pieces of a transcript necessitates inferring expression levels and relationships, “as is the case for short-read RNA-seq data, quantitative RT-PCR with primers flanking splice junctions, or hybridization of probes to single exons,” they add.
The team analyzed expression in circulating tumor cells collected from 12 patients with castration-resistant prostate cancer (CRPC) who had been treated with androgen receptor-targeted therapies or an androgen receptor antagonist. PacBio sequencing revealed that the 3’ terminal exon for AR-V9 is 2.4 kb, much longer than previous annotations had found. This exon was shared with AR-V7. “Since AR-V7 and AR-V9 proteins are both constitutively active, the overall levels and functional impact of AR-Vs in prostate cancer may be greater than would be anticipated from analyses of either AR-V alone,” the scientists report.
They concluded that “high AR-V9 mRNA expression in CRPC metastases was predictive of primary resistance to abiraterone acetate.”
Two recent papers underscore the importance of using PacBio full-length RNA sequencing to interrogate transcriptomes for major crops. Together, these publications offer compelling evidence that information considered essential for crop improvement programs is too often missed by short-read sequencers.
A team of scientists from the Earlham Institute and other institutions report an improved assembly and annotation of wheat, which has an allohexaploid genome. From lead authors Bernardo Clavijo and Luca Venturini, senior author Matthew Clark, and collaborators, “An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations” came out in Genome Research.
For this project, scientists used the Iso-Seq method to generate SMRT Sequencing data for full-length transcript isoforms from six tissue types. The results allowed the team to discover thousands of genes missed in a previous annotation and to correct thousands more existing gene models. With a much-improved annotation, the scientists were able to identify a set of disease-resistance genes, gluten protein genes that are important for baking quality, and genes associated with useful traits such as plant height and grain yield. Together with the genome assembly they produced, the team reports these are “powerful resources for trait analysis and breeding of this key global crop.”
Separately, “A chromosome conformation capture ordered sequence of the barley genome” was published in Nature by lead author Martin Mascher and senior author Nils Stein from the Leibniz Institute of Plant Genetics and Crop Plant Research and collaborators. They used Iso-Seq data as part of the annotation effort for their new barley reference genome assembly, resulting in about 40,000 high-confidence genes from a total of 16 different tissues. The scientists note that SMRT Sequencing was used to generate full-length transcript data “to support gene calling in general, and the identification of alternative splice forms in particular.” Their analysis of gene families led to finding “lineage-specific duplications of genes involved in the transport of nutrients to developing seeds and the mobilization of carbohydrates in grains.”
Learn more about using the Iso-Seq method for plant and animal genome annotation and alternative splicing identification.
An article published today in Genetics in Medicine from Jason Merker, Euan Ashley, and colleagues at Stanford University reports the first successful application of PacBio whole genome sequencing to identify a disease-causing mutation. (Check out Stanford’s news release here.) The authors describe an individual who presented over 20 years with a series of benign tumors in his heart and glands. The individual satisfied the clinical criteria for Carney complex, but after eight years of genetic evaluation, including whole genome short-read sequencing, experts were still unable to pinpoint the underlying genetic mutation and confirm a diagnosis.
Ultimately, the authors turned to the Sequel System to evaluate structural variants, large genetic differences that involve at least 50 base pairs and are uniquely discoverable with long-read sequencing. This quickly led to the identification of the causative mutation: a 2.2 kb deletion that affects PRKAR1A, the gene involved in Carney complex. This case demonstrates the ability of long-read sequencing on the Sequel System to reveal genetic variation that is inaccessible with short-read technologies and highlights the potential to apply PacBio sequencing to precision medicine.
A human genome has around 20,000 structural variants (differences ≥50 bp) spanning 10 Mb, more base pairs than single nucleotide variants and small indels put together. Because structural variants tend to lie in repetitive regions of the genome and/or are larger than short-read sequencers can span, the vast majority (80%) are identified only by long-read sequencing. This means even so-called “whole” genome sequencing with short reads misses much of the variation in a human genome. 
Figure 1. Structural variation in the human genome. (a) Types of structural variation. (b) Differences between two typical human genomes. (c) Structural variants detected in a typical human genome with PacBio sequencing compared to short-read sequencing.
Carney complex, a multiple neoplasia syndrome, is exceedingly rare, with fewer than 750 cases ever reported. Most individuals with the syndrome have a mutation that inactivates one of the two copies of the gene PRKAR1A. However, in the case reported today, clinical sequencing of PRKAR1A did not reveal any mutations. Then, short-read whole genome sequencing was applied to look for mutations throughout the genome, but it was uninformative. Ashley, Merker and colleagues were then driven to apply PacBio long-read sequencing to evaluate structural variants missed by previous methods.
The Sequel System was used to generate approximately eight-fold coverage of the human genome. Reads were mapped with NGM-LR, and structural variants were called with PBHoney, yielding 6,971 deletions and 6,821 insertions. These were filtered for rare, genic variants associated with disease genes, which left only six candidates for manual evaluation. One of the six variants was a heterozygous 2.2 kb deletion that removes the first coding exon of PRKAR1A. The variant was evaluated with Sanger sequencing in the individual and his parents, which demonstrated that the deletion is a de novo mutation not present in the parents.
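The filtering step described above can be pictured with a minimal Python sketch. It is only an illustration: the record fields, the frequency cutoff, the coordinates, and the gene list are hypothetical and are not taken from the paper, which does not spell out its filters at this level of detail.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class StructuralVariant:
    # All fields are illustrative; a real call set would be read from a VCF file.
    chrom: str
    start: int                  # placeholder coordinate, not a real position
    length: int                 # size of the event in base pairs
    sv_type: str                # "DEL" or "INS"
    gene: Optional[str]         # overlapping gene, or None if intergenic
    population_freq: float      # hypothetical frequency in a control set


def filter_candidates(
    variants: List[StructuralVariant],
    disease_genes: Set[str],
    max_freq: float = 0.01,     # assumed cutoff for "rare"
    min_length: int = 50,       # structural variants are at least 50 bp
) -> List[StructuralVariant]:
    """Keep rare variants that overlap a known disease gene."""
    return [
        v for v in variants
        if v.length >= min_length
        and v.population_freq <= max_freq
        and v.gene is not None
        and v.gene in disease_genes
    ]


# Made-up toy call set: one rare genic deletion, one intergenic insertion,
# and one common deletion in a placeholder gene.
calls = [
    StructuralVariant("chr17", 1000, 2200, "DEL", "PRKAR1A", 0.0),
    StructuralVariant("chr1", 5000, 300, "INS", None, 0.0),
    StructuralVariant("chr2", 7000, 120, "DEL", "GENE_X", 0.35),
]
candidates = filter_candidates(calls, disease_genes={"PRKAR1A", "GENE_X"})
print([v.gene for v in candidates])
```

In this toy input only the rare deletion overlapping a listed gene survives, mirroring how thousands of calls can shrink to a handful that are practical to review by hand.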
Approximately two-thirds of individuals with presumed genetic disorders remain undiagnosed even after short-read exome and whole genome sequencing. It is hypothesized that many of the undiagnosed cases are explained by variants missed by short-read sequencing technologies, most notably structural variants, variants in GC-rich regions of the genome, and repeat expansions. The study published today provides a proof-of-principle demonstrating that PacBio long-read sequencing identifies previously overlooked structural variants, even at relatively low sequencing coverage. We are excited by upcoming studies that will evaluate many more cases to elucidate the improvement in diagnostic yield from long-read sequencing, and to demonstrate that precision medicine requires a comprehensive view of genetic variation.
 Merker JD, et al. (2017). Genetics in Medicine.
 Huddleston J, et al. (2017). Genome Research, 27(5):677-685.
 English AC, et al. (2015). BMC Genomics, 16:286.
 Biesecker LG, et al. (2011). Genome Biology, 12(9):128.
Last month, we co-hosted the 2nd annual SMRT Leiden conference with Leiden University Medical Center. SMRT Leiden featured three days of excellent presentations, including one day focused on bioinformatics. If you missed it, we’ve prepared this quick recap to cover the highlights. In addition, several of the presentations are available to download, and you can check out tweets from day 1 and day 2.
The meeting kicked off with a clinical angle: Eric Schadt from the Icahn School of Medicine at Mount Sinai gave a keynote talk about capturing the clinically actionable genome. Noting that we are in an age of data explosion, Schadt presented ideas for how to take advantage of that to improve human health — and ultimately to model individual health trajectories for optimal decision-making in the clinic. At Mount Sinai, Schadt said genetic testing is becoming more comprehensive, citing examples like a pan-ethnic carrier screen and pregnancy-related testing that starts before conception and follows the infant after birth. SMRT Sequencing is important for these efforts because of its excellent accuracy and long reads, which enable phasing variants and resolving complex regions. By combining technologies, Schadt said his team improved carrier screening to deliver meaningful results to more than 60% of patients, compared to fewer than 7% with traditional testing. Schadt’s colleague Robert Sebra also gave a clinical talk, in which he said that the ideal approach will be whole genome sequencing with long reads to capture challenging genes, pseudogenes, and other important but complex elements. While that is not yet practical, he noted that previous efforts in the lab to sequence whole human genomes took a year and 1,000 SMRT Cells on the PacBio RS II; with the Sequel System, that now takes 50 SMRT Cells and can be completed in two weeks.
Two keynote presentations focused on genome evolution. Shinichi Morishita from the University of Tokyo spoke about bacterial metagenomics, for which PacBio sequencing improved the detection rate for mobile elements and methylation motifs. He also works on centromeres, for which he uses PacBio sequencing with the Hi-C method. Jason Underwood from the University of Washington presented the use of long reads to compare apes and humans in order to find elements specific to humans. His team is using SMRT Sequencing to generate high-quality primate genomes, such as the recent Susie3 assembly, and to annotate them. These projects have improved structural variation detection and increased discovery of human-specific events. Underwood said high-quality PacBio assemblies would be available in the next year or two for gibbon, bonobo, and rhesus macaque.
The Max Planck Institute’s Stefan Mundlos kicked off the afternoon with a keynote about using topologically associated domains, CRISPR, and other approaches to elucidate skeletal disease. Following that, several presentations focused on the use of SMRT Sequencing to resolve challenging regions in the human genome. Adam Ameur from Uppsala University is using PacBio sequencing for targeted and whole-genome methods to resolve repeats, low frequency mutations, and more. As part of the Swedish 1000 Genomes Project, his team has sequenced two whole genomes with SMRT Sequencing so far, finding about 20,000 structural variants in each one — 80% of which were missed by short-read sequencing. From NUI Galway, Brian McStay presented on the genomic architecture of regions on human acrocentric chromosomes. These regions are difficult to sequence due to repetitive DNA, but he was able to target and sequence them successfully with NimbleGen capture and SMRT Sequencing. Our own Tyson Clark spoke about using amplification-free targeted enrichment for analyzing genomic regions associated with repeat expansion disorders.
A number of great talks focused on plants, animals, and microbes. Felix Bemm from MPI Tübingen focused on Arabidopsis, in which structural variation was being missed with short-read sequencers. By incorporating PacBio sequencing, his team was able to explore NLR complexity; they also produced 10 platinum-grade genomes for a deep dive into structural variants. The University of Rochester’s Amanda Larracuente is studying Y chromosome dynamics in Drosophila. By adding SMRT Sequencing data to their pipeline, her team improved coverage for elusive Y genes and now have as much as 40% of the Y chromosome in contigs. Wasp parasites captured our attention in a talk from Ken Kraaijeveld at VU Amsterdam. He studied asexual and sexually reproducing parasites to understand the differences in mutation accumulation in their genomes, finding that transposable elements may play a role in reduced recombination.
From the University of Oslo, Ave Tooming-Klunderud spoke about targeted sequence capture in a cod study. Focusing on a 300 kb region of hemoglobin genes, the team analyzed eight species and optimized the sample prep protocol with barcoding, which resulted in using just nine SMRT Cells. Richard Kuo from the University of Edinburgh presented data from using the Iso-Seq method to understand chicken transcriptomes; the approach improved detection of lncRNAs, transcripts that were missed in previous annotations, and splicing diversity. Finally, Thomas Otto from the Wellcome Trust Sanger Institute gave a keynote talk about long-read sequencing of parasite genomes, with a focus on Plasmodium falciparum. Otto noted that the first assembly for this genome cost $18 million (that was back in 2002), and today on the PacBio RS II System it only takes five SMRT Cells. Because the genome has only 19% GC content, SMRT Sequencing is more successful at calling intergenic regions that can’t be mapped using short-read data.
We really enjoyed two talks about immune-related genes. Marvyn Koning from our LUMC host spoke about B cells and the adaptive immune system. Sequencing has been difficult because of the high mutation rate across many locations, but Koning developed a method called ARTISAN PCR to anchor primers in one region that didn’t change. With PacBio sequencing, the approach yields much higher accuracy than short-read sequencing. Julie Karl from the University of Wisconsin-Madison talked about sequencing the complex MHC region in macaques. For this work, SMRT Sequencing has been essential to achieve the accuracy needed for a genomic region that’s even more complex than the human MHC locus.
We were treated to some proteogenomic talks as well. In a keynote presentation, Gloria Sheynkman from the Dana-Farber Cancer Institute spoke about approaches to understand the complexity of splice diversity and the proteins they produce. One method is ORF-seq, which measures the isoforms in various functional groups and relies on SMRT Sequencing to characterize the isoforms. And NKI’s Gosia Komor presented a proteogenomic analysis of alternative splicing for a colorectal cancer biomarker study. With the Iso-Seq method, the team is building up the reference set of isoforms to find those associated with cancer risk.
Finally, our own Lance Hepler offered a look at new applications for SMRT Sequencing, including new software for detection of minor variants and structural variants and multiplexed whole genome sequencing for microbes. The new Juliet tool for characterizing minor variant frequency and pbsv for increased structural variant sensitivity will both be included in SMRT Link 5, due to be released this summer. Hepler also noted that with the multiplexing protocol a single SMRT Cell on the Sequel System will be able to sequence up to 12 microbes with genomes of ~4.5 Mb; the protocol works for the PacBio RS II System as well.
We are thankful to all of the fantastic speakers who shared their research, to our gracious host Yahya Anvar and the entire LUMC, and to everyone who attended the event. We look forward to seeing you again next year in Leiden!
A large group of scientists published a new reference genome assembly for maize. It was generated with SMRT Sequencing and other technologies, and represents a major leap forward in accurately portraying and annotating the genome of this important crop.
“Improved maize reference genome with single-molecule technologies” comes from lead author Yinping Jiao, senior author Doreen Ware, and collaborators at Cold Spring Harbor Laboratory, the USDA ARS, and many other institutions. They embarked on the project because the existing reference for maize, based on Sanger technology and released in 2009, “is composed of more than 100,000 small contigs, many of which are arbitrarily ordered and oriented, markedly complicating detailed analysis of individual loci and impeding investigation of intergenic regions crucial to our understanding of phenotypic variation and genome evolution,” the authors explain. A higher-quality assembly would be extremely useful for crop breeding and selection programs as well as basic research.
The new reference is based on PacBio sequencing data, which led to a preliminary assembly with fewer than 3,000 contigs and a contig N50 of 1.2 Mb. Scientists then layered in data from an optical map, a BAC-based minimum tiling path, and a high-density genetic map. The end result: a high-quality 2 Gb assembly with just 2,522 gaps. “The new maize B73 reference genome has 240-fold higher contiguity than the recently published short-read genome assembly of maize cultivar PH207,” they report.
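For readers unfamiliar with the contig N50 metric quoted above, here is a minimal Python sketch of the standard definition: the contig length at which contigs of that size or longer hold at least half of the assembled bases. The function name and the toy lengths are made up for illustration and have nothing to do with the maize data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for safety


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, total 300
print(n50(toy_contigs))  # 80 + 70 reaches half of 300, so this prints 70
```

Because the metric is driven by the longest contigs, merging many small fragments into a few long ones raises N50 sharply, which is why it is a common shorthand for assembly contiguity.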
To assess the new assembly, the team compared it to the previous Sanger-based reference. That “revealed more than 99.9% sequence identity and a 52-fold increase in the mean contig length, with 84% of the BACs spanned by a single contig from the long reads assembly,” the authors write. ChIP-seq analysis showed that centromeres in the new assembly were mostly intact and correctly placed. The new assembly fixed many known mis-oriented regions in the reference genome, and an updated annotation consolidated gene models with the support of 111,000 full-length transcripts from SMRT Sequencing. “Our reference assembly also vastly improved the coverage of regulatory sequences, decreasing the number of genes exhibiting gaps in the 3-kb region(s) flanking coding sequence from 20% to <1%,” the team adds.
The scientists interrogated transposable elements, which are well known and important in the maize genome. The previous maize annotation had few intact representations of these elements; for long terminal repeat retrotransposon copies, not even 1% were complete. The team incorporated “a new homology-independent annotation pipeline” and uncovered 1.2 Gb of intact retrotransposons, about half of which were “nested retrotransposon copies disrupted by the insertion of other transposable elements,” they note. “Characterization of the repetitive portion of the genome revealed more than 130,000 intact transposable elements, allowing us to identify transposable element lineage expansions that are unique to maize.” This information will contribute to a better understanding of the diversity and evolution of maize varieties.
In closing, the scientists write, “Our improved assembly of the B73 genome, generated using single-molecule technologies, demonstrates that additional assemblies of other maize inbred lines and similar high-quality assemblies of other repeat-rich and large-genome plants are feasible.”
In a preprint available from bioRxiv, scientists from the University of Lausanne and Swiss Institute of Bioinformatics present the first SMRT Sequencing results from isolates of the fungal pathogen Candida glabrata. “Comparative Genomics Of Two Sequential Candida glabrata Clinical Isolates” comes from Luis Andre Vale-Silva, Emmanuel Beaudoing, Van Du T. Tran, and Dominique Sanglard.
The study involved two C. glabrata samples collected at different times from an HIV-positive patient diagnosed with oropharyngeal candidiasis. Scientists initially turned to short-read sequencing to analyze the genomes, which were of particular interest because C. glabrata is known to rapidly develop resistance to antifungal therapies. However, because sequence data had to be aligned to a reference genome, the assemblies “did not reflect actual genome rearrangements of strains DSY562 and DSY565,” the scientists report. “We therefore undertook an alternative genome sequencing approach using PacBio technologies enabling de novo assembly of large reads.”
The PacBio assemblies featured contig N50s longer than a megabase for each isolate, “highlighting the high quality of the assemblies,” the authors note. “Assembled contigs almost reconstituted the entire set of chromosomes that is known from the [reference] genome.” They generated data for both nuclear and mitochondrial genomes.
The team was eager to learn more about adhesins in these strains, since adherence to host cells is an important factor in increased virulence of C. glabrata. “It is estimated that C. glabrata contains 63 ORF with adhesin properties,” the scientists write, underscoring their interest in genome-wide data.
Based on SMRT Sequencing data, “we determined the presence of more than 100 adhesin-like genes in both DSY strains, which was not yet anticipated from other genome-wide studies,” the authors report. “This number exceeds by far the numbers published for [the reference genome] and therefore suggests that an expansion of this gene family occurred in our isolates.”
Vale-Silva et al. note that further studies will be needed to determine whether this pattern holds up for other strains. “Since no equivalent C. glabrata genome assembly has been yet published using a PacBio approach, we can still not confirm whether or not the investigated isolates constitute a unique case,” they conclude. “Now that a more extensive repertoire of adhesins is available from our studies, such analysis may be undertaken in the future.”
Interested in microbial genomics? Check out our latest SMRT Grant opportunity today.
The Joint Genome Institute recently announced results from a project that used SMRT Sequencing to generate high-quality genome assemblies and detect epigenetic modifications for fungal species that represent the earliest branches of that kingdom’s phylogeny. The work was done as part of the 1000 Fungal Genomes Project, which aims to better characterize a diverse range of fungal species.
Published in Nature Genetics, “Widespread adenine N6-methylation of active genes in fungi” comes from lead author Stephen Mondo, senior author Igor Grigoriev, and collaborators at JGI and other institutions. The major finding is that N6-methyldeoxyadenine (6mA) is seen at the earliest stages of fungal evolution, in groups that have not been studied much in genomics. “By and large, early-diverging fungi are very poorly understood compared to other lineages. However, many of these fungi turn out to be important in a variety of ways,” Mondo said.
The study involved analyzing 16 fungal genomes with SMRT Sequencing, which generates genome-wide epigenetic data while it sequences the DNA. Scientists discovered that 6mA, which is present at low levels in many plant and animal species, was much more common in these early-diverging fungi. As many as 2.8% of adenine bases were methylated, “far exceeding levels observed in other eukaryotes and more derived fungi,” they report in the paper. The previous highest 6mA rate was observed in Chlamydomonas reinhardtii, an alga with 0.4% of its adenines methylated.
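The percentages quoted above are ratios of methylated adenines to all adenines. As a rough sketch of that arithmetic, assuming a hypothetical set of positions already flagged as 6mA and looking at a single strand only (the function and the toy sequence are not from the paper):

```python
from typing import Set


def fraction_methylated(sequence: str, methylated_positions: Set[int]) -> float:
    """Fraction of adenine bases on this strand flagged as methylated.

    Positions are 0-based indices into `sequence`; flagged positions that
    are not adenines are ignored.
    """
    adenines = [i for i, base in enumerate(sequence.upper()) if base == "A"]
    if not adenines:
        return 0.0
    hits = sum(1 for i in adenines if i in methylated_positions)
    return hits / len(adenines)


toy_sequence = "GATCAATTGA"   # adenines at indices 1, 4, 5 and 9
toy_calls = {1, 9, 2}         # index 2 is a T, so it is ignored
print(f"{fraction_methylated(toy_sequence, toy_calls):.1%}")
```

A genome-scale calculation would count both strands and apply quality filters to the modification calls, but the ratio being reported is the same.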
The team also found that the presence of 6mA and 5-methylcytosine (5mC) is inversely correlated, and that 6mA appears to boost gene expression while 5mC suppresses it. “Our analysis has shown that 6mA modifications are associated with expressed genes and is preferentially deposited based on gene function and conservation, revealing 6mA as a marker of expression for important functionally-relevant genes,” Grigoriev said.
In the paper, the authors write, “Our results show a striking contrast in the genomic distributions of 6mA and 5-methylcytosine and reinforce a distinct role for 6mA as a gene-expression-associated epigenomic mark in eukaryotes.”
To learn more, watch this presentation from Mondo on 6mA in fungal genomes.
This week the HudsonAlpha Institute for Biotechnology and the University of Georgia are co-hosting CROPS 2017, a meeting focused on genomic technologies and their use in crop improvement and breeding programs. The three-day event attracts over 200 attendees involved in research and breeding for a range of important crop species. PacBio was proud to be a sponsor of the conference.
HudsonAlpha’s Jeremy Schmutz kicked off the meeting with an introductory talk about trends in plant genomics, expanded transcriptome resources, and the improved representation of all plant genomes with many new genome assemblies. Schmutz, who also works with the Joint Genome Institute, highlighted efforts to generate higher-quality, more complete reference genomes for economically important plants used for food, fiber, or biomass. He told attendees that SMRT Sequencing has made a significant difference in those efforts. His team has generated high-quality plant assemblies with the PacBio RS II Sequencing System and more recently with the Sequel System for several cotton genomes as well as Brachypodium, peanut, sorghum, and more.
According to Schmutz, the big difference with SMRT Sequencing is that the assemblies it produces are of high enough quality to be useful for functional studies such as genotype-phenotype associations, which are essential for breeding and selection programs. “We can now generate high-quality reference genomes for most plants,” he said. “PacBio has made that possible.” Schmutz noted that SMRT Sequencing has been successful even for very challenging plant genomes with highly repetitive elements, GC-rich regions, areas of high and low complexity, and of course varying degrees of ploidy.
Schmutz, his colleague Jane Grimwood and their team at the Genome Sequencing Center have made the transition to the higher-capacity Sequel System, which enables comparable results with lower project costs. Schmutz said they are generating, on average, 4 Gb to 8 Gb per SMRT Cell. For a project sequencing the 2.6 Gb Brazilian cotton genome, he said, the preliminary assembly is at least as good as a previous cotton assembly generated on the PacBio RS II — but data collection took just five weeks, compared with almost five months for the older assembly. Even in this preliminary stage, the Sequel System cotton assembly has an impressive contig N50 of 2.2 Mb.
Schmutz noted that the point of the CROPS meeting wasn’t to present new assemblies for their own sake; it’s all about how these resources are being used by the plant community. By including scientists studying many different crop species, he hopes to accelerate the uptake of new genome-based approaches to as many research groups as possible. The meeting covered topics such as breeding and selection strategies, functional work to identify the mechanisms underlying important traits, and automated phenotyping. “We really focus on the translation of genomics directly into crop improvement platforms,” Schmutz said.
A sweeping new report on Klebsiella pneumoniae sequence data from scientists at the Houston Methodist Research Institute, Weill Cornell Medical College, and other institutions found more diversity than expected in strains of the pathogen in a Texas population. The publication also indicates the emergence of a virulent, antibiotic-resistant strain of this organism.
Published in mBio, “Population Genomic Analysis of 1,777 Extended-Spectrum Beta-Lactamase-Producing Klebsiella pneumoniae Isolates, Houston, Texas: Unexpected Abundance of Clonal Group 307” comes from lead author Wesley Long, senior author James Musser, and collaborators.
K. pneumoniae is a dangerous source of infection, often acquired in hospitals and increasingly resistant to antibiotics. Scientists launched this study to contribute new genomic information that might be used to inform new therapeutics. They sequenced nearly 1,800 isolates collected from patients in the Houston Methodist Hospital system over four years, and then selected five key strains for deeper analysis with SMRT Sequencing.
Previous Klebsiella studies in the U.S. had determined that clonal group 258 was dominant in this country. In this project, however, scientists found that this group represented just a quarter of isolates. More than 35% of strains belonged to clonal group 307, with isolates collected in a number of hospitals. The remaining cases represented a number of different strain types. “We discovered that CG307 strains have been abundant in Houston for many years,” the scientists report, noting that this strain is as virulent as pandemic K. pneumoniae strains. “Our results may portend the emergence of an especially successful clonal group of antibiotic-resistant K. pneumoniae.”
The team used SMRT Sequencing to generate reference-grade genome assemblies and annotations for five strains “chosen to represent regions of the phylogenetic tree for which existing reference genomes deposited in [publicly] available databases were lacking,” the authors report. “In addition, genomes containing the blaNDM-1 and OXA-48 genes… were chosen to allow more in-depth analysis of these important strains.”
All five strains were represented in closed genome assemblies, with two to five plasmids for each. Analysis revealed that a reference strain previously collected in Pittsburgh and one of the Houston isolates “are lineally descended from a common ancestor organism,” the scientists write.
Sequencing efforts were followed up with transcriptome analysis and mouse models to produce data that could be relevant for the development of new therapies. The team also used the whole genome data to generate “classifiers that accurately predict clinical antimicrobial resistance for 12 of the 16 antibiotics tested,” they write. “We conclude that analysis of large, comprehensive, population-based strain samples can assist understanding of the molecular diversity of these organisms and contribute to enhanced translational research.”