This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
A team of scientists from Australia, Canada, and the US published fascinating new work that may help explain gene expression patterns seen in prostate cancer. In the course of the project, they used SMRT Sequencing and found a novel fusion transcript linking two genes with high sequence identity.
“Identification of a novel fusion transcript between human relaxin-1 (RLN1) and human relaxin-2 (RLN2) in prostate cancer” was published in Molecular and Cellular Endocrinology by lead author Gregor Tevz, senior author Colleen Nelson, and a number of collaborators. In it, the scientists attempted to untangle expression signals from two relaxin genes, which were formed by a duplication event sometime before humans and apes branched off. The genes play a role in reproduction and are most highly expressed in ovaries and prostate. “Outside normal physiology, RLN2 is a promoter of cancer progression in several different types of cancers,” the scientists note.
Previous studies were unable to distinguish between the two genes, so this team deployed long-read sequencing and the Iso-Seq method from PacBio to sort out reads from RLN1 and RLN2 in LNCaP cells. Using their results along with publicly available data, they made a number of discoveries. For one thing, they found that most prostate cancer cell lines underrepresent RLN1, which is highly expressed in both normal and cancerous tissue. “LNCaP cells best reflect the RLN1 expression observed in [prostate cancer] and is the most relevant cell line for the use in further studies of RLN1 biology,” the team reports.
They also detected a novel fusion transcript that incorporates large swaths of both RLN1 and RLN2, and were able to design primers that distinguish the fusion from the two parent genes. “The fusion transcript encodes a putative RLN2 with a deleted secretory signal peptide indicating a potentially biologically important alteration,” the scientists write. They determined that RLN1 and the fusion transcript are inversely regulated by androgens, and suggest that follow-up studies will be helpful to elucidate the mechanisms governing this response.
While we’re on the subject of cancer, don’t forget that the abstract deadline for the 2016 AACR annual meeting is coming up on December 2. We’re already looking forward to hearing about more great discoveries at that conference!
We’re excited about a new Nature paper from the winners of our 2014 “Most Interesting Genome in the World” SMRT Grant program. “Single-molecule sequencing of the desiccation tolerant grass Oropetium thomaeum” comes from lead authors Robert VanBuren and Doug Bryant along with senior author Todd Mockler at the Donald Danforth Plant Science Center, as well as a number of collaborators at other institutions. In it, the authors report a virtually complete genome of Oropetium thomaeum, a grass with an estimated genome size of 245 Mb and the handy ability to regrow even after extreme drought once water becomes available.
The scientists believe that a better understanding of the plant’s genome could shed light on the mechanisms underpinning these so-called resurrection plants, and ultimately enable the engineering of crop plants to withstand severe drought and stress.
For this study, the team worked with about 72x coverage of the Oropetium genome generated by the PacBio system. That’s “equivalent to <1 week of sequencing time and <$10k in reagents,” according to the paper. Based on HGAP and Quiver, the resulting assembly covered 99% of the genome in 625 contigs, with an accuracy of 99.99995% and a contig N50 length of 2.4 Mb.
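For readers less familiar with the contig N50 statistic quoted above, here is a minimal sketch of how it is commonly computed. The function name and the toy contig lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(contig_lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together account for at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), not the Oropetium assembly.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70: the two longest contigs cover 150 of 300
```

A higher N50 means more of the assembly sits in long contigs, which is why the statistic is used as a shorthand for contiguity.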
VanBuren et al. note that the contiguity of the assembly sets it apart from draft genomes produced from short-read sequencers. “Most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds,” the scientists write. Because the assemblies are so fragmented, “they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements (TEs), centromeres, telomeres and haplotype-specific structural variations.”
Instead, SMRT Sequencing is pushing new limits to characterize those elements in the Oropetium genome, with its predicted 28,446 protein-coding genes and a significant proportion of repeat regions. The authors noted that “the largest tandem array contains five identical and one partial 9 kb repeats collectively spanning 51 kb; this is approaching the theoretical limit given the current read-length distributions of PacBio.” The assembly includes telomere and centromere sequence, long terminal-repeat retrotransposons, tandem duplicated genes, and other difficult-to-access genomic elements. In addition, the scientists produced the full chloroplast genome in a single contig that includes “~25 kb of inverted repeat regions which typically collapse into a single copy during assembly,” they report.
“The Oropetium genome showcases the utility of SMRT sequencing for assembling high-quality plant and other eukaryotic genomes,” the scientists note, “and serves as a valuable resource for the plant comparative genomics community.”
Mendelspod host Theral Timpson recently interviewed Professor Steven Marsh, Director of Bioinformatics at the Anthony Nolan Research Institute, a UK-based organization dedicated to improving the outcomes of bone marrow transplantation and host to the world’s first bone marrow registry. Prof. Marsh and his team have dramatically improved the resolution of HLA typing — one of the methods used for matching compatible donors with transplant recipients — using long, accurate reads from PacBio sequencing. Their fascinating conversation covers the past, present, and future of HLA typing — highlights are below.
Short History of HLA Typing — There’s a Lot More Diversity than We Thought
When Marsh entered the field 30 years ago, HLA typing was performed with serology, and there were just 119 known HLA antigens. “We thought 119 was a lot of diversity,” he says. With the advent of genomic tools in the 1990s, researchers have had to evolve their practices for typing as more and more became known about the nature of HLA genes. “We’ve really realized that these genes are not just polymorphic, they are really hyper-polymorphic,” he says. The HLA B gene alone has 4,000 variants, Marsh notes. “The only way to do proper HLA typing in this day and age is to do sequencing,” he says.
Enter Long Reads and Exquisite Haplotypes
Using PacBio sequencing technology, Anthony Nolan aimed to extend its sequencing from a couple of gene exons to cover the full HLA genes and capture phasing information. “We’re seeing exquisite haplotypes … all the way through the HLA region,” Marsh says, noting that this gives them “very high resolution typing and very high allelic specificity.”
Marsh says he has been offered free NGS machines from other vendors, but for him, “those technologies would be a distraction.” The MHC/HLA genes are very GC-rich, he explains, making it difficult to use short-read sequencing technologies because of their high systematic error rates. “You cannot assign phase across the whole gene sequence for some allele combinations,” he says.
For Marsh, the future lies with the long-read sequencing capabilities of PacBio. “For me, it’s groundbreaking technology,” he says. One example of the unique capabilities provided to the Anthony Nolan team by PacBio sequencing is 3.5 kb contiguous sequences for HLA Class I genes, including all of the exons and introns, as demonstrated in a recent publication.
Complete, High-Resolution Typing — The Way Forward
Marsh is using the PacBio platform exclusively for his sequencing program and is already seeing the benefits of high-resolution typing. His goal: to speed up the process and improve matching precision to save lives. Anthony Nolan is the first group in the world to take this strategy to the clinic, and is using multiplexing to make the process more cost effective. “We really believe in [the PacBio technology], and we believe it will make an impact for patients,” he says.
The scientists at Anthony Nolan continue to gain deeper knowledge about HLA genes and plan to expand their focus to other relevant immune-related regions, such as KIR. They will also continue to explore other important genes within the MHC locus, such as MIC-A.
Click here to listen to the full podcast.
This week the Festival of Genomics comes to the West Coast, and we’re excited to be a founding sponsor of the Front Line Genomics organization. Not only is it our first chance to show off the new Sequel System in our home state, but there will also be a number of great talks reporting SMRT Sequencing results. Here are some of the presentations to consider if you’re attending the event:
Wednesday, November 4
Ali Bashir, Icahn School of Medicine at Mount Sinai
Will Salerno, Baylor College of Medicine
Robert Sebra, Icahn School of Medicine at Mount Sinai
Thursday, November 5
Tyson Clark, Pacific Biosciences
Maria Nattestad, Cold Spring Harbor Laboratory
We hope you’ll stop by booth #11 to get the tour of our new Sequel System. You can also see the PacBio team lacing up our running shoes for a good cause on Wednesday morning at 10:00 a.m.: once again we’ll be participating in the Race the Helix event, a fundraiser for the Greenwood Genetic Center. (At the first Festival of Genomics, held in Boston in June, our Race the Helix team won the best-dressed award — quite the feat when you’re sprinting on a treadmill for all you’re worth!)
A new publication reports the discovery and analysis of a nightmare bacterium that’s genetically resistant to all commercially available classes of antibiotics.
The paper, “Stepwise evolution of pandrug-resistance in Klebsiella pneumoniae,” came out this month in Scientific Reports from Nature. Lead authors Hosam Zowawi and Brian Forde, along with senior author David Paterson and several collaborators, studied an isolate recovered from the urine of an 87-year-old patient who was hospitalized in the United Arab Emirates last year. They used SMRT Sequencing to characterize the strain and its genetic mechanisms for drug resistance.
That strain, MS6671, “was found to be non-susceptible to all antibiotics tested, which includes cephalosporins, penicillins, carbapenems, aztreonam, aminoglycosides, ciprofloxacin, colistin, tetracyclines, tigecycline, chloramphenicol, trimethoprim-sulfamethoxazole and fosfomycin,” the authors report. They note that carbapenem-resistant Enterobacteriaceae (CRE), including Klebsiella and E. coli, are lethal in almost half of patients with bloodstream infections. “The ‘golden era’ when modern medicine saved lives through antibiotic treatment is under serious threat,” they add.
The scientists sequenced the isolate with the PacBio RS II system and performed de novo genome assembly using HGAP and Quiver. The genome includes a circular chromosome about 5.5 Mb long, as well as five circular plasmids and a linear plasmid prophage, the team reports. The circular chromosome is similar to that of a strain of Klebsiella known for its hypervirulence.
Assembly in hand, the team sought the genetic basis for the strain’s broad resistance to antibiotics. They detected a number of acquired antibiotic resistance genes, a novel variant of a gene that appears to confer heightened resistance to carbapenems, and repeated insertions of mobile elements linked to resistance to colistin, an antibiotic used as a last resort. “Our findings provide the first description of pandrug-resistant CRE at the genomic level, and reveal the critical role of mobile resistance elements in accelerating the emergence of resistance to other last resort antibiotics,” the scientists write. According to the paper, this is the first time that anyone has demonstrated resistance to colistin due to the insertion of a mobile element carrying carbapenem resistance. The authors attribute that partly to SMRT Sequencing, which can accurately sequence complex resistance elements that would confound short-read sequencers.
“Critically, elucidation of the complete K. pneumoniae MS6671 genome using long-read sequencing enabled the context of multiple, identical carbapenem resistance elements to be determined,” the team reports. “Based on this analysis we propose a model for the development of pandrug-resistance in this K. pneumoniae isolate, whereby mobile resistance determinants are responsible for driving additional resistance.”
Zowawi et al. report that as of six months after the patient’s hospitalization, no other cases of this strain had been detected at the facility. “However, the occurrence of this strain in the Arabian Gulf is of great significance,” they write, noting that travel from the region to India, Europe, and the US is common. “The potential for international transfer of multidrug-resistant bacteria emphasizes the need for global surveillance efforts as one part of a strategy to control antibiotic resistance.”
Carbapenem-resistant bacteria have already been cited by health authorities as an urgent threat to human health. “The emergence of this highly resistant strain, in a clone that has proven capable of causing outbreaks, raises this threat level even higher,” the authors conclude.
Richard Roberts, Nobel Laureate and Chief Scientific Officer of New England Biolabs, offers his thoughts on the utility of methylation data for understanding prokaryotes. In his words:
“Please run SMRT Analysis to detect methylation in your prokaryotic PacBio data.
Most bacteria and archaea encode DNA methylases, many of which are known components of restriction-modification systems. Usually, these are quite specific in terms of the sequences they recognize; the restriction component becomes a key defense mechanism preventing phages, plasmids, and other DNA elements from infecting the cell.
Until recently, it was quite difficult to determine the recognition sequences of these methylases. For most organisms, we had no idea whether the genes we could detect in the genome were active or not. Now, thanks to the properties of the DNA polymerase used during SMRT Sequencing, we can accurately locate the positions of m6A and m4C along the genome and sometimes can deduce the position of m5C. By analyzing the sequence context of these modified bases, we can deduce motifs that are the recognition sequences for the various methylases encoded in the genomes. Increasingly, we can then accurately match the genes with the motifs they produce to enable precise, experimentally-determined annotation for those genes.
Further progress in this area will depend on our ability to gather as much experimental data as we can; to improve the algorithms for calling the motifs accurately from the raw PacBio reads; and to improve our ability to match the DNA methylase genes in a genome with the PacBio motifs that are found experimentally. The public availability of motif data produced by running SMRT Analysis after each PacBio run can be enormously beneficial. Even better, if the raw sequence reads are also available, then this can help the development of better algorithms for data interpretation.
There is another terrific use of the methylation data for anyone interested in trying to transform these strains: While the presence of methylated motifs — and hence methylase genes — does not mean that an active restriction system is present, very often it does, offering some information about how one might protect DNA to be used for transformation before it is introduced into the cell.
I encourage everyone to think ‘methylation’ when using PacBio systems to sequence bacterial and archaeal genomes. The current results of such methylation analysis can be found in REBASE by clicking on the blue PacBio icon. This also has a link through which you can submit your methylation motifs to REBASE.”
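The idea of relating modified bases to sequence motifs, as Roberts describes above, can be illustrated with a small sketch. This is not how SMRT Analysis calls motifs. It is a simplified, forward-strand-only example that assumes you already have a list of positions called as modified, and it asks what fraction of a candidate motif's sites are methylated. The function names, the toy genome, and the positions are all hypothetical.

```python
def motif_sites(genome, motif, offset):
    """Yield the 0-based genome position of the base at `offset` within
    every (possibly overlapping) forward-strand occurrence of `motif`."""
    start = genome.find(motif)
    while start != -1:
        yield start + offset
        start = genome.find(motif, start + 1)


def methylated_fraction(genome, modified_positions, motif, offset):
    """Fraction of motif occurrences whose target base was called modified."""
    sites = list(motif_sites(genome, motif, offset))
    if not sites:
        return 0.0
    modified = set(modified_positions)
    hits = sum(1 for site in sites if site in modified)
    return hits / len(sites)


# Toy data (hypothetical): GATC occurs three times; the A is at offset 1.
toy_genome = "TTGATCAAGATCCGGATCTT"
toy_modified = [3, 9, 6]  # positions 3 and 9 fall on the A of a GATC site

print(list(motif_sites(toy_genome, "GATC", 1)))               # [3, 9, 15]
print(methylated_fraction(toy_genome, toy_modified, "GATC", 1))
```

A fraction near 1 across a whole genome would be consistent with an active methylase recognizing that motif, while a low fraction suggests the candidate motif is wrong or incomplete. A real analysis would also have to handle both strands and degenerate bases.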
Following on the heels of characterizing 18 Mst77Y genes that were tandemly duplicated within a 96 kb region (Krsticevic FJ, et al., 2015), scientists from institutes in Brazil, Austria, and the United States recently published a study in which they also used the Drosophila melanogaster data release from PacBio to characterize a region of the Y chromosome that had never before been accessible.
In a paper published in PNAS, entitled “Birth of a new gene on the Y chromosome of Drosophila melanogaster,” lead author Antonio Bernardo Carvalho, senior author Andrew Clark, and collaborators detail their discovery of a gene duplicated from an autosome. “We emphasize the utility of PacBio technology in dealing with difficult genomic regions,” the authors write. “PacBio produced a seemingly error-free assembly of the FDY region, something that has eluded us for years of hard work.”
The 55 kb region, which consists of several pseudogenes as well as the newly discovered functional FDY gene, has been challenging to sequence and assemble since it exists only on the Y chromosome and is full of highly repetitive sequence. Some 75% of its length, the scientists report, is made up of transposable elements.
Their discovery was worth the wait. Unlike mammalian Y chromosomes, which are thought to evolve primarily by gene loss, the Drosophila Y chromosome appears to be the result of millions of years of gene gains. The team demonstrates that the new gene they detected, named FDY for flagrante delicto Y, was formed about 2 million years ago in a single duplication event of the gene vig2 and its flanking sequence from chromosome 3R. That flanking sequence originally included four other genes, “but they became pseudogenes through the accumulation of deletions and transposable element insertions, whereas FDY remained functional, acquired testis-specific expression, and now accounts for ∼20% of the vig2-like mRNA in testis,” the scientists report. Today, FDY shares 98% sequence identity with its vig2 parent.
The paper details the team’s effort to sequence the FDY region, using RT-PCR, clonal sequencing, and publicly available genome assemblies. Most existing assemblies did not fully cover the region. “Fortunately … the PacBio [MHAP] assemblies covered not only FDY, but also substantial flanking regions,” the scientists write. With that resource, they had their first view of the full sequence of the region. By comparing it to Sanger and Illumina sequence data, they concluded that the PacBio assembly is complete and accurate.
Carvalho et al. went on to figure out when FDY likely appeared in the genome. Their sequence divergence analysis suggests that the duplication occurred once, about 2 million years ago. The gene was found in samples of D. melanogaster from around the world, but does not appear in the fly’s closest relatives.
“Hence a female-biased gene (vig2) gave rise to a testis-biased gene (FDY),” the authors write. “This seems to be a case of gene duplication followed by neofunctionalization, the first reported, to our knowledge, for the Drosophila Y.”
During the final days of the ASHG meeting last week in Baltimore, a number of scientists offered great presentations based on data generated with SMRT Sequencing, including an entire session on building platinum genomes. We’ve rounded up the highlights here:
Karyn Meltz Steinberg from Washington University’s McDonnell Genome Institute spoke about building a platinum human assembly from single-haplotype genomes. Her team defines “platinum” as covering at least 98% of the sequence with every contig associated with a chromosome. They use long-read PacBio sequencing for de novo sequencing and assembly, followed by scaffolding with BioNano Genomics or Dovetail Genomics technology. When necessary, they then perform PacBio sequencing of BACs for targeted tasks such as gap-filling. Using CHM13 as a case study, she walked through specific genomic regions and assembly challenges, both for short- and long-read data. By combining BioNano mapping with PacBio sequence data, they produced a hybrid assembly with 254 contigs, compared to 1,590 contigs for the initial PacBio assembly lacking BioNano mapping.
Bobby Sebra from the Icahn School of Medicine at Mount Sinai talked about an effort to resolve regions in the human genome — such as complex structural variants — that have not been addressed by NGS or Sanger sequencing. Working with the NA12878 genome, Sebra and his colleagues combined PacBio and Illumina sequence data with BioNano mapping. The resulting assembly filled 28 gaps in the latest human reference genome and featured a multi-megabase contig N50 length. The comparison to GRCh38 confirmed previous studies suggesting that tandem repeats and other structural variants are underrepresented in the reference genome; long-read sequencing can effectively characterize these regions. Sebra noted that many challenging regions in the human genome have implications for pharmacogenomics or disease associations, and that detailing these regions carefully will be important for clinical utility of genomics.
In that same session, Justin Zook from the National Institute of Standards and Technology presented on progress at the Genome in a Bottle consortium, including some upcoming reference genomes from Han Chinese and Ashkenazi Jewish family trios. These new genomes have been generated with a number of sequencing technologies, including ones from PacBio, BioNano, 10X, Complete Genomics, Oxford Nanopore, and others. GIAB has already released some reference materials, which scientists are using to help validate variant calls for their own genome assemblies. Zook mentioned tools produced by the CDC and underway at the Global Alliance to allow scientists to compare sequencing data to what other projects have reported. They’re also working on analysis tools to show confidence scores for structural variant calls.
In a separate session, Kiana Mohajeri from the University of Washington reported on a region of chromosome 8 that features the largest known inversion variant in the human genome; it spans several megabases and includes several segmental duplications. Seeking to determine the evolutionary history of this region and to better understand the variation found in human genomes, the team sequenced more than 70 BAC clones with SMRT Sequencing. They produced a gap-free 6.2 Mb tiling path with 99.999% accuracy — a far more complete and contiguous sequence than the human reference genome has for this region. The tiling path shows four inversion-associated repeats with 98% sequence identity flanking the internal inversions. By comparing the region to other primate genomes, they theorize that it was formed between 200,000 and 800,000 years ago, but note that the oldest of the repeats appears to be 19 million years old.
During the Wednesday afternoon sessions of last week’s ASHG conference, several speakers provided helpful insights about their use of SMRT Sequencing for a range of applications. Highlights included the following:
Yao Yang, a researcher at the Icahn School of Medicine at Mount Sinai, discussed the development of an assay to genotype the CYP2D6 gene to inform drug dosing in patients. CYP2D6 metabolizes 20-25% of all medications, including antidepressants, antipsychotics, and opiates. There are more than 100 known variants, which include gene deletions and duplications. Variants can have profound impacts on how patients metabolize drugs, with some individuals being ultra-rapid metabolizers and others being poor drug metabolizers. The development of a simple and reliable typing assay has been challenging because CYP2D6 has a highly homologous pseudogene. Yang and his collaborators developed a targeted full-length PCR protocol to amplify both the gene and pseudogene, as well as companion bioinformatics tools to remove random errors (ALEC) and predict the expected phenotype from genotype data (CYP VCF Translator). He shared results showing that the assay is highly reproducible and capable of recapitulating known genotypes in well-characterized samples like NA12878. Furthermore, they uncovered novel CYP2D6 alleles in NA16688 and ASIAN048, which had variants in an intron and exon, respectively. Most impressively, they were able to resolve alleles in NA17084 and other samples in which results from other technologies were discrepant. They foresee using this same approach to develop targeted assays for other similarly challenging genes, and eventually combining these into a multiplexed panel of clinically important genes.
Stuart Cantsilieris from the Eichler lab at University of Washington presented work demonstrating the utility of long-read sequencing for understanding the range of structural variant alleles present in the Complement Factor H gene cluster. The CFHR locus is a well-known hotspot for structural variation, but short-read data has, at best, only provided a rough map of the density of structural variants in the region, and can’t resolve haplotypes or define precise breakpoints. The team sequenced BAC clones with the PacBio RS II, not only to resolve human alleles, but also to chart the evolution of the region and map bases under the most selective pressure by sequencing a range of non-human primates. Based on what they learned with the PacBio-sequenced alleles, they developed molecular inversion probes (MIP) to enable rapid screening of CFHR genes in patient cohorts. Structural variation in the CFHR locus is linked to a number of diseases, including age-related macular degeneration (AMD) and lupus. Interestingly, the same variant can confer risk to one disease and protection from the other. With the MIP screening tool, they have collected variant information from a large patient cohort and hope to better understand how the revealed genotypes relate to patient phenotypes.
Hagen Tilgner from Stanford University explained how long-read technology enables insights in transcriptome studies that were not previously possible. Using PacBio long reads to sequence a mixed human tissue sample, Tilgner and colleagues identified many novel protein-coding isoforms. Later, by sequencing a human trio sample, they were able to phase distant SNPs using PacBio reads, something that was not possible with short-read technology. Finally, using long reads, they analyzed human brain samples and found many significant exon pairings, most of them in coding regions. Tilgner emphasized that with long reads, a “phased [brain] proteome will now become possible,” potentially leading to novel biological discoveries.
Maria Nattestad from Cold Spring Harbor Laboratory described using a PacBio system to sequence the genome and the transcriptome of the SK-BR-3 breast cancer cell line. For the genome sequencing, the PacBio data produced a mean read length of 9 kb and a max read length of 71 kb with an average of 72X coverage. To understand the complex genomic rearrangements, Nattestad and her colleagues developed several tools to detect long range structural variations. Using SMRT Sequencing, they were able to identify the extremely complex and variable translocation occurring between the Her-2 oncogene locus on chromosome 17 and chromosome 8. Importantly, the translocation between chromosome 17 and 8 produced several fusion genes that were validated via PacBio Iso-Seq transcriptome sequencing. Using Iso-Seq analysis, they identified 17 fusion genes that were supported by both DNA and RNA evidence. Out of the 17 fusion genes, 13 were previously reported and four were novel. “The genome informs the transcriptome,” she said, where PacBio long reads help identify complex genome translocations and gene fusions.
The PacBio workshop at ASHG 2015 featured talks from two leaders in human genomics, Rick Wilson of Washington University and Richard Gibbs from Baylor College of Medicine. Mike Hunkapiller, CEO of Pacific Biosciences, opened the workshop with a historical perspective of human genome sequencing, starting with the Human Genome Project. While advances have been made in technology, throughput and cost reductions, the quality of genomes hasn’t kept pace with decreases in cost, he noted. This is why Hunkapiller was particularly proud to share the news of the company’s launch of the Sequel System, which offers SMRT Sequencing and long reads at seven times the throughput of the PacBio RS II and roughly half the cost, making it feasible to use the system for de novo assembly of high-quality human genomes. He also stated that the platform has the capacity to scale over time to handle increasingly higher-density SMRT Cells, pointing toward a future where de novo human genomes will become both practical and routine.
Rick Wilson titled his talk “Of reference genomes and precious metals” and walked the audience through definitions and standards for the various quality levels for de novo assembled human genomes, e.g., platinum, gold, and silver. He noted that this was a good topic for this session because of the important role PacBio has played in the community’s work to create reference-grade genomes. For example, PacBio technology has enabled them to sequence additional genomes (CHM1, CHM13) to a very high quality level. Although these sequences were essential for further refining the GRCh38 reference build, he stated that the current reference genome is still not optimal for some highly polymorphic and complex regions of the genome, and does not adequately represent diverse ancestries.
Wilson outlined their definition of a ‘gold’ genome as a high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions – created with PacBio reads to perform de novo assembly, a scaffold created using BioNano and/or Dovetail aligned to reference, and BACs to fill targeted regions and shore up gaps. The list of gold genomes in progress includes the Yoruban, Puerto Rican, Han Chinese, CEU, and Luhya. A ‘platinum’ genome is a contiguous, haplotype-resolved representation of the entire genome, two of which currently exist for the CHM1 and CHM13 hydatidiform moles. While ‘silver’ definition standards are to be determined, this category generally covers non-trio genomes produced with PacBio and BioNano mapping, and no BAC library.
Richard Gibbs talked about the transition to genomic medicine, which hasn’t been as simple as people would like due to such issues as the incomplete reference genome, the difficulty in characterizing some variation, and the lack of knowledge about the function of some genes. At Baylor, most of the human genome sequencing is done for children with Mendelian disorders. He said that among 7,000 samples processed using short-read exome sequencing, only about 25% of these cases are solved. The relatively low diagnosis rate is likely due to structural variation and other regions not captured by short reads.
He discussed some ways to get at structural variation, including PacBio sequencing with the PBJelly and Parliament analysis routines, using as little as 10-fold PacBio coverage. Using these methods, they are closing gaps in the genomes of various species; in the sheep genome, for example, he noted that they have closed 70% of gaps with PacBio reads. He also mentioned the use of PBHoney to identify inconsistencies between reads and the reference, and that long-range capture strategies using a combination of Nimblegen and PacBio are ‘going beautifully so far.’
To close the workshop, Jonas Korlach, Chief Scientific Officer at PacBio, built on Hunkapiller’s comments by talking about the technology waves that have followed the initial human genome sequencing project, where we are today, and where we are going. Today, we are in what Korlach calls the 4th wave, where more comprehensive whole-genome re-sequencing is occurring, and we are nearing the 5th, when we will actually be able to free ourselves from reference genomes and sequence everything de novo.
Korlach also touched on some of the new developments PacBio is working on, which include amplification-free target enrichment methods, using Cas9 enzyme for targeting, and sequencing native DNA. Other progress will come through the ability to use PacBio sequencing to phase alleles and more comprehensively capture all sizes and types of variants into haplotigs (contiguous haplotype-sequence blocks). Barcoding samples for isoform (Iso-Seq) sequencing and allele-specific methylation analyses are also in the works.
Watch the recording of the entire workshop session.
Next-generation sequencing has many people excited about the prospect of the $1,000 genome. However, recent discoveries show that short-read sequencing technologies miss important genomic elements, driving scientists to look for an alternative approach. Mark Gerstein, co-director of the computational biology and bioinformatics program at Yale, argues that the true $1,000 human genome has yet to arrive. In a recent conversation with Mendelspod host Theral Timpson, he discussed some of the important, deep technical questions that must first be addressed. This is the second in a series of podcast interviews focused on long-read sequencing, and we have included some highlights from the conversation below.
Pseudogenes and the non-coding portion of the human genome
For Gerstein, “the majority of $1,000 genome sequencing focuses mostly on SNPs — on the 3 to 4 million single nucleotide variants that distinguish every person.” While valuable, these SNPs do not represent the full extent of variation. “The non-coding regions are our most glaring omission — there’s a lot more to know,” Gerstein said. “There’s so much of our genome and we’re focusing myopically on genes.”
On the topic of pseudogenes, once thought of as junk DNA, Gerstein said, “these pseudogenes also carry a much better record of our history. Since many are functional, large chunks can be transcribed and maybe even translated. In essence, transcribed pseudogenes function as non-coding RNA, carrying out a regulatory function — not as a protein-coding gene, but as a non-coding RNA.” We also now know that the human genome has about as many pseudogenes as genes, and sometimes even more since they are under much less constraint, Gerstein noted.
What is making it possible to look at these more difficult regions is the increasing sophistication of sequencing technologies. “Integrating Pacific Biosciences technology has a lot of promise in genomic sequencing, by allowing us to fill in regions and provide high-quality sequencing,” Gerstein said.
Big data and the challenge of privacy
The mass sequencing of the human genome is producing vast quantities of sequence data. Gerstein told Timpson that there are “three challenges in thinking about how to organize large amounts of genomic information: agreeing on our fundamental, biological understanding of the right structure of the genome; genome interpretation; and privacy, which dramatically complicates queries. This is a dominating issue and will likely circumscribe our progress.”
Gerstein broke down the privacy challenge into two distinct topics: “There’s the legal bit that has to do with ethics, legal, and regulatory structures,” he said, “and there’s the technical bit. Is it meaningful to encrypt millions of genomes? Can you query across them?” He made the point that genomics has long been the poster child for open data, but that privacy issues are now introducing new hurdles. “It runs into a wall when it comes to an individual patient’s genome or records, which is more about individual protections, which is thornier,” he said. “I hope we’ll get to a point where people think they should own their own data — not a company, a doctor, or a hospital.”
The Mendelspod interview series on this topic continues, and we will keep you posted with highlights as new podcasts come out.
As part of our effort to support the National Institutes of Health and the Genome Reference Consortium (GRC) in creating platinum genomes for the research community and improving the reference genome, in 2014 we generated 54X SMRT® Sequencing coverage of the CHM1 cell line, derived from a human haploid hydatidiform mole, using our P5-C3 chemistry, and made it publicly available through the SRA database at NCBI.
The CHM1 dataset was quickly taken up by researchers eager to use long, unbiased reads to identify regions of the genome prone to structural variation and to fill in sequence gaps in the GRC-maintained human genome reference. Mark Chaisson and Evan Eichler used PacBio® CHM1 data to resolve 26,079 euchromatic structural variants at the base-pair level, 85% of which were novel. Furthermore, they were able to close or extend 55% of the remaining gaps in GRCh37 [Chaisson et al. (2015) Resolving the complexity of the human genome using single molecule sequencing. Nature. 517, 608-611]. At the Advances in Genome Biology and Technology (AGBT) 2015 GRC workshop, Karyn Meltz Steinberg and Tina Graves-Lindsay from the McDonnell Genome Institute at Washington University presented the use of PacBio CHM1 data as part of GRC efforts to fill in gaps in GRCh38. During her talk, Graves-Lindsay presented a high-level comparison of several assemblies of the PacBio CHM1 data using a number of newly developed long-read assembly tools, including MHAP by Adam Phillippy, Dazzler by Gene Myers, and Falcon by Jason Chin.
As PacBio CEO Mike Hunkapiller was listening to the talks, he realized that by upgrading the dataset, he could support not only the community’s effort to create a high-quality haploid human genome assembly and improve the reference, but also foster innovative genome assembly tools. Jason Chin notes, “Right now there are many approaches to whole genome assembly which are similar but have subtle differences. We need to evaluate what methods are the best for moving the field forward. Having a common dataset is useful to compare methods.” As it seemed the developer community had converged on this haploid cell line as a useful lingua franca for comparing different assembly pipelines, CHM1 data with the improved read length and accuracy of the newest P6-C4 chemistry would give bioinformaticians a new benchmarking opportunity, while advancing the goals of a platinum haploid genome assembly and resolving gaps and errors in the reference assembly.
Following up on Hunkapiller’s promise at AGBT, PacBio released a second CHM1 dataset to NCBI in September with ~60x coverage using P6-C4 chemistry. The dataset was generated with the new 30 kb sample prep protocol, and has a read length N50 of 19 kb. In the intervening months, several bioinformatics groups have been working with the new data and Chin has now uploaded his assembly results to NCBI to share with the community. The new assembly has a contig N50 of 26.9 Mb, with half of the genome contained within 30 contigs (contig L50). Regarding summary statistics, however, Chin emphasizes, “Genome assembly is a complex process, and no single statistic can sufficiently describe the results. Many different aspects of an assembly need to be evaluated to ensure high-quality results, including overall contiguity, completeness, the prevalence of mis-assemblies, and base-level accuracy. Releasing the whole assembly will allow all the experts within the community to fully understand the strengths and weaknesses of different approaches and determine how to move the field forward.”
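For readers less familiar with the summary statistics quoted above, here is a minimal sketch of how N50 and L50 are conventionally computed from a list of contig (or read) lengths. The function name and the toy lengths are illustrative assumptions and are not taken from Chin's assembly; as Chin notes, these numbers describe contiguity only and say nothing about mis-assemblies or base-level accuracy.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or read lengths.

    N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the total assembly size. L50 is how many contigs that took.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy contig lengths in Mb (hypothetical values, for illustration only).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_contigs)
print(f"N50 = {n50} Mb, L50 = {l50} contigs")
```

In this toy case the two longest contigs already cover half of the 300 Mb total, so the L50 is 2 and the N50 is the length of the second contig.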
Chin’s current assembly was created with Myers’ Daligner and the Falcon assembler developed at PacBio.
Figure 1. Jason Chin’s new CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mbp and 109 Mbp long, respectively.
A highlight of the CHM1 assembly Chin submitted to NCBI is the near-complete assembly of the q arm of chromosome 6 in a contig 109 Mbp long. Another contig of 107 Mbp spans more than two-thirds of the chromosome 2 q arm. Using the same publicly available dataset, Phillippy and Sergey Koren, now at NHGRI, are planning to submit their own CHM1 assembly to NCBI in the next month. This assembly will be generated using different assembly tools, namely the MHAP method developed by Konstantin Berlin and co-authors [Berlin, K et al. (2015) Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Nature Biotech. 33, 623-630], paired with Celera® Assembler. We think it will be very useful for the community to have access to both assemblies to comment on the strengths and weaknesses of the different approaches, or to compare these assemblies to their own efforts. These two submissions can be seen as part of a communal work in progress toward finding the best and most general approaches to large genome assembly. In addition, we hope other researchers will be able to use this dataset to further their own assembler development work.
There are multiple ways to learn more about all the work being done with the updated CHM1 data during ASHG.
- Register to hear our workshop on Wednesday, October 7, from 1:00-2:30 PM EDT either in person or streaming, where Rick Wilson will highlight work he has done at the McDonnell Genome Institute developing high-quality references using both the CHM1 and CHM13 cell lines in a talk entitled “Of Reference Genomes and Precious Metals” (Sheraton Inner Harbor Hotel, Chesapeake Ballroom I/II/III, 3rd Floor).
- Attend the GRC workshop ahead of ASHG on Tuesday, October 6, from 1:00-4:00 PM (Convention Center, Room 349, Level 3).
- Attend the DNAnexus workshop on Thursday, October 8, from 1:00-2:30 PM (Convention Center, Room 345, Level 3), where Tina Graves-Lindsay will share her work combining PacBio and BioNano CHM1 and CHM13 data to generate assemblies with extremely high scaffold N50s.
- See Karyn Meltz Steinberg give a talk during the Platinum Genomes session on Friday, October 9, at 2:15 PM (Convention Center, Room 316, Level 3) entitled “Building a Platinum Assembly From Single Haplotype Human Genomes Generated From Long Molecule Sequencing,” in which she will present work resolving regions of the genome associated with large, repetitive sequences and exhibiting complex allelic diversity.
To download all the CHM1 P6-C4 raw data in compressed, archived HDF5 format, click here.
To review and download individual run data, click here.
For Chin’s most recent CHM1 assembly contigs using the above data, click here.
We’re looking forward to next week’s American Society of Human Genetics (ASHG) 2015 annual meeting in Baltimore, Maryland, the year’s biggest scientific meeting focused on human genetics. SMRT® Sequencing will be featured in 36 scientific presentations, as well as our lunchtime workshop. Even if you’re not attending in person, you can join our workshop virtually to learn more about our newest SMRT Sequencer, the Sequel™ System, and about the latest uses of SMRT Sequencing for human biomedical applications.
Our workshop, “Addressing Hidden Heritability through Long-Read Single Molecule, Real-Time (SMRT) Sequencing” will be held on Wednesday, October 7, from 1:00-2:30 p.m. Eastern Time at the Sheraton Inner Harbor Hotel, Baltimore. The event will be hosted by Michael Hunkapiller and Jonas Korlach from Pacific Biosciences, and include talks by Richard Gibbs from Baylor College of Medicine and Richard Wilson from Washington University in St. Louis. Sign up to attend in person or virtually.
The depth and breadth of scientific talks presented at ASHG this year demonstrate how long-read SMRT Sequencing is opening up new frontiers by helping the genome sequencing community create gold-standard genome references, characterize complex regions, resolve structural variation, and unlock isoform diversity. Highlights include the following podium presentations:
Wednesday, October 7
Long read single-molecule real-time (SMRT) full gene sequencing of cytochrome P450 2D6 (CYP2D6) #27
- Yang, Icahn School of Medicine at Mount Sinai
Comprehensive genome and transcriptome structural analysis of a breast cancer cell line using PacBio long read sequencing #14
- Nattestad, Cold Spring Harbor
Evolution and structural diversity of the complement factor H related gene cluster #47
- Cantsilieris, University of Washington School of Medicine
Friday, October 9
Building a platinum assembly from single haplotype human genomes generated from long molecule sequencing #227
- Meltz Steinberg, Washington University
Building a Better Human Genome Reference and Targeting Structure using Single Molecule Technologies #228
- Sebra, Icahn School of Medicine at Mount Sinai
Genome in a Bottle: You may have sequenced, but how well did you do? #229
- J.M. Zook, National Institute of Standards and Technology
A Diploid Personal Human Genome Reference from Diverse Sequence Data – A Model for Better Genomes #232
- K.C. Worley, Baylor College of Medicine
Saturday, October 10
Full-length mRNA sequencing uncovers a widespread coupling between transcription and mRNA processing #71
- S.Y. Anvar, Leiden University Medical Center
Additional Activities at ASHG
Research conducted using SMRT Sequencing will also be featured in at least 28 poster presentations. Check out the full list of PacBio-related ASHG meeting research and activities. Attendees can also visit the PacBio booth (#907) to learn more about the new Sequel System.
We hope to see you in Baltimore!
We are excited to announce our newest Single Molecule, Real-Time sequencer, the Sequel™ System. Watch this short video to learn about this exciting evolution in SMRT® Sequencing.
The Sequel System provides higher throughput, more scalability, a reduced footprint and lower sequencing project costs compared to the PacBio® RS II System, while maintaining the benefits of SMRT technology. The core of the Sequel System is the capacity of its redesigned SMRT Cells, which contain one million zero-mode waveguides (ZMWs) at launch, compared to 150,000 ZMWs in the PacBio RS II. Active individual polymerases are immobilized within the ZMWs, providing windows to observe and record DNA sequencing in real time.
With about seven times as many reads per SMRT Cell as the PacBio RS II, customers should be able to realize lower costs and shorter timelines for sequencing projects, with approximately half the up-front capital investment compared to previous technology. The Sequel System occupies a smaller footprint — less than one-third the size and weight — compared to the PacBio RS II. Since the new system is built on the company’s established SMRT Technology, most aspects of the sequencing workflow are unchanged.
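The "about seven times" figure follows from the ZMW counts quoted above. The short check below is a sketch that assumes reads per SMRT Cell scale in proportion to the number of ZMWs (that is, a similar fraction of ZMWs yields a usable read on both systems); that proportionality is an assumption made for illustration, not a stated specification.

```python
# ZMW counts per SMRT Cell as quoted in the announcement.
SEQUEL_ZMWS = 1_000_000
RS_II_ZMWS = 150_000

# Assumption: reads per cell are proportional to the ZMW count.
zmw_ratio = SEQUEL_ZMWS / RS_II_ZMWS
print(f"Sequel / RS II ZMW ratio: {zmw_ratio:.2f}")
```

A ratio of roughly 6.7 is consistent with the rounded "about seven times" in the text.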
The Sequel System is ideal for projects such as rapidly and cost-effectively generating high-quality, whole-genome de novo assemblies for larger genomes, such as those of humans, plants, and animals. It can provide characterization of a wide variety of genomic variation types, including those in complex regions not accessible with short-read or synthetic long-range sequencing technologies, while simultaneously revealing epigenetic information. The system can also be used to generate data for full-length transcriptomes and targeted transcripts using the company’s Iso-Seq™ protocol. The Sequel System’s increased throughput should also facilitate applications of SMRT Technology in metagenomics and targeted gene applications for which interrogation of larger numbers of individual DNA molecules is important.
In today’s press release, Michael Hunkapiller, Ph.D., CEO of Pacific Biosciences, commented: “We are extremely proud to introduce the Sequel System, which provides access to the existing benefits of SMRT Sequencing, including long reads, high consensus accuracy, uniform coverage, and integrated methylation information — a set of core attributes first pioneered with the PacBio RS. The system’s lower cost and smaller footprint represent our continued commitment to leveraging the scalability of our technology and the unique characteristics of SMRT Sequencing.”
“We will continue to support our PacBio RS II customers, and we expect to introduce improvements in sample prep, sequencing chemistry, and software that will extend the performance of that system, as we have done each year since the initial commercialization of the PacBio RS in 2011 and the PacBio RS II in 2013. We expect to make similar, substantial performance improvements each year for the Sequel System,” added Dr. Hunkapiller. “In addition, the Sequel architecture provides the ability to scale throughput by substantially varying the number of ZMWs on future SMRT Cells, thereby optimizing throughput and operating costs for specific applications.”
On Display at the ASHG Annual Meeting
We will showcase the Sequel System in our booth (#907) at the American Society of Human Genetics annual meeting taking place in Baltimore, Maryland, beginning October 6, 2015.
Whether or not you are attending the meeting, you can still attend our workshop, “Addressing Hidden Heritability through Long-Read Single Molecule, Real-Time (SMRT) Sequencing,” on Wednesday, October 7, from 1:00-2:30 p.m. Eastern Time. The event will be hosted by Michael Hunkapiller and Jonas Korlach from Pacific Biosciences, and include talks by Richard Gibbs from Baylor College of Medicine and Richard Wilson from Washington University in St. Louis. Those attending the conference in Baltimore can register here. We will also offer live streaming and access to the recording.
For more information about the Sequel System, go to pacb.com/sequel.
In the first podcast of a new series on the applications of long-read sequencing, Mendelspod host Theral Timpson interviewed Marc Salit, leader of the Genome Scale Measurements Group at the National Institute of Standards and Technology. Their conversation focused on how and why NIST is involved in establishing baseline measurements for the human genome.
Salit, along with Justin Zook and their team at NIST, are managing the Genome in a Bottle (GIAB) Consortium to develop reference materials, data, and methods needed to assess whole human genome sequencing. Their goal is to establish a physical reference genome as a standard against which subsequent measurements can be compared, providing a foundation for the translation of whole genome sequencing into clinical practice.
As part of the GIAB Consortium’s efforts, we’re working with NIST (alongside many others from the PacBio community) to contribute both long-read sequencing data and analysis methods to help Salit and his team achieve this vision. Already NIST has SMRT® Sequencing data for the NA12878 genome as well as an Ashkenazi Jewish trio from the Personal Genome Project. They recently released this data in a bioRxiv pre-print entitled “Extensive sequencing of seven human genomes to characterize benchmark reference materials.”
Here are a few highlights from the conversation:
NIST’s progress to date
In Salit’s estimation, characterization of the first GIAB reference material (NIST RM 8398) is about two-thirds complete for small variants. “We can confidently call reference alleles, SNPs, and indels in about 77% of the genome,” he said. “We’re only starting to have confidence and methodology to understand how confident we should be about structural variations.” The remaining parts of the genome are more difficult to characterize, due to long homopolymer regions, long duplicated regions, or highly repetitive regions that are challenging to access with short-read data. Another issue is larger, more complex genomic regions where there is enormous genetic variation across populations, which makes it difficult to determine if the assembly process gets it right with 100% certainty.
As part of the characterization process, the GIAB team is integrating data in a systematic arbitration fashion, from a variety of technologies and platforms. “We try to find evidence that backs up the call from each platform, then look for unambiguous calls where you’ve got strong supporting evidence that there are no technical artifacts or systematic sequencing errors,” Salit said. He is attempting to take the best evidence to make confident calls across the genome for complex and structural variants in particular. Long-read, highly accurate sequencing data from PacBio is having a real impact in this effort. “It’s made a major difference with what part of the genome we can see,” Salit told Timpson. “Instead of looking at the world through a soda straw, we’re now looking at the world at least through a paper towel tube.”
Salit and his team at NIST are working toward a future in which whole genome sequencing will be used as a clinical test to answer the tough questions around therapeutic decision-making. Their focus is on ensuring that those answers are demonstrably safe and efficacious for clinical applications, clearing the path for regulated applications of sequencing technology. For Salit, NIST holds unique power in its ability to convene the scientific community to address questions around basic measurement science: What are we measuring? Are we all measuring the same thing? How do we demonstrate that? Ultimately, GIAB is building an infrastructure “on top of which the century of biology will come into being,” Salit said. “We build the sewers of science, and when they’re working, nobody notices. When they’re not working, everybody notices.”
New podcasts in this Mendelspod series will be coming soon, and we’ll keep you posted with highlights as they become available.
Genome in a Bottle consortium
The National Institute of Standards and Technology held its latest Genome in a Bottle workshop last month in Gaithersburg, Md., and we were honored to attend. NIST has performed pivotal work to establish reference materials for the genomics community, starting with its RNA spike-in standards (ERCC spike-in controls) and continuing now with the GIAB consortium. These standards are essential for quality control and we’re pleased to be working with NIST to help ensure the highest accuracy in human genome sequencing.
Last year, GIAB released its first reference standard, based on the well-studied NA12878 human genome (NIST RM 8398). At this year’s meeting, attendees from clinical sequencing groups shared their experiences with this reference material, which they are using to support clinical testing and validation. The reference is designed to help users make high-confidence variant calls across many types of variation. In addition, attendees from the FDA described how they anticipate using the reference material to assess device performance as part of the regulatory review process for diagnostics.
The workshop also featured a research track reporting on efforts surrounding the newest GIAB reference material, which will be based on an Ashkenazi Jewish trio from the Personal Genome Project. Through the GIAB project consortium, NIST has characterized the genomes from this trio using measurements from 11 different technologies, including those from BioNano Genomics, Complete Genomics (paired-end and Long Fragment Read), Thermo Fisher Scientific (the Ion Proton system and SOLiD sequencing), Oxford Nanopore, Pacific Biosciences, 10X Genomics (GemCode Platform), and Illumina (paired-end, mate-pair, and synthetic long read sequencing). All of the data, including the data from PacBio® sequencing, has been made public through a paper recently posted on bioRxiv.
The GIAB data analysis communities have been actively working to analyze these public data sets and report on a variety of results. Speakers including Adam Phillippy, Ali Bashir, Shinichi Morishita, and Will Salerno presented early results from their efforts to fully characterize the trio genomes using the data, including de novo assembly, structural variation profiling, SNV calling, haplotype reconstruction, and methylation analysis of the epigenome. Based on the sheer number of technologies being used to decode the genomes of this family, it seems these three individuals will soon have some of the most deeply analyzed genomes in the world!
Marc Salit, who leads the Genome Scale Measurements Group at NIST, said the institute plans to integrate these data into a consolidated set of high-quality measurements and make them publicly available through NCBI. The previously published reference material, Salit told attendees, has already been used to support 510(k) diagnostic device filings to the FDA and to demonstrate validation by clinical sequencing labs during CAP/CLIA inspections.
Meeting attendees were also invited to a session with the NIST GIAB steering committee, where stakeholders agreed that the most important priority was to completely characterize the NA12878 and AJ trio genomes to ensure high-confidence calls across all categories of genetic variation spanning the whole genome.
The next GIAB meeting will take place January 28-29, 2016, at Stanford University, and we look forward to participating again and continuing our contributions to this community.
We’re pleased to announce the winner of our recent “SMRTest Microbe” grant competition. Congratulations to Dr. Erin Price at the Menzies School of Health Research in Australia! The grant program, co-sponsored by PacBio and the Institute for Genome Sciences (IGS), was very competitive, with more than 100 submitted proposals.
Dr. Price will receive SMRT® Sequencing and analysis from IGS — using up to 4 SMRTbell™ libraries and 8 SMRT Cells — to characterize the mechanisms behind the emergence of antibiotic resistance in Burkholderia pseudomallei, a highly pathogenic bacterium that causes the potentially deadly disease melioidosis. Dr. Price and colleagues have recently uncovered the development of meropenem resistance in local cases of B. pseudomallei infection, along with evidence that this resistance is linked to at least two mortalities in Australia so far.
Prior short-read-based attempts to sequence and assemble the genomes of these meropenem-resistant B. pseudomallei have suffered from the inability to scaffold across highly repetitive and paralogous loci, low genome complexity, high GC content, and genomic inversions. Dr. Price plans to use the long reads generated by SMRT Sequencing to overcome these assembly issues and close the genomes. As noted in her proposal, complete genome sequences “will provide significant insights into the molecular basis of meropenem resistance in this dangerous pathogen.”
Last month we hosted a SMRT® Informatics Developers Conference, bringing together 150 developers with a passion for improving tools and resources. Our team came back brimming with enthusiasm for tools that will be released in the coming months, and humbled by the commitment we saw from the bioinformatics community to help scientists make SMRT Sequencing data increasingly useful. Thanks to the National Institute of Standards and Technology for hosting our meeting on their campus right before the Genome in a Bottle workshop.
The big news we shared with attendees is that the PacBio® System will now output industry-standard BAM files instead of our usual HDF5 format — check out the new specifications.
Our keynote presentation came from sequencing veteran Gene Myers of the Max-Planck Institute. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.
Much of the day-long event was allocated to networking time — providing opportunities for developers to catch up, brainstorm, and exchange ideas. Breakout sessions covering different analysis applications allowed developers to provide updates on a number of tools due out before the end of the year, including ones for Structural Variant Detection, Iso-Seq™ analysis and epigenetic analysis. It’s exciting to see a real expansion in the suite of community-driven tools available for SMRT data.
What we heard most throughout the event was that we should do this more often, and we’ve taken that to heart. We’re hoping to hold developer conferences semiannually, and will keep you posted as plans take shape for the next one.
In the meantime, check out these resources from the meeting:
• SMRT Informatics Developers Conference – Kevin Corcoran, Senior Vice President, Market Development, Pacific Biosciences
• Making the Most of Long Reads – Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute
• PacBio SMRT Analysis 3.0 Preview – David Alexander, Ph.D., Pacific Biosciences
• MinHash for Overlapping and Assembly – Sergey Koren, Ph.D., National Biodefense Analysis and Countermeasures Center
• The “Art” of Shotgun Sequencing – Jason Chin, Ph.D., Pacific Biosciences
• PBHoney: Detecting SVs with Long-Read Sequencing – Adam English, Ph.D., Baylor College of Medicine
• Structural Variation with PacBio Data – Ali Bashir, Ph.D., Mount Sinai School of Medicine
• The Iso-Seq™ Method: Transcriptome Sequencing Using Long Reads – Elizabeth Tseng, Ph.D., Pacific Biosciences
• CONVEX: De novo Transcriptome Error Correction by Convexification – David Tse, Ph.D., Stanford University
• Transcriptome Analysis using Hybrid-Seq – Kin Fai Au, Ph.D., University of Iowa
• Understanding Methylome, Metagenome, Structural Variants using SMRT Sequencing – Shinichi Morishita, Ph.D., University of Tokyo
• Storify link to see all tweets from the event (#SMRTBFX)
Also, the Google groups for the events are available to continue the conversation – these forums are open to anyone who wants to join:
• De Novo Assembly – SMRT_denovo
• Structural Variation – SMRT_sv
• Iso-Seq – SMRT_IsoSeq (note, this is a change)
• Kinetics – SMRT_kinetics
Roche recently posted this recording of a webinar walking through long fragment capture with SMRT® Sequencing. “Long Genomic DNA Fragment Capture and SMRT Sequencing Enables Accurate Phasing of Cancer and HLA Loci” is a great backgrounder for scientists interested in using the Roche NimbleGen SeqCap EZ System for target enrichment prior to sequencing on the PacBio® system.
The webinar features Denise Raterman from Roche NimbleGen and our own bioinformatics expert Lawrence Hon. Raterman provides a detailed review of the SeqCap EZ workflow, pointing out the specific steps that differ for SMRT Sequencing. The method can be used to capture up to 200 Mb of DNA. Hon presents a step-by-step guide for using SMRT Portal and other tools, including data from the MHC region and a targeted oncology panel, that demonstrate the even coverage generated across multi-kilobase genomic regions. He also explains the bioinformatics workflow for phasing and analyzing haplotypes.
The webinar concludes with a robust Q&A section, including details on hybridization time, probe design, de novo assembly, protocol development, DNA input volumes, HLA typing, and more. For additional details on the SeqCap EZ workflow, check out this blog post.