This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
We’re packing our bags for Orlando and the 17th annual Advances in Genome Biology and Technology (AGBT) conference! While we’ll miss the usual Marco Island setting, this year’s talks and posters look as appealing as ever. And as a meeting sponsor, we’ll be right in the thick of it — with a workshop, party, and coffee-lounge-style hospitality suite for AGBT attendees.
It’s a thrill to see that more than 40 talks and posters will showcase SMRT Sequencing data, many for human biomedical research applications. Customer presentations include a talk from the National Center for Biotechnology Information on evolving approaches to improve reference genome assemblies, a presentation from the Washington University School of Medicine on the first African reference genome assembly, and a talk from the University of Washington on long-read sequence assembly of the gorilla genome.
In addition, our CSO Jonas Korlach will present data from the new Sequel System in a plenary talk on Saturday afternoon. You can get a sneak peek at the new sequencer in our hospitality suite, Del Lago #2, or just enjoy some coffee and conversation while you recharge (yourself and your devices).
This year our lunch workshop will be co-hosted by Roche (Friday, 12 p.m., Coquina North). “Revealing the Unknowns in Medical Research with Long-Read SMRT Sequencing” will include talks from Euan Ashley of Stanford University, Robert Sebra from the Icahn School of Medicine at Mount Sinai, Ben Murrell at the University of California, San Diego, and Ulf Gyllensten of Uppsala University. Can’t make it to the workshop? We’ll be streaming it live, so keep an eye on the blog for details or sign up here for info.
If you happen to be up early on Thursday or Friday (or if you just didn’t get to bed the night before), join us on the sand courts next to the Quench Poolside Bar and Grill for a friendly game of beach volleyball.
Finally, don’t miss our Friday night party. Celebrate with us under the stars on the swanky Fairway Lawn. Many thanks to our event partners — BioNano Genomics, DNAnexus, New England Biolabs, and Sage Science — for helping us make this cool event possible.
Here’s a full list of presentations and posters featuring PacBio data at the meeting, as well as our activities and events. We hope to see you in Orlando!
We recently introduced our Genome Galaxy Initiative in partnership with Experiment, through which we’re helping scientists fund genomic research for the benefit of science and society. One of the first explorers of this initiative is Cory Gall, a graduate student at Washington State University who wants to curb the onset of a disease that may be linked to ticks in Africa.
Gall brings our attention to the rising incidence of acute febrile illness occurring in the Mnisi community in South Africa, where close proximity of community animals (including dogs and cattle) and wild animals in a nearby national park may be contributing to disease dynamics and influencing human health. Some evidence suggests that ticks may play a role in the illness, which is still relatively uncharacterized, and Gall seeks public support on Experiment for an expedition to understand whether ticks are indeed the culprits of disease transmission. He hopes to study the local animals and identify the responsible pathogen by using full-length 16S SMRT Sequencing, among other approaches, which will allow him to identify without ambiguity the microbial species that may be contributing to this disease.
Experiment is like Kickstarter for science: it’s a crowdfunding platform that lets scientists connect directly with the public to raise money for projects that might not fit traditional funding avenues, or ones that will generate the preliminary data needed for larger grant proposals. More than 380 scientific projects have been funded through Experiment so far, with supporters pledging more than $5 million. Contributions of any size, however small, are welcomed and are only accepted when the funding goals are achieved. Scientists funded this way can publish their results and share them at conferences as usual, but they also engage their crowdfunding community with updates and discussions along the way.
Gall first used crowdfunding for a scientific project two years ago, and he says the benefit was far more than the $3,000 he raised for research into tick-borne disease: he connected directly with a community that was truly passionate about the work he wanted to do, and he got a crash course in communicating complicated science to a lay audience. “As government funding is becoming more difficult to get, scientists are going to need to start exploring alternatives,” he says. By making projects relatable to a general audience, scientists can open the doors to new funding opportunities for their research. While crowdfunding can’t replace large government or charitable grants, it offers flexibility for riskier projects and ultimately makes those big grant proposals more competitive by fleshing them out with preliminary data.
Gall jokes that his research was a great fit for crowdfunding “because everyone hates ticks.” Still, he worked hard on short, accessible explanations of the project and what he hoped to learn — an effort that paid off in the end, and a lesson he has taken into his new project.
Follow Cory (@Bearded_Science) on his tick hunting expedition in South Africa. Make your impact as a science patron to Genome Galaxy and learn more about ticks and the rising cases of acute febrile illness in South Africa.
On the heels of her remarkable paper tracing influenza evolution in a single host last spring, New York University’s Elodie Ghedin has come out with a new publication in Nature Genetics that offers a higher-resolution view of how the flu spreads through a population.
From lead author Leo Poon at the University of Hong Kong and senior author Ghedin, “Quantifying influenza virus diversity and transmission in humans” reports the results of an international collaboration to track the 2009 H1N1 flu pandemic in Hong Kong.
The authors began with the premise that much about the genetically diverse influenza A virus is unknown, since it has primarily been studied at the population level. That type of approach successfully captures dominant flu strains, but may be limited in its ability to elucidate minor viral variants.
Ghedin and her collaborators used SMRT Sequencing to specifically detect these viral haplotypes in each individual host. “To characterize virus variants that achieve sustainable transmission in new hosts, we examined within-host virus genetic diversity in household donor-recipient pairs from the first wave of the 2009 H1N1 pandemic when seasonal H3N2 was co-circulating,” the team writes, noting that minor variants were found in all patients. “Although the same variants were found in multiple members of the community, the relative frequencies of variants fluctuated, with patterns of genetic variation more similar within than between households.”
The team determined that these minor variants, viewed as complete haplotypes for the first time with long-read sequencing, are more important to flu transmission than was previously believed. This critical finding could shape the future design of flu vaccines, which typically do not target minor variants.
Whole genome sequencing was performed on viral strains captured by nasal swabs from flu patients and their household members during the 2009 pandemic. A phylogenetic analysis detailed the transmission pattern between infected patients, both within and between households in the community. “We were able to look at the variants and could link individuals based on these variants,” Ghedin said in a statement. “What stood out was also how these mixes of major and minor strains were being transmitted across the population during the 2009 pandemic — to the point where minor strains became dominant.”
Read the GenomeWeb article describing this work.
Blog readers know that we are committed to supporting open-access research, from working with the informatics developer community to develop improved tools to releasing SMRT Sequencing data so scientists can mine it themselves. We’re proud to launch a new program that takes this commitment to the next level: the Genome Galaxy Initiative.
This initiative stems from twin trends in the scientific community: rapidly increasing demand for SMRT Sequencing, even for very small projects, and increasing awareness that alternative funding sources are important to keep pushing genomics forward. We’ve partnered with scientific crowdfunding platform Experiment to connect researchers directly with the public to help get their SMRT Sequencing projects funded.
Experiment launched in 2012 with a mission to fill gaps in the funding pipeline by helping scientists raise money for small projects that don’t fit typical NIH categories or that could produce the early-stage data needed for traditional grant applications. Scientists are able to directly engage the public through Experiment to generate support for their research, and communicate often with updates on project status and findings to interested backers.
The Genome Galaxy Initiative, based on the Experiment platform, supports expedited, open-access genomic projects. It’s a central location for SMRT Sequencing-based projects seeking crowdfunding, and fosters a community of scientists and patrons interested in asking research questions that can only be answered with long-read sequencing. As high-quality genome assemblies from the PacBio RS II and the Sequel System have become even more affordable and accessible, partnering with Experiment is a great fit. Through this program, even more scientists will have access to the most comprehensive view of genomes, transcriptomes, and epigenomes from SMRT Sequencing.
In celebration of the program, we are incorporating the Genome Galaxy Initiative into this year’s first SMRT Grant program. Once all entries have been submitted (the deadline is January 31), a scientific review committee will select the top five finalists for our “Explore Your Most Interesting Genome” grant program. Those finalists will post their projects on Experiment and campaign for public support, and the community will vote for the SMRT Grant winner. The four runners-up will have an immediate second opportunity to “win” by crowdsourcing funds to kick off their projects on Experiment.
We’ll keep you posted on the program, and let you know as new SMRT Sequencing projects get their chance to star in the Genome Galaxy Initiative via Experiment.
Follow @genomegalaxy to be a part of this initiative.
For Jim Lupski, his long-standing interest in the field of genomics is both personal and professional. His personal interest dates from his teenage years, when he was diagnosed with a rare genetic disease called Charcot-Marie-Tooth (CMT) neuropathy. As a clinician and scientist, he made it his mission to find the genetic basis of CMT, and in 1991 published his discovery of the CMT1A duplication, pioneering the field of structural variation and particularly copy number variation. Today he is a practicing pediatrician and a professor of molecular and human genetics at Baylor College of Medicine, where he is the Principal Investigator of the NHGRI Center for Mendelian Genomics. Highlights from his recent conversation with Mendelspod host Theral Timpson are below.
Moving Beyond the Gene-centric View
During the early days of Lupski’s career, genetics was based on a gene-centric/Mendelian view of biology. Though this enabled him and his team to identify the most common causative CMT gene — PMP22, related to 70% of the cases — the genetic cause of his own disease eluded him until genomics technology became more advanced, enabling researchers to look at more than one disease locus at a time. Lupski likens the advent of genomics to the impact of Einstein’s theories on Newtonian science: “When Einstein physics came around, it didn’t say that the Newtonian view was wrong, it just generalized the concept of relativity,” he told Timpson. “I think genomics more generalizes the concept of disease traits because we can look at more than one locus at a time.” A genome-wide view allows researchers to delve into the subtleties of genetic disease, including potential driver vs. modifier alleles, as well as the occurrence of de novo mutations which can’t be tracked through lineages.
Generating Data Faster than We Can Analyze It
“Gene discoveries [are] happening at a much greater pace … rather than reporting a gene a year, or a gene a month, or even a gene a week, we’re talking about hundreds of genes at a time,” said Lupski. He believes that data generated by clinical implementation may soon outstrip data from research laboratories, and that data is being generated faster than scientists can analyze it. “We’ve got to dig into these data,” he added. He notes that when clinicians uncover a causative gene, they often leave the rest of the genome unexplored. Conversely, in the 75% of cases where no plausible explanation is found in a patient’s genome, patients should be given the option to share the data with researchers for basic research, as they do at the Center for Mendelian Genomics at Baylor.
The Benefit of Long Reads
Lupski’s genome has been published three times. The most recent assembly was done with PacBio sequencing, which identified significantly more structural variation than the other technologies, including three times more copy number variations compared to what was found with 10 different whole genome sequencing runs using short-read methods. Lupski imagines a clinic where he could use PacBio long reads, layering short-read sequencing data on top for precision. “De novo assembly in the clinic, to me, would be the goal that you would really strive for,” he said.
When Doreen Ware and her team’s latest genome project is complete, the plant science community will have a critical new tool that once seemed virtually impossible: a robust reference assembly for the maize genome. This resource will support breeding of hardier, higher-yielding lines of maize, the number one crop plant in both the U.S. and China.
Ware, a computational biologist with the USDA at Cold Spring Harbor Laboratory, says climate change and protecting the environment are major challenges facing agriculture. “We know that we must increase yield in order to meet the expected 9 billion people in less than 25 years,” she says. “Having the genome and the genome content allows us to accelerate the improvement of maize varieties and the germplasm.”
Maize, with about 2.3 Gb of sequence, seems almost designed to evade genomic characterization. Previous sequencing efforts with Sanger and other technologies were stymied by the plant’s complex universe of transposable elements and highly repetitive regions. The genome differs significantly between plants: the gene complement of any two maize plants, even of the exact same variety, can differ by as much as five percent. For population-level crop improvement, Ware says, this field needs not one reference assembly but many.
Her team is well on its way to delivering the first truly high-quality reference assembly for maize, with plans in the works to produce several more in the near future. This impressive achievement was based on pairing two technologies: Single Molecule, Real-Time (SMRT) Sequencing from PacBio, and Next-Generation Mapping from BioNano Genomics’ Irys System. Together, these approaches have enabled Ware and her team to produce an assembly with greater contiguity and accuracy than has ever been possible for this challenging genome, providing the first-ever look at important regulatory and structural elements that could influence breeding approaches.
The team began with PacBio sequencing, generating a de novo assembly of stunning quality. The previous version of the publicly available assembly for maize (B73 RefGen_v3), which was based on Sanger-sequenced BACs and 454 sequencing data, was broken up into about 140,000 contigs; the PacBio assembly has just 3,300 contigs. Contig N50 lengths rose from about 19 kb in the previous assembly to more than 1 Mb in the PacBio assembly.
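For readers less familiar with the N50 metric quoted above, here is a minimal sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The contig lengths below are invented toy values, not data from the maize assemblies, and the function name is our own.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths in kb): the same 300 kb of sequence,
# first in seven pieces, then joined into three longer pieces.
fragmented = [80, 70, 50, 40, 30, 20, 10]
merged = [150, 90, 60]

print(n50(fragmented))
print(n50(merged))
```

The same amount of sequence yields a higher N50 when it sits in fewer, longer contigs, which is why the metric is a popular shorthand for assembly contiguity.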
“The contigs that we’re looking at now have almost no unknown bases in them,” Ware says. In every chromosome, existing sequence length was extended, filling in gaps that had previously peppered the assembly. The new assembly includes most of the centromeric sequence across all 10 chromosomes and even represents portions of the telomeres.
The next step was layering BioNano Mapping into the new assembly. Again, the results represented a significant improvement made possible only by pairing the technologies: The total contig count was reduced to just 768, while the contig N50 length increased to more than 9.5 Mb.
“What the PacBio assembly and the BioNano map allowed us to do is to achieve improved contigs and scaffolding,” Ware says. She says that each tool offers value — nucleotide-level resolution and long reads with PacBio sequencing, and massive scaffolds with BioNano Mapping — but that together they’re even more effective. “The combination gives you a better product than either technology does alone,” she adds.
Being able to view repeat content gives the team its first clear look at gene regulation. “Which transposable element is near a gene may determine the methylation status surrounding that gene, and the methylation status can directly impact expression,” Ware notes. “With the complete reference, we can start to understand how these genes are regulated.”
Now, the team is plowing ahead with plans to sequence three or four additional maize genomes in the coming year. “We don’t want just one reference, we want hundreds of references,” Ware says. “Having more genomes will let us see the full complement of genes within the species, how much of the regulatory regions are conserved, and how diverse they are.”
To learn more about the new maize reference assembly, read the full case study here.
In an article entitled “Long-read single-molecule real-time (SMRT) full gene sequencing of cytochrome P450-2D6 (CYP2D6)” in Human Mutation, authors Wanqiong Qiao, Yao Yang, Stuart Scott and other colleagues at the Icahn School of Medicine at Mount Sinai demonstrate a new way of analyzing the CYP2D6 gene using PacBio long reads. This gene has been shown to have a central role in drug metabolism and is believed to be directly involved in the metabolism of ~25% of all commonly used drugs. Given its importance, CYP2D6 genotype testing is now being widely used to predict how efficiently patients will metabolize drugs such as codeine, antidepressants, or antipsychotics.
Studying CYP2D6 presents many challenges. It is highly polymorphic, “with over 100 variant star (*) alleles catalogued, many of which are associated with reduced or no enzyme activity,” the authors report. In addition, it is highly prone to copy number variation. Both gene duplications and deletions can occur, with pseudogenes maintaining high sequence homology to functional copies. As a result, “Accurate prediction of CYP2D6 metabolizer status necessitates direct analysis of the duplicated gene copy (or copies) when an increased copy number is detected, particularly when identified concurrently with normal activity and loss-of-function alleles in compound heterozygosity,” the authors write. Such copy number changes can alter the interpretation of CYP2D6 phenotypes.
In their publication, the authors describe targeted long-read sequencing of the CYP2D6 gene along with upstream and downstream gene copies by using targeted long-range PCR and barcoding for multiplexing. The analysis consisted of demultiplexing, read alignment using BWA-MEM software, and error correction using Amplicon Long-read Error Correction (ALEC) that was developed by the authors. The team began by validating their SMRT Sequencing pipeline with 10 previously characterized DNA samples, showing that not only were they able to correctly call genotypes, but their approach provided additional information about variants that had been missed by other platforms. Specifically, SMRT Sequencing enabled the team to further refine the genotypes, reclassify diplotypes (two haplotypes, i.e., multiple genotypes on homologous chromosomes), characterize allele-specific duplication, and discover novel alleles.
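To give a feel for the first step in that analysis, demultiplexing, here is a toy sketch that sorts pooled reads back into per-sample bins by the barcode at the start of each read. It is an illustration only and not the authors' pipeline: the barcode sequences, sample names, and reads are made up, and it assumes exact matches, whereas production tools tolerate sequencing errors and check both ends of the amplicon.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by the sample whose barcode they start with.

    reads: iterable of sequence strings
    barcodes: dict mapping sample name -> barcode sequence
    Returns a dict mapping sample name -> list of reads with the barcode
    trimmed off; reads matching no barcode are kept under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"sample_A": "ACGT", "sample_B": "TGCA"}
reads = ["ACGTGGGCCC", "TGCATTTAAA", "ACGTCCCGGG", "GGGGAAAATT"]

result = demultiplex(reads, barcodes)
for sample, sample_reads in result.items():
    print(sample, sample_reads)
```

Once reads are binned this way, each sample's reads can be aligned and error-corrected separately, which is what makes pooling many barcoded amplicons in one sequencing run practical.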
After validation work, they then applied their method to 14 samples that previously had been found to yield inconclusive or unreliable results. SMRT Sequencing was able to reconcile the discrepancies that had been seen from the other platforms and provide new data. “In addition to confirming consensus diplotypes, CYP2D6 SMRT sequencing enabled suballele resolution, genotype refinement, duplicated allele characterization, and discovery of a novel tandem arrangement,” the scientists report.
The authors conclude, “Long-read CYP2D6 SMRT sequencing is an innovative, reproducible, and validated method for full-gene characterization, duplication allele-specific analysis and novel allele discovery, which will likely improve CYP2D6 metabolizer phenotype prediction for both research and clinical testing applications.”
You can also watch Dr. Stuart Scott describing this research in a presentation he gave at the American Society of Human Genetics meeting in 2014.
We’re looking forward to the International Plant and Animal Genome conference, taking place January 9-13 in San Diego. PAG features leading plant and animal scientists from around the world, and we’re continually impressed by their new discoveries and creative approaches to understanding large and complex genomes.
This year PAG attendees will have a number of opportunities to learn more about how SMRT Sequencing reveals new information about even well-characterized plant and animal genomes. We’ll be exhibiting in booth #421 — and showing off our new Sequel System — so please stop by and tell us about your work.
We’ll also be hosting a workshop: “Discover a More Complete View of Genetic Diversity.” It will be held on Tuesday, January 12, at 1:30 p.m. in the San Diego meeting room of the Town and Country Hotel. Here’s the speaker lineup:
- Marty Badgett, PacBio
SMRT Sequencing in 2016: Technology updates & developments
- Oliver Ryder, San Diego Zoo Institute for Conservation Research
Conservation genomics of a critically endangered Hawaiian bird: A high quality genome assembly of the ’alala will assist in population management and reintroduction
- Jenny Gu, PacBio
The Genome Galaxy Initiative – expedited open-access genomic investigations
- Doreen Ware, USDA at Cold Spring Harbor Laboratory
Single-molecule sequencing of the maize genome and transcriptome
- Alan Archibald, The Roslin Institute
An improved reference pig genome sequence to enable research and prediction
- Shwen Ho, King Abdullah University of Science and Technology
Empowering genetics research in Chenopodium quinoa with single molecule genomics
Finally, we’re making the most of having so many great scientists in one place by hosting a meeting for bioinformatics developers at the end of PAG. Our SMRT Informatics Developers Conference will take place from 12:00 – 5:30 p.m. on Wednesday, January 13, in the San Diego room. The event is about developing and improving data analysis tools for PacBio SMRT Sequencing data, with an emphasis on collaboration and brainstorming. Our speakers will focus on de novo assembly and Iso-Seq data analysis:
Sergey Koren, National Human Genome Research Institute
De novo assembly:
Jason Chin, PacBio
Alex Hastie, BioNano Genomics
Kin Fai Au, University of Iowa
Ana Conesa, University of Florida
Meisam Razaviyayn, Stanford University
Serghei Mangul, University of California, Los Angeles
Elizabeth Tseng, PacBio
Learn more about or register for this free event. We look forward to seeing you in sunny San Diego!
We’re already looking forward to next month’s Personalized Medicine World Conference. Long before “precision medicine” was an industry catchphrase, PMWC was bringing together stakeholders from genomics companies and academic research, regulatory agencies, clinical groups, pharma/biotech, and more. Launched in 2009, the meeting has prompted important discussions as well as insight about how to move the field forward in a thoughtful way.
From January 24th to the 27th, some 1,200 PMWC attendees will descend on the Computer History Museum in Mountain View, Calif. The event will kick off with a reception honoring the four awardees of this conference: Merck’s Roger Perlmutter and UCSF’s Laura Esserman are being given the PMWC Luminary Award for accelerating personalized medicine into clinical use, while Irv Weissman from Stanford and Ralph Snyderman from Duke have won the PMWC Pioneer Award for ahead-of-their-time advances in personalized medicine.
Our own CSO Jonas Korlach will be speaking at the conference about using high-accuracy sequencing technology to advance our understanding of biology and disease. If you’ll be attending the meeting, be sure to check out the panel discussion ‘NGS and Clinical Interplay’ on January 25 at 9:00 a.m. and the NGS session on January 26 at 1:30 p.m. to hear his presentation entitled “Whole Genome? The Future of High-Quality, De Novo Human Genomes.”
Registration for the meeting is still open. If you’re interested in signing up, click here. We hope to see you there!
A team of scientists from Australia, Canada, and the US published fascinating new work that may help explain gene expression patterns seen in prostate cancer. In the course of the project, they used SMRT Sequencing and found a novel fusion transcript linking two genes with high sequence identity.
“Identification of a novel fusion transcript between human relaxin-1 (RLN1) and human relaxin-2 (RLN2) in prostate cancer” was published in Molecular and Cellular Endocrinology by lead author Gregor Tevz, senior author Colleen Nelson, and a number of collaborators. In it, the scientists attempted to untangle expression signals from two relaxin genes, which were formed by a duplication event sometime before humans and apes branched off. The genes play a role in reproduction and are most highly expressed in ovaries and prostate. “Outside normal physiology, RLN2 is a promoter of cancer progression in several different types of cancers,” the scientists note.
Previous studies were unable to distinguish between the two genes, so this team deployed long-read sequencing and the Iso-Seq method from PacBio to sort out reads from RLN1 and RLN2 in LNCaP cells. Using their results along with publicly available data, they made a number of discoveries. For one thing, they found that most prostate cancer cell lines underrepresent RLN1, which is highly expressed in both normal and cancerous tissue. “LNCaP cells best reflect the RLN1 expression observed in [prostate cancer] and is the most relevant cell line for the use in further studies of RLN1 biology,” the team reports.
They also detected a novel fusion transcript that incorporates large swaths of both RLN1 and RLN2, but were able to design primers to distinguish the fusion from the genes. “The fusion transcript encodes a putative RLN2 with a deleted secretory signal peptide indicating a potentially biologically important alteration,” the scientists write. They determined that RLN1 and the fusion transcript are inversely regulated by androgens, and suggest that follow-up studies will be helpful to elucidate the mechanisms governing this response.
While we’re on the subject of cancer, don’t forget that the abstract deadline for the 2016 AACR annual meeting is coming up on December 2. We’re already looking forward to hearing about more great discoveries at that conference!
We’re excited about a new Nature paper from the winners of our 2014 “Most Interesting Genome in the World” SMRT Grant program. “Single-molecule sequencing of the desiccation tolerant grass Oropetium thomaeum” comes from lead authors Robert VanBuren and Doug Bryant along with senior author Todd Mockler at the Donald Danforth Plant Science Center, as well as a number of collaborators at other institutions. In it, the authors report a virtually complete genome of Oropetium thomaeum, a grass with an estimated genome size of 245 Mb and the handy ability to regrow even after extreme drought once water becomes available.
The scientists believe that a better understanding of the plant’s genome could shed light on the mechanisms underpinning these so-called resurrection plants, and ultimately enable the engineering of crop plants to withstand severe drought and stress.
For this study, the team worked with about 72x coverage of the Oropetium genome generated by the PacBio system. That’s “equivalent to <1 week of sequencing time and <$10k in reagents,” according to the paper. Based on HGAP and Quiver, the resulting assembly covered 99% of the genome in 625 contigs, with an accuracy of 99.99995% and a contig N50 length of 2.4 Mb.
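As a back-of-the-envelope check on what 72x coverage means, coverage is simply the total number of sequenced bases divided by the genome size. The sketch below uses the 245 Mb genome size estimate and 72x figure mentioned in this post; the function names are our own, and the result is a rough estimate rather than a number reported in the paper.

```python
def total_bases(coverage, genome_size_bp):
    """Bases of sequence needed to reach a given average coverage."""
    return coverage * genome_size_bp


def coverage_from_yield(sequenced_bp, genome_size_bp):
    """Average coverage obtained from a given amount of sequence."""
    return sequenced_bp / genome_size_bp


GENOME_SIZE_BP = 245_000_000  # estimated Oropetium genome size (245 Mb)
COVERAGE = 72                 # approximate coverage cited for the study

bases = total_bases(COVERAGE, GENOME_SIZE_BP)
print(f"{bases / 1e9:.2f} Gb of sequence")  # prints "17.64 Gb of sequence"
```

In other words, roughly 17.6 Gb of raw sequence is implied by the stated coverage and genome size.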
VanBuren et al. note that the contiguity of the assembly sets it apart from draft genomes produced from short-read sequencers. “Most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds,” the scientists write. Because the assemblies are so fragmented, “they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements (TEs), centromeres, telomeres and haplotype-specific structural variations.”
Instead, SMRT Sequencing is pushing new limits to characterize those elements in the Oropetium genome, with its predicted 28,446 protein-coding genes and a significant proportion of repeat regions. The authors noted that “the largest tandem array contains five identical and one partial 9 kb repeats collectively spanning 51 kb; this is approaching the theoretical limit given the current read-length distributions of PacBio.” The assembly includes telomere and centromere sequence, long terminal-repeat retrotransposons, tandem duplicated genes, and other difficult-to-access genomic elements. In addition, the scientists produced the full chloroplast genome in a single contig that includes “~25 kb of inverted repeat regions which typically collapse into a single copy during assembly,” they report.
“The Oropetium genome showcases the utility of SMRT sequencing for assembling high-quality plant and other eukaryotic genomes,” the scientists note, “and serves as a valuable resource for the plant comparative genomics community.”
Mendelspod host Theral Timpson recently interviewed Professor Steven Marsh, Director of Bioinformatics at the Anthony Nolan Research Institute, a UK-based organization dedicated to improving the outcomes of bone marrow transplantation and host to the world’s first bone marrow registry. Prof. Marsh and his team have dramatically improved the resolution of HLA typing — one of the methods used for matching compatible donors with transplant recipients — using long, accurate reads from PacBio sequencing. Their fascinating conversation covers the past, present, and future of HLA typing — highlights are below.
Short History of HLA Typing — There’s a Lot More Diversity than We Thought
When Marsh entered the field 30 years ago, HLA typing was performed with serology, and there were just 119 known HLA antigens. “We thought 119 was a lot of diversity,” he says. With the advent of genomic tools in the 1990s, researchers have had to evolve their practices for typing as more and more became known about the nature of HLA genes. “We’ve really realized that these genes are not just polymorphic, they are really hyper-polymorphic,” he says. The HLA B gene alone has 4,000 variants, Marsh notes. “The only way to do proper HLA typing in this day and age is to do sequencing,” he says.
Enter Long Reads and Exquisite Haplotypes
Using PacBio sequencing technology, Anthony Nolan aimed to extend its sequencing from a couple of gene exons to cover the full HLA genes and capture phasing information. “We’re seeing exquisite haplotypes … all the way through the HLA region,” Marsh says, noting that this gives them “very high resolution typing and very high allelic specificity.”
Marsh says he has been offered free NGS machines from other vendors, but for him, “those technologies would be a distraction.” The MHC/HLA genes are very GC-rich, he explains, making it difficult to use short-read sequencing technologies because of their high systematic error rates. “You cannot assign phase across the whole gene sequence for some allele combinations,” he says.
For Marsh, the future lies with the long-read sequencing capabilities of PacBio. “For me, it’s groundbreaking technology,” he says. One example of the unique capabilities provided to the Anthony Nolan team by PacBio sequencing is 3.5 kb contiguous sequences for HLA Class I genes, including all of the exons and introns, as demonstrated in a recent publication.
Complete, High-Resolution Typing — The Way Forward
Marsh is using the PacBio platform exclusively for his sequencing program and is already seeing the benefits of high-resolution typing. His goal: to speed up the process and improve matching precision to save lives. Anthony Nolan is the first group in the world to take this strategy to the clinic, and is using multiplexing to make the process more cost-effective. “We really believe in [the PacBio technology], and we believe it will make an impact for patients,” he says.
The scientists at Anthony Nolan continue to gain deeper knowledge about HLA genes and plan to expand their focus to other relevant immune-related regions, such as the KIR. They will also continue to explore other important genes located within the MHC locus, such as MIC-A.
Click here to listen to the full podcast.
This week the Festival of Genomics comes to the West Coast, and we’re excited to be a founding sponsor of the Front Line Genomics organization. Not only is it our first chance to show off the new Sequel System in our home state, but there will also be a number of great talks reporting SMRT Sequencing results. Here are some of the presentations to consider if you’re attending the event:
Wednesday, November 4
Ali Bashir, Icahn School of Medicine at Mount Sinai
Will Salerno, Baylor College of Medicine
Robert Sebra, Icahn School of Medicine at Mount Sinai
Thursday, November 5
Tyson Clark, Pacific Biosciences
Maria Nattestad, Cold Spring Harbor Laboratory
We hope you’ll stop by booth #11 to get the tour of our new Sequel System. You can also see the PacBio team lacing up our running shoes for a good cause on Wednesday morning at 10:00 a.m.: once again we’ll be participating in the Race the Helix event, a fundraiser for the Greenwood Genetic Center. (At the first Festival of Genomics, held in Boston in June, our Race the Helix team won the best-dressed award — quite the feat when you’re sprinting on a treadmill for all you’re worth!)
A new publication reports the discovery and analysis of a nightmare bacterium that’s genetically resistant to all commercially available classes of antibiotics.
The paper, “Stepwise evolution of pandrug-resistance in Klebsiella pneumoniae,” came out this month in Scientific Reports from Nature. Lead authors Hosam Zowawi and Brian Forde, along with senior author David Paterson and several collaborators, studied an isolate recovered from the urine of an 87-year-old patient who was hospitalized in the United Arab Emirates last year. They used SMRT Sequencing to characterize the strain and its genetic mechanisms for drug resistance.
That strain, MS6671, “was found to be non-susceptible to all antibiotics tested, which includes cephalosporins, penicillins, carbapenems, aztreonam, aminoglycosides, ciprofloxacin, colistin, tetracyclines, tigecycline, chloramphenicol, trimethoprim-sulfamethoxazole and fosfomycin,” the authors report. They note that carbapenem-resistant Enterobacteriaceae (CRE), including Klebsiella and E. coli, are lethal in almost half of patients with bloodstream infections. “The ‘golden era’ when modern medicine saved lives through antibiotic treatment is under serious threat,” they add.
The scientists sequenced the isolate with the PacBio RS II system and performed de novo genome assembly using HGAP and Quiver. The genome includes a circular chromosome about 5.5 Mb long, as well as five circular plasmids and a linear plasmid prophage, the team reports. The circular chromosome is similar to that of a strain of Klebsiella known for its hypervirulence.
Assembly in hand, the team sought the genetic basis for the strain’s broad resistance to antibiotics. They detected a number of acquired antibiotic resistance genes, a novel variant of a gene that appears to confer heightened resistance to carbapenems, and repeated insertions of mobile elements linked to resistance to colistin, an antibiotic used as a last resort. “Our findings provide the first description of pandrug-resistant CRE at the genomic level, and reveal the critical role of mobile resistance elements in accelerating the emergence of resistance to other last resort antibiotics,” the scientists write. According to the paper, this is the first time that anyone has demonstrated resistance to colistin due to the insertion of a carbapenem resistance mobile element. The authors attribute that partly to SMRT Sequencing, which can accurately sequence complex resistance elements that would confound short-read sequencers.
“Critically, elucidation of the complete K. pneumoniae MS6671 genome using long-read sequencing enabled the context of multiple, identical carbapenem resistance elements to be determined,” the team reports. “Based on this analysis we propose a model for the development of pandrug-resistance in this K. pneumoniae isolate, whereby mobile resistance determinants are responsible for driving additional resistance.”
Zowawi et al. report that as of six months after the patient’s hospitalization, no other cases of this strain had been detected at the facility. “However, the occurrence of this strain in the Arabian Gulf is of great significance,” they write, noting that travel from the region to India, Europe, and the US is common. “The potential for international transfer of multidrug-resistant bacteria emphasizes the need for global surveillance efforts as one part of a strategy to control antibiotic resistance.”
Carbapenem-resistant bacteria have already been cited by health authorities as an urgent threat to human health. “The emergence of this highly resistant strain, in a clone that has proven capable of causing outbreaks, raises this threat level even higher,” the authors conclude.
Richard Roberts, Nobel Laureate and Chief Scientific Officer of New England Biolabs, offers his thoughts on the utility of methylation data for understanding prokaryotes. In his words:
“Please run SMRT Analysis to detect methylation in your prokaryotic PacBio data.
Most bacteria and archaea encode DNA methylases, many of which are known components of restriction-modification systems. Usually, these are quite specific in terms of the sequences they recognize; the restriction component becomes a key defense mechanism preventing phages, plasmids, and other DNA elements from infecting the cell.
Until recently, it was quite difficult to determine the recognition sequences of these methylases. For most organisms, we had no idea whether the genes we could detect in the genome were active or not. Now, thanks to the properties of the DNA polymerase used during SMRT Sequencing, we can accurately locate the positions of m6A and m4C along the genome and sometimes can deduce the position of m5C. By analyzing the sequence context of these modified bases, we can deduce motifs that are the recognition sequences for the various methylases encoded in the genomes. Increasingly, we can then accurately match the genes with the motifs they produce to enable precise, experimentally-determined annotation for those genes.
Further progress in this area will depend on our ability to gather as much experimental data as we can; to improve the algorithms for calling the motifs accurately from the raw PacBio reads; and to improve our ability to match the DNA methylase genes in a genome with the PacBio motifs that are found experimentally. The public availability of motif data produced by running SMRT Analysis after each PacBio run can be enormously beneficial. Even better, if the raw sequence reads are also available, then this can help the development of better algorithms for data interpretation.
There is another terrific use of the methylation data for anyone interested in trying to transform these strains: While the presence of methylated motifs — and hence methylase genes — does not mean that an active restriction system is present, very often it does, offering some information about how one might protect DNA to be used for transformation before it is introduced into the cell.
I encourage everyone to think ‘methylation’ when using PacBio systems to sequence bacterial and archaeal genomes. The current results of such methylation analysis can be found in REBASE by clicking on the blue PacBio icon. This also has a link through which you can submit your methylation motifs to REBASE.”
Following on the heels of characterizing 18 Mst77Y genes that were tandemly duplicated within a 96 kb region (Krsticevic FJ, et al., 2015), scientists from institutes in Brazil, Austria, and the United States recently published a study in which they also used the Drosophila melanogaster data release from PacBio to characterize a region of the Y chromosome that had never before been accessible.
In a paper published in PNAS, entitled “Birth of a new gene on the Y chromosome of Drosophila melanogaster,” lead author Antonio Bernardo Carvalho, senior author Andrew Clark, and collaborators detail their discovery of a gene duplicated from an autosome. “We emphasize the utility of PacBio technology in dealing with difficult genomic regions,” the authors write. “PacBio produced a seemingly error-free assembly of the FDY region, something that has eluded us for years of hard work.”
The 55 kb region, which consists of several pseudogenes as well as the newly discovered functional FDY gene, has been challenging to sequence and assemble since it exists only on the Y chromosome and is full of highly repetitive sequence. Some 75% of its length, the scientists report, is made up of transposable elements.
Their discovery was worth the wait. Unlike mammalian Y chromosomes, which are thought to evolve primarily by gene loss, the Drosophila Y chromosome appears to be the result of millions of years of gene gains. The team demonstrates that the new gene they detected, named FDY for flagrante delicto Y, was formed about 2 million years ago in a single duplication event of the gene vig2 and its flanking sequence from chromosome 3R. That flanking sequence originally included four other genes, “but they became pseudogenes through the accumulation of deletions and transposable element insertions, whereas FDY remained functional, acquired testis-specific expression, and now accounts for ∼20% of the vig2-like mRNA in testis,” the scientists report. Today, FDY shares 98% sequence identity with its vig2 parent.
The paper details the team’s effort to sequence the FDY region, using RT-PCR, clonal sequencing, and publicly available genome assemblies. Most existing assemblies did not fully cover the region. “Fortunately … the PacBio [MHAP] assemblies covered not only FDY, but also substantial flanking regions,” the scientists write. With that resource, they had their first view of the full sequence of the region. By comparing it to Sanger and Illumina sequence data, they concluded that the PacBio assembly is complete and accurate.
Carvalho et al. went on to figure out when FDY likely appeared in the genome. Their sequence divergence analysis suggests that the duplication occurred once, about 2 million years ago. The gene was found in samples of D. melanogaster from around the world, but does not appear in the fly’s closest relatives.
“Hence a female-biased gene (vig2) gave rise to a testis-biased gene (FDY),” the authors write. “This seems to be a case of gene duplication followed by neofunctionalization, the first reported, to our knowledge, for the Drosophila Y.”
During the final days of the ASHG meeting last week in Baltimore, a number of scientists offered great presentations based on data generated with SMRT Sequencing, including an entire session on building platinum genomes. We’ve rounded up the highlights here:
Karyn Meltz Steinberg from Washington University’s McDonnell Genome Institute spoke about building a platinum human assembly from single-haplotype genomes. Her team defines “platinum” as covering at least 98% of the sequence with every contig associated with a chromosome. They use long-read PacBio sequencing for de novo sequencing and assembly, followed by scaffolding with BioNano Genomics or Dovetail Genomics technology. When necessary, they then perform PacBio sequencing of BACs covering targeted regions, for example to fill gaps. Using CHM13 as an example, she walked through several specific genomic regions and assembly challenges, for both short- and long-read data. By combining BioNano mapping with PacBio sequence data, they produced a hybrid assembly with 254 contigs, compared to 1,590 contigs for the initial PacBio assembly lacking BioNano mapping.
Bobby Sebra from the Icahn School of Medicine at Mount Sinai talked about an effort to resolve regions in the human genome — such as complex structural variants — that have not been addressed by NGS or Sanger sequencing. Working with the NA12878 genome, Sebra and his colleagues combined PacBio and Illumina sequence data with BioNano mapping. The resulting assembly filled 28 gaps in the latest human reference genome and featured a multi-megabase contig N50 length. The comparison to GRCh38 confirmed previous studies suggesting that tandem repeats and other structural variants are underrepresented in the reference genome; long-read sequencing can effectively characterize these regions. Sebra noted that many challenging regions in the human genome have implications for pharmacogenomics or disease associations, and that detailing these regions carefully will be important for clinical utility of genomics.
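For readers less familiar with the contig N50 metric mentioned above, here is a minimal Python sketch of how it is commonly computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the example lengths are made up for illustration and are not taken from the NA12878 assembly described in the talk.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, chosen only to illustrate the calculation.
example_lengths = [8_000_000, 5_000_000, 3_000_000,
                   2_000_000, 1_000_000, 1_000_000]
example_n50 = contig_n50(example_lengths)
print(example_n50)
```

Because a few long contigs dominate the running total, N50 rises quickly as an assembly becomes more contiguous, which is why it is often quoted alongside the raw contig count.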
In that same session, Justin Zook from the National Institute of Standards and Technology presented on progress at the Genome in a Bottle consortium, including some upcoming reference genomes from Han Chinese and Ashkenazi Jewish family trios. These new genomes have been generated with a number of sequencing technologies, including ones from PacBio, BioNano, 10X, Complete Genomics, Oxford Nanopore, and others. GIAB has already released some reference materials, which scientists are using to help validate variant calls for their own genome assemblies. Zook mentioned tools produced by the CDC and underway at the Global Alliance to allow scientists to compare sequencing data to what other projects have reported. They’re also working on analysis tools to show confidence scores for structural variant calls.
In a separate session, Kiana Mohajeri from the University of Washington reported on a region of chromosome 8 that features the largest known inversion variant in the human genome; it spans several megabases and includes several segmental duplications. Seeking to determine the evolutionary history of this region and to better understand the variation found in human genomes, the team sequenced more than 70 BAC clones with SMRT Sequencing. They produced a gap-free 6.2 Mb tiling path with 99.999% accuracy — a far more complete and contiguous sequence than the human reference genome has for this region. The tiling path shows four inversion-associated repeats with 98% sequence identity flanking the internal inversions. By comparing the region to other primate genomes, they theorize that it was formed between 200,000 and 800,000 years ago, but note that the oldest of the repeats appears to be 19 million years old.
During the Wednesday afternoon sessions of last week’s ASHG conference, several speakers provided helpful insights about their use of SMRT Sequencing for a range of applications. Highlights included the following:
Yao Yang, a researcher at the Icahn School of Medicine at Mount Sinai, discussed the development of an assay to genotype the CYP2D6 gene to inform drug dosing in patients. CYP2D6 metabolizes 20-25% of all medications, including antidepressants, antipsychotics, and opiates. There are more than 100 known variants, which include gene deletions and duplications. Variants can have profound impacts on how patients metabolize drugs, with some individuals being ultra-rapid metabolizers and others being poor drug metabolizers. The development of a simple and reliable typing assay has been challenging because CYP2D6 has a highly homologous pseudogene. Yang and his collaborators developed a targeted full-length PCR protocol to amplify both the gene and pseudogene, as well as companion bioinformatics tools to remove random errors (ALEC) and predict the expected phenotype from genotype data (CYP VCF Translator). He shared results showing that the assay is highly reproducible and capable of recapitulating known genotypes in well-characterized samples like NA12878. Furthermore, they uncovered novel CYP2D6 alleles in NA16688 and ASIAN048, which had variants in an intron and exon, respectively. Most impressively, they were able to resolve alleles in NA17084 and other samples in which results from other technologies were discrepant. They foresee using this same approach to develop targeted assays for other similarly challenging genes, and eventually combining these into a multiplexed gene panel of clinically important genes.
Stuart Cantsilieris from the Eichler lab at University of Washington presented work demonstrating the utility of long-read sequencing for understanding the range of structural variant alleles present in the Complement Factor H Gene cluster. The CFHR locus is a well-known hotspot for structural variation, but short-read data has, at best, only provided a rough map of the density of structural variants in the region, and can’t resolve haplotypes or define precise breakpoints. The team sequenced BAC clones with the PacBio RS II, not only to resolve human alleles but also, by sequencing a range of non-human primates, to chart the evolution of the region and map the bases under the most selective pressure. Based on what they learned from the PacBio-sequenced alleles, they developed molecular inversion probes (MIPs) to enable rapid screening of CFHR genes in patient cohorts. Structural variation in the CFHR locus is linked to a number of diseases, including age-related macular degeneration (AMD) and lupus. Interestingly, the same variant can confer risk to one disease and protection from the other. With the MIP screening tool, they have collected variant information from a large patient cohort and hope to better understand how the revealed genotypes relate to patient phenotypes.
Hagen Tilgner from Stanford University explained how long-read technology enables insights into the transcriptome that were not previously possible. Using PacBio long reads to sequence a mixed human tissue sample, Tilgner and colleagues identified many novel protein-coding isoforms. Later, by sequencing a human trio sample, they were able to phase distant SNPs with PacBio reads, something that was not possible with short-read technology. Finally, using long reads, they analyzed human brain samples and found many significant exon pairings. Furthermore, the paired exons were mostly in coding regions. Tilgner emphasized that with long reads, a “phased [brain] proteome will now become possible,” potentially leading to novel biological discoveries.
Maria Nattestad from Cold Spring Harbor Laboratory described using a PacBio system to sequence the genome and the transcriptome of the SK-BR-3 breast cancer cell line. For the genome sequencing, the PacBio data produced a mean read length of 9 kb and a max read length of 71 kb with an average of 72X coverage. To understand the complex genomic rearrangements, Nattestad and her colleagues developed several tools to detect long-range structural variants. Using SMRT Sequencing, they were able to identify the extremely complex and variable translocation occurring between the Her-2 oncogene locus on chromosome 17 and chromosome 8. Importantly, the translocation between chromosomes 17 and 8 produced several fusion genes that were validated via PacBio Iso-Seq transcriptome sequencing. Using Iso-Seq analysis, they identified 17 fusion genes that were supported by both DNA and RNA evidence. Of the 17 fusion genes, 13 were previously reported and four were novel. “The genome informs the transcriptome,” she said, where PacBio long reads help identify complex genome translocations and gene fusions.
The PacBio workshop at ASHG 2015 featured talks from two leaders in human genomics, Rick Wilson of Washington University and Richard Gibbs from Baylor College of Medicine. Mike Hunkapiller, CEO of Pacific Biosciences, opened the workshop with a historical perspective of human genome sequencing, starting with the Human Genome Project. While advances have been made in technology, throughput and cost reductions, the quality of genomes hasn’t kept pace with decreases in cost, he noted. This is why Hunkapiller was particularly proud to share the news of the company’s launch of the Sequel System, which offers SMRT Sequencing and long reads at seven times the throughput of the PacBio RS II and roughly half the cost, making it feasible to use the system for de novo assembly of high-quality human genomes. He also stated that the platform has the capacity to scale over time to handle increasingly higher-density SMRT Cells, pointing toward a future where de novo human genomes will become both practical and routine.
Rick Wilson titled his talk “Of reference genomes and precious metals” and walked the audience through definitions and standards for the various quality levels for de novo assembled human genomes, e.g., platinum, gold, and silver. He noted that this was a good topic for this session because of the important role PacBio has played in the community’s work to create reference-grade genomes. For example, PacBio technology has enabled them to sequence additional genomes (CHM1, CHM13) to a very high quality level. Although these sequences were essential for further refining the GRCh38 reference build, he stated that the current reference genome is still not optimal for some highly polymorphic and complex regions of the genome, and does not adequately represent diverse ancestries.
Wilson outlined their definition of a ‘gold’ genome as a high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions. It is created with PacBio reads for de novo assembly, a scaffold built using BioNano and/or Dovetail and aligned to the reference, and BACs to fill targeted regions and shore up gaps. The list of gold genomes in progress includes the Yoruban, Puerto Rican, Han Chinese, CEU, and Luhya. A ‘platinum’ genome is a contiguous, haplotype-resolved representation of the entire genome, two of which currently exist for the CHM1 and CHM13 hydatidiform moles. While ‘silver’ definition standards are to be determined, this category generally covers non-trio genomes produced with PacBio sequencing and BioNano mapping, with no BAC library.
Richard Gibbs talked about the transition to genomic medicine, which hasn’t been as simple as people would like due to such issues as the incomplete reference genome, the difficulty in characterizing some variation, and the lack of knowledge about the function of some genes. At Baylor, most of the human genome sequencing is done for children with Mendelian disorders. He said that among 7,000 samples processed using short-read exome sequencing, only about 25% of cases are solved. The relatively low diagnosis rate is likely due to structural variation and other regions not captured by short reads.
He discussed some ways to get at structural variation, including PacBio sequencing and the PBJelly and Parliament analysis routines, using as little as 10-fold PacBio coverage. Using these methods they are closing gaps in the genomes of various species; for example, he noted that in the sheep genome they have closed 70% of gaps with PacBio reads. He also mentioned the use of PBHoney to identify inconsistencies between reads and the reference, and that long-range capture strategies using a combination of NimbleGen and PacBio are ‘going beautifully so far.’
To close the workshop, Jonas Korlach, Chief Scientific Officer at PacBio, built on Hunkapiller’s comments by talking about the technology waves that have followed the initial human genome sequencing project, where we are today, and where we are going. Today, we are in what Korlach calls the 4th wave, where more comprehensive whole-genome re-sequencing is occurring, and we are nearing the 5th, when we will actually be able to free ourselves from reference genomes and sequence everything de novo.
Korlach also touched on some of the new developments PacBio is working on, which include amplification-free target enrichment methods, using Cas9 enzyme for targeting, and sequencing native DNA. Other progress will come through the ability to use PacBio sequencing to phase alleles and more comprehensively capture all sizes and types of variants into haplotigs (contiguous haplotype-sequence blocks). Barcoding samples for isoform (Iso-Seq) sequencing and allele-specific methylation analyses are also in the works.
Watch the recording of the entire workshop session.