PacBio blog

This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.

Wednesday, April 1, 2020

HiCanu for HiFi Reads Produces First Assembly of Human Segmental Duplications and Centromeres

In a new preprint, scientists from the National Human Genome Research Institute, the University of Washington, and other institutions describe HiCanu, a modified version of the Canu assembler designed specifically for PacBio HiFi reads. The team put the new assembler through its paces, reporting that it significantly outperformed traditional assembly methods — even getting through centromeres, segmental duplications, and other notoriously difficult regions.

As lead authors Sergey Nurk (@sergeynurk) and Brian P. Walenz, corresponding authors Sergey Koren (@sergekoren) and Adam Phillippy (@aphillippy), and collaborators report, “HiFi is a major leap forward in terms of long-read read accuracy.” They add, “As the accuracy of other long-read technologies have not exceeded 95%, the median accuracy of current HiFi reads can exceed 99.9% (>Q30), making them a promising data type for separating highly similar repeat instances and alleles.”

HiCanu applies homopolymer compression, overlap-based error correction, and tandem repeat masking to eliminate the few remaining errors in HiFi reads, resulting in 97% of reads matching perfectly to a curated reference sequence. This near-perfect accuracy helps to distinguish high-identity genomic repeats, as differences in HiFi reads can be trusted to be biological and not sequencing errors.
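To make the first of those steps concrete, here is a minimal sketch of homopolymer compression in Python. It only illustrates the general idea (collapsing each run of identical bases to a single base); the function name is hypothetical and this is not HiCanu's own code.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to one base (illustrative only)."""
    # groupby yields one (base, run) pair per run of consecutive equal characters.
    return "".join(base for base, _run in groupby(seq))


# Two reads that differ only in homopolymer run length compress to the same string.
read_a = "GATTTTACA"
read_b = "GATTACA"
print(homopolymer_compress(read_a) == homopolymer_compress(read_b))
```

After compression, a disagreement in the length of a homopolymer run no longer shows up as a difference between reads.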

The new assembler generated draft assemblies of Drosophila and several human genomes. The HiCanu assemblies were all highly contiguous and extremely accurate. “On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity,” the scientists write. The reported difference in accuracy is especially large: the HiCanu assembly has 831× fewer errors than the assembly of ultra-long Oxford Nanopore reads.
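The accuracy figures quoted here (99.9% as Q30, 99.999% as QV50) follow the standard Phred scale, where the quality value is -10 times the base-10 logarithm of the error rate. A short Python sketch of the conversion, with hypothetical function names:

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy (between 0 and 1, exclusive) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled QV back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


print(round(accuracy_to_qv(0.999), 2))    # read accuracy of 99.9%, about Q30
print(round(accuracy_to_qv(0.99999), 2))  # consensus accuracy of 99.999%, about Q50
```

Each step of 10 on this scale is a tenfold drop in error rate, so Q50 corresponds to roughly one error per 100,000 bases.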

The team zoomed in on certain regions known to be challenging — including centromeres, segmental duplications, and the MHC locus. For CHM13, the scientists report, “This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions.”

A self dotplot of the HiCanu assembly of the chromosome 3 centromere of CHM13 shows megabase-scale repeat structures resolved by HiFi reads.

HiCanu also deftly handles haplotype phasing, with the authors stating that “HiCanu consistently recovers both haplotypes for the six canonical MHC typing genes in the human genome.”

The authors report several other advantages of HiCanu. First, assemblies generated by HiCanu do not require polishing. In fact, the authors “discourage polishing HiCanu HiFi assemblies, because… polishing pipelines may map reads back to the wrong repeat copies and actually introduce errors.” Second, HiCanu is computationally efficient: “The number of CPU hours required for assembly of a human genome is under 4,000, which could be completed on any modern cloud platform in less than a day for a few hundred dollars,” the team reports. “This is 30-fold less than recent Oxford Nanopore assemblies that required more than 100,000 CPU [hours].”

“We have demonstrated that HiCanu is capable of generating the most accurate and complete human genome assemblies to date,” the scientists write, pointing out that HiCanu could also be applied to non-human genomes, including metagenomic samples. “These results represent a significant advance towards the complete assembly of human genomes.”

Learn more about the computational approach behind HiCanu in this Medium post by Liz Tseng (@magdoll).

Wednesday, March 25, 2020

Sequencing 101: The Evolution of DNA Sequencing Tools

Welcome to the Sequencing 101 blog series – where we will provide introductions to sequencing technology, genomics, and much more!

If you’re not immersed in the field of DNA sequencing, it can be challenging to keep up with the rapid evolution among all the platforms and technologies on the market. Let’s start with a quick overview of how these different technologies came about — and how each is used today.

The evolution of sequencing technology.

 

First Generation Sequencing – Starting the Era of Genomics

The process of Sanger sequencing.

DNA sequencing as we know it originated in the late 1970s, when Frederick Sanger at the MRC Centre in Cambridge developed a gel-based method that combined a DNA polymerase with a mixture of standard and chain-terminating nucleotides, known as ddNTPs. Mixing dNTPs with ddNTPs causes random early termination of sequencing reactions during PCR. Four reactions are run, each with the chain-terminating version of only one base (A, T, G or C). When visualized with gel electrophoresis, one reaction per lane, the fragments are sorted by length, allowing the DNA sequence to be read off base by base. This technique was revolutionary at the time, enabling sequencing of 500-1,000 bp fragments. However, since the original method used radioactive ddNTPs and X-rays, it was less than ideal for widespread use.
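For readers who think in code, the gel-reading logic can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (every possible termination product is present, and fragment lengths are resolved perfectly); the function names are hypothetical.

```python
from typing import Dict, List


def sanger_lanes(synthesized: str) -> Dict[str, List[int]]:
    """Return, for each of the four reactions, the fragment lengths it produces.

    The reaction containing the chain-terminating version of a base yields one
    fragment ending at every position where that base is incorporated.
    """
    lanes: Dict[str, List[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: Dict[str, List[int]]) -> str:
    """Read the sequence by walking the bands from shortest to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _length, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes["A"])       # fragment lengths in the ddATP lane
print(read_gel(lanes))  # sequence recovered from the four lanes
```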

By the 1980s, Sanger’s original method had been automated by scientists at Caltech and commercialized by Applied Biosystems. Radioactive ddNTPs were replaced with dye-labelled nucleotides and large slab gels were replaced with acrylic-fiber capillaries. Scientists could now simply feed prepared DNA into a machine and view the results of fluorescence-based reactions on an electropherogram. This technology, which was continuously improved over the years, served as the bedrock of the Human Genome Project. Today, automated Sanger sequencing is still in use, primarily in clinical labs where it is acceptable to have low throughput, higher per-sample costs, and sequencing reads 500-1,000 bp in length.

But even after the Human Genome Project, the cost of automated Sanger sequencing — also known as capillary electrophoresis — remained too high to enable the kind of large-scale sequencing projects envisioned by scientists. By the mid-2000s, remarkable efforts had been made to bring down the costs of sequencing. Driven largely by grants from the National Human Genome Research Institute (NHGRI), labs around the world tested out new methods for higher-throughput sequencing, using concepts as diverse as electronics, physics, and magnetics.

 

Second Generation Sequencing – Short Reads Become Fast and Efficient

One key player in the advent of next-generation sequencing (NGS) was a UK-based company called Solexa, which was later acquired by Illumina. The key innovation of the Illumina platform was ‘bridge amplification’, which allows the formation of dense clusters of amplified fragments across a silicon chip. Amplification of the original single molecule into a large cluster of many copies is what makes it possible to detect a fluorescent signal as dNTPs are added one at a time during sequencing by synthesis. Over time, the number of clusters that could be read simultaneously grew tremendously, and Illumina instruments became the first commercially available massively parallel sequencing technology. Other tools developed around the same time, such as the Ion Torrent platform, became part of the NGS landscape as well. NGS platforms are the dominant type of sequencing technology used today. Their extreme capacity allows for sequencing at very low cost. They are limited, however, in read length; NGS platforms typically produce reads of ~50-500 bp in length. This makes them an excellent fit for resequencing projects, SNP calling, and targeted sequencing of very short amplicons.

 

Third Generation Sequencing – The Rise of Long Reads

However, short reads are not suitable for all sequencing projects. Another approach that was supported by the so-called $1,000 genome grants from NHGRI was Single Molecule, Real-Time (SMRT) Sequencing from PacBio. This technique uses miniaturized wells, known as zero-mode waveguides, in which a single polymerase incorporates labeled nucleotides and light emission is measured in real time. A different single-molecule approach to long-read sequencing, using pore-forming proteins and electrical detection, was adopted by Oxford Nanopore Technologies (ONT).

 

Watch this short video to learn how SMRT Sequencing works.

 

SMRT Sequencing has a number of advantages. Most notable, perhaps, is its ability to produce long reads, tens of thousands of bases long in a single read. These long reads make it possible to span large structural variants and challenging repetitive regions that confound short-read sequencers because their short snippets cannot be differentiated from each other during assembly. Another advantage is low GC bias, which allows PacBio Systems to sequence through extreme-GC and AT regions that cannot be amplified during cluster generation on short-read platforms. A third advantage is the ability to detect DNA methylation while sequencing, since no amplification is done on the instrument.

 

Short-read sequencing compared to highly accurate long-read sequencing

Short-read sequencing produces reads 50-500 base pairs in length, which can lead to sequence gaps and incomplete assemblies, known as draft genomes. Highly accurate long-read sequencing from PacBio produces reads tens of kilobases in length, creating overlaps which allow for the generation of complete genome assemblies.

 

As scientists began to work with SMRT Sequencing — sometimes known as third-generation sequencing — they realized that it had particular value for applications including de novo genome sequencing, phasing, detection of structural variants, epigenetic characterization, and sequencing of the transcriptome without the need for assembly. Technology improvements over time increased the throughput and accuracy of SMRT Sequencing platforms, bringing their costs in line with NGS platforms for many types of projects. Now, SMRT Sequencing has industry-leading accuracy thanks to its HiFi sequencing, and it is being used around the world to produce reference-grade genomes for microbes, plants, animals, and people.

Monday, March 23, 2020

AGBT 2020 Highlights: Reference-Grade Assemblies, Iso-Seq Data, and More

It was a pleasure to attend the annual Advances in Genome Biology & Technology meeting in sunny Marco Island, Fla., last month. The conference has a long history of supporting sequencing innovation, and during the 20th anniversary celebration this year, the tradition continued. Video and synopses from several presentations featuring SMRT Sequencing are below.

Adam Ameur (@_adameur) from Uppsala University spoke about the use of long-read PacBio sequencing to detect off-target edits from CRISPR/Cas9. In a method known as SMRT-OTS, Ameur’s team used a clever adaptation of the standard PacBio library preparation to enrich for molecules bound by a guide RNA, which were then sequenced to generate HiFi reads. The team also used HiFi reads generated on the Sequel II System to create a de novo assembly of the human cell line used in the experiments. They found 55 off-target sites for three guide RNAs, including inexact matches to the guide RNA. Ameur’s group has already generated preliminary data from editing living cells, an exciting next step for this work. For more detail, check out their recent bioRxiv preprint.

Watch Ameur’s full AGBT 2020 presentation: Studying CRISPR Guide RNA Specificity by Amplification-Free Long-Read Sequencing


A talk about human reference genomes came from Tina Graves-Lindsay at Washington University in St. Louis and the Genome Reference Consortium. “The human reference is a work in progress,” she told AGBT attendees, offering an update on her team’s many contributions to that progress. They have been using SMRT Sequencing (most recently to produce diploid assemblies) and submitting the resulting high-quality assemblies to GenBank. They have moved to PacBio HiFi reads for human genome assemblies, she said, because accurate long reads eliminate the expensive error correction step in analysis and produce reference-grade assemblies with half the sequence coverage needed before. In one recent project using HiFi reads, Graves-Lindsay and her team generated a highly contiguous diploid assembly with 87% represented in haplotigs. She also reported on a new pangenome reference project, which aims to include sequence data from 350 individuals and generate telomere-to-telomere assemblies.

Watch Graves-Lindsay’s full AGBT 2020 presentation: Generating High Quality Human Reference Assemblies


Laura Mincarelli (@MincLaura) from the Earlham Institute gave a presentation that included the use of Iso-Seq data to uncover alternative splicing events in individual stem cells and progenitor cells. She noted that PacBio long-read data is advantageous for this approach because it helps measure cell traits by allowing users to view entire transcripts, not the snippets produced by other technologies. She reported that SMRT Sequencing led to another benefit: the detection of more than 2,100 novel exons in some 950 genes. This work is helping her understand the effects of aging in cells.

Finally, Brenda Oppert from the U.S. Department of Agriculture shared results from the generation of reference-grade insect genome assemblies as part of a larger project to understand potential insect-based food sources for humans. With severe food shortages looking possible in just a decade, she said, “We’ve got to start thinking outside of the box now.” Insects could be a promising alternative protein source, so Oppert has been sequencing their genomes with PacBio technology. “The long reads are absolutely essential for insects,” she said. In cases like the mealworm, for instance, 60 percent of the genome is satellites consisting of units of 142 nucleotides with less than 2 percent sequence divergence. Oppert reported that on the Sequel II System, a single SMRT Cell provides sufficient coverage to produce a high-quality assembly for most insects.

Watch Oppert’s full AGBT 2020 presentation: Feed the World: Developing Genomic Resources for Insects as Food


The PacBio team also had the opportunity to present several posters at AGBT:

Jason Underwood presenting at AGBT 2020

Learn more about SMRT Sequencing applications and HiFi reads for human biomedical research, plant and animal sciences and microbiology and infectious disease.

Wednesday, March 18, 2020

Prokaryotic Methylation Detection on the Sequel II System

Since the first PacBio instrument was released in 2011, methylation detection has been one of the advantages of SMRT Sequencing. The kinetics of nucleotide incorporation change as the DNA polymerase moves across a methylated position on the DNA template strand, producing distinctive perturbation patterns (Figure 1) that can be recognized by methylation-calling software.

Figure 1: The arrows indicate the methylated positions on a 199 bp circular template. Bars indicate the ratio of the average interpulse duration (IPD) on the methylated template to that of the control template. Each methylation type produces a unique fingerprint.
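The ratio plotted in Figure 1 is simple to compute once per-position IPD observations are in hand. The Python sketch below assumes a hypothetical input layout (one list of IPD values per template position, for a methylated sample and an unmethylated control); it is an illustration of the ratio, not the SMRT Analysis implementation.

```python
from statistics import mean
from typing import List, Sequence


def ipd_ratios(
    methylated: Sequence[Sequence[float]],
    control: Sequence[Sequence[float]],
) -> List[float]:
    """Per-position ratio of mean IPD (methylated template / control template)."""
    if len(methylated) != len(control):
        raise ValueError("both templates must cover the same positions")
    return [mean(m) / mean(c) for m, c in zip(methylated, control)]


# Made-up IPD values (seconds) for three positions; the second is slowed down.
methylated = [[0.9, 1.1], [4.0, 6.0], [1.0, 1.0]]
control = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(ipd_ratios(methylated, control))
```

A ratio near 1 means the polymerase moved at its usual pace; a ratio well above 1 marks a position where incorporation was delayed.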

With the advent of a simple method for detecting methylation in prokaryotes, researchers have demonstrated that in addition to functioning as a defense against phages, bacterial R-M systems can also drive important traits like antibiotic resistance, immune evasion, virulence and persistence in hosts.

Recent internal validation work has confirmed that detection of m6A and m4C in prokaryotic DNA, and of the R-M system target motifs they reside in, continues to perform robustly on the Sequel II System. The detection of m5C continues to require significantly higher coverage and is therefore not supported through the SMRT Analysis ‘Base Modification Analysis’ workflow.

Figure 2. Detection of methylation in E. coli K. ‘Type’ compares the IPD fingerprint of the reported motif to empirical models of m4C and m6A sequencing perturbations. ‘% Detected’ reports what fraction of motifs present in the assembly are above the specified Modification QV threshold. ‘Mean QV’ is a measure of confidence that the flagged base within the reported motif is methylated.

Our initial validation was done on E. coli K, sequenced as part of a 48-plex sequencing run on the Sequel II System (Figure 2). All three known m6A motifs were successfully detected. In addition, the high coverage weakly detected the known target of the Dcm m5C methylase, CCWGG. However, since m5C calling is not supported, it was erroneously tagged as m6A.

An important takeaway is that to obtain the cleanest motif-finding result, the ‘Minimum Qmod Score’, available as an advanced parameter in the ‘Base Modification Analysis’ application in SMRT Analysis, had to be increased manually. As shown by the red arrow in Figure 2, this value should be set such that it excludes most baseline noise while fully including the cloud of methylation signal. In this example, the ideal setting is Qmod = 200. While the optimal value of Qmod changes with sequencing coverage, we have found a value of 100 produces a good result in most cases when sequencing 48 microbes per SMRT Cell 8M.
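To see why the threshold matters, here is a small Python sketch of how a ‘% Detected’ style figure responds to the minimum Qmod score. The function name, the input layout (one modification QV per motif occurrence), and the use of "greater than or equal to" as the cutoff rule are all assumptions for illustration, and the QV values are made up.

```python
from typing import Sequence


def percent_detected(site_qvs: Sequence[float], min_qmod: float) -> float:
    """Percentage of motif occurrences whose modification QV meets the threshold."""
    if not site_qvs:
        return 0.0
    detected = sum(1 for qv in site_qvs if qv >= min_qmod)
    return 100.0 * detected / len(site_qvs)


# Hypothetical modification QVs for four occurrences of one motif.
site_qvs = [250, 30, 400, 120]
for threshold in (100, 200):
    print(threshold, percent_detected(site_qvs, threshold))
```

Raising the threshold removes baseline noise but also drops genuinely methylated sites with weaker signal, which is why the value has to track sequencing coverage.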

To better assess performance across the full range of methylation patterns seen in microbes, we then analyzed data from 4 more challenging microbes. These more difficult examples confirm that the Sequel II System can detect both m6A and m4C at the same level of performance seen with our previous sequencing systems. The known R-M systems in Neisseria meningitidis FAM18 (Table 1), Treponema denticola A (Table 2), and Methanocorpusculum labreanum Z (Table 3) were largely recovered at high confidence. The few exceptions are likely due to competition between multiple methyltransferases that target overlapping motifs.

The most difficult test case was H. pylori J99, which carries 24 distinct R-M systems, targeting m6A, m4C, and m5C. We called 21 of the 24 motifs exactly right. In one instance our motif caller was confounded by overlapping motifs, but the correct answer could be easily discerned by visual examination. The remaining two missed motifs involve m5C, which continues to be unsupported.

Table 1. m6A motifs of N. meningitidis. N. meningitidis also has six m5C motifs (CCTTC, GCGCGC, TCTGG, CCAGA, CCGG, RCCGGY) which were not detected. The low % detected for ACACC is likely the result of competition between methyltransferases for overlapping m5C sites (CCGG, RCCGGY).

 

Table 2. The motifs of all 9 R-M systems active in T. denticola were detected without error.

 

Table 3. R-M system recognition motifs of M. labreanum. The low percent detected for ACCNNNNNNRTGA / TCAYNNNNNNGGT is most likely due to competition between m6A modification and m4C modification of the overlapping GTAC motif.

 

Table 4. H. pylori J99 contains 24 active methyltransferases. The two motifs marked with an asterisk are split because our pattern-finding software was confounded by the partially overlapping CATG motif. GWCAYH (H = ‘not G’) + GWCACG (the missing G!) = GWCAYN (correct call; Y = C/T) – CATG (distinct R-M system target, called correctly).
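The motif arithmetic in the Table 4 caption can be checked by expanding each motif into the concrete sequences it matches. The Python sketch below uses the standard IUPAC nucleotide ambiguity codes (for example W = A/T, Y = C/T, H = not G, N = any base); the helper name is hypothetical.

```python
from itertools import product
from typing import Set

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(motif: str) -> Set[str]:
    """Return every concrete DNA sequence matched by an IUPAC motif."""
    return {"".join(bases) for bases in product(*(IUPAC[code] for code in motif))}


# The two split calls reported by the motif finder, taken together ...
split_calls = expand("GWCAYH") | expand("GWCACG")
# ... versus the true motif with the CATG-containing sequences removed.
expected = {seq for seq in expand("GWCAYN") if "CATG" not in seq}

print(split_calls == expected)
```

The two sets are identical: the only GWCAYN sequences missing from the split calls are the ones ending in CATG, which belong to the separate CATG system.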

We hope these results will give all our customers who study prokaryotic methylation the confidence to move forward with planning bacterial whole genome sequencing experiments on the Sequel II System, taking full advantage of the higher multiplexing capacity and reduced per sample cost.

Learn more about bacterial whole genome sequencing and prokaryotic epigenetics on the Sequel II System.

Monday, March 16, 2020

A Menagerie of New Genomes Released by International Ensembl Project

The new and updated species in Ensembl 99 from the Vertebrate Genomes Project (VGP)

 

Meerkats, yaks, geese, and lots of flies — oh my! A full menagerie of new and updated animal genomes has been released by the Ensembl project.

The Ensembl 99 release includes a variety of vertebrates, plants, mosquitos, and flies, as well as updates of human gene annotation and variation data.

Among them are 38 new species and two dog breeds (Great Dane and Basenji), as well as four updated genome assemblies. Many were created using PacBio sequencing data.

 

Thirteen of the new assemblies have been produced by the Vertebrate Genomes Project (VGP):

  • Canada lynx
  • Greater horseshoe bat
  • Golden eagle
  • Kakapo
  • Jewelled blenny (pictured, right)
  • Pinecone soldierfish
  • Live sharksucker
  • Orbiculate cardinalfish
  • Gilthead seabream
  • River trout (also part of the Sanger 25 Genomes Project)
  • Zebra finch (updated assembly)
  • Asian bonytongue (updated assembly)
  • Fugu (updated assembly)

Part of Ensembl’s mission is to provide gene annotation for the genome assemblies produced by this long-term global collaboration.

 

The release also included six more mammalian genome assemblies:

  • Siberian musk deer
  • Chacoan peccary
  • Sperm whale
  • Meerkat (pictured, right)
  • Arabian camel
  • Domestic yak

Nine more bird genome assemblies:

  • Gouldian finch
  • Yellow-billed parrot
  • Burrowing owl
  • African ostrich (pictured, right)
  • Swan goose
  • Indian peafowl
  • Eurasian sparrowhawk
  • Golden pheasant
  • Ring-necked pheasant

Ten fish:

  • Golden-line barbel
  • Blind barbel (pictured, right)
  • Horned golden-line barbel
  • German mirror carp
  • Hebao red carp
  • Huanghe carp
  • Atlantic salmon (also part of the AquaFAANG project)
  • Blue tilapia
  • Round goby
  • Nile tilapia (updated assembly)

And four new reptiles:

  • Komodo dragon (pictured, right)
  • Common wall lizard
  • Eastern brown snake
  • Three-toed box turtle

As for the mouse and human genomes, Ensembl 99 includes updated annotations with many new genes and changes to existing ones – GENCODE M24 and GENCODE 33, respectively.

 

There are also 35 new metazoa genome assemblies, including:

  • 18 new Anopheles mosquito species
  • an update from the L3 to L5 assembly for Aedes aegypti
  • the Zika virus vector Aedes albopictus
  • six Tsetse fly species
  • two Sand fly species
  • the freshwater snail vector of schistosomiasis (Biomphalaria glabrata)
  • the common bedbug (Cimex lectularius)
  • the Lyme disease tick (Ixodes scapularis)
  • common house fly (Musca domestica)
  • stable fly (Stomoxys calcitrans)

And there are some sweet additions to plant genomes:

  • Sweet cherry (Prunus avium)
  • Clementine (Citrus clementina)
  • Morning glory (Ipomoea triloba)
  • Wild sugarcane (Saccharum spontaneum)

Started in 1999 to annotate the human genome and make all data publicly and freely available via the web, the Ensembl project is based at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), located on the Wellcome Genome Campus near Cambridge, UK, and now involves hundreds of scientists from around the world.

 

See the full list of all Ensembl genomes and learn more about plant and animal sequencing on the PacBio Systems.

 

Wednesday, March 11, 2020

SMRT Grant Winners: Three Scientists Selected to Use HiFi Sequencing to Tackle Genomic Challenges

Apply for the 2019 HiFi for All SMRT Grant to discover how highly accurate long reads can advance your science

 

PacBio highly accurate long reads, known as HiFi reads, offer all the benefits of long-read sequencing with accuracy comparable to short-read sequencing. To celebrate this new paradigm in sequencing technology, we hosted the 2019 HiFi for All SMRT Grant this past fall. This SMRT Grant was open to scientists worldwide and offered three winning projects each up to six SMRT Cells 8M and sequencing on the Sequel II System by our Certified Service Providers and co-sponsors.

In response to our call for projects across the range of SMRT Sequencing applications, we received many truly compelling proposals, which made selecting the winners quite a challenge. Today, we are thrilled to announce the three winners of this SMRT Grant and share a glimpse into how they will use HiFi sequencing to tackle a diverse set of scientific questions.

 

Holding on by a Claw: Elucidating the Genomics of the African Leopard

Winner: Ellie Armstrong (@_ellie_cat), Stanford University

A young leopard poses for a photo in the Okavango Delta, Botswana. Photo by Ellie Armstrong.

Synopsis: This project will generate a high-quality genome assembly for the African leopard, a big cat facing endangered status due to habitat loss, hunting, and illegal wildlife trade. Very little genetic information has been produced for leopards. A high-quality assembly will be important for conservation genomics and for investigating genetic and structural variation across leopard subspecies.

“We are thrilled to be working with PacBio to produce a high-quality assembly of the African leopard. Leopards are extremely elusive, making them a prime species for the development of genomic monitoring tools. This genome will allow us to investigate the distribution of genomic diversity of leopards, their evolutionary history, and gain insight into how they adapt to such a wide variety of landscapes.” – Ellie Armstrong

 

Sequencing for this project will be provided by Georgia Genomics and Bioinformatics Core.

 


 

Establishing the Largest Longitudinal HIV Sequence Database Ever Assembled

Winner: Daniel Sheward (@DannySheward), University of Cape Town

The HIV Diversity and Pathogenesis Group at the Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, South Africa.

Synopsis: In a collaboration between the University of Cape Town (PI: Carolyn Williamson), the National Institute of Communicable Diseases of South Africa (PI: Penny Moore) and the Karolinska Institutet (PI: Ben Murrell), scientists will use PacBio sequencing for more than 1,000 samples collected from 150 women in the South African CAPRISA Acute Infection cohort to perform a longitudinal study of HIV infection. A highly multiplexed approach will allow for HiFi sequencing of the virus in all samples to generate the largest sequence database of longitudinally collected HIV samples. The information gleaned from this database is expected to contribute to the research community’s understanding of viral evolution, latency, immunology and vaccine development.

“We are extremely excited about this project. With HiFi sequencing on the Sequel II System, a project of this scope is finally feasible.” – Daniel Sheward

 

Sequencing for this project will be provided by the Earlham Institute.

 


 

Asian Reference Genome

Three ethnic groups (Chinese, Malays and Indians) of Singapore population. Image credit: Gloria Fuentes – The Visual Thinker LLP.

Winner: Jianjun Liu, Genome Institute of Singapore

Synopsis: With this SMRT Grant, scientists will sequence the genomes of three individuals — one each of Chinese, Indian, and Malay descent. This is part of a larger effort at the Genome Institute of Singapore to generate Asian population-specific reference genomes for improved variant calling for people in these populations. The assemblies produced through the SMRT Grant will be used to analyze structural variation, evaluate different genome assemblers and adapt the Institute’s methods for HiFi data. Ultimately, population-specific data will play an important role in the implementation of precision medicine for people of all ancestries.

“We are excited about this project and very thankful for support by PacBio and DNA Link. Genomic analysis of Asian populations has fallen behind the efforts in western populations. We hope that our effort can help to improve it by providing tools and resources that can empower the studies of Asian populations.” – Jianjun Liu

 

Sequencing for this project will be provided by DNA Link, Inc.

 


Congratulations to all our HiFi for All SMRT Grant winners! And thank you to our co-sponsors for teaming up with PacBio to make these SMRT Grants possible. Explore the 2020 SMRT Grant Programs to apply to have your project funded.


Thursday, March 5, 2020

Beyond Contiguity – Assessing the Quality of Genome Assemblies with the 3 C’s

With high-throughput long-read sequencing, it is now affordable and routine to produce a de novo genome assembly for microbes, plants and animals. The quality of a reference genome impacts biological interpretation and downstream utility, so it is important that researchers strive to achieve quality similar to “finished” assemblies like the human reference, GRCh38.

Until sequence data and the resulting assemblies can regularly achieve reference quality, assemblies should be evaluated in three key dimensions: Contiguity, Completeness, and Correctness. However, the most commonly used measures of genome quality only tackle two of the three C’s.

Contiguity is often measured as contig N50: the length such that contigs of that length or longer contain 50% of the total genome length. In this era of long-read genome assemblies, a contig N50 over 1 Mb is generally considered good.
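To make the N50 definition concrete, here is a minimal Python sketch. It assumes the "total genome length" is simply the sum of the contig lengths in the assembly (a common convention, but an assumption here), and the contig lengths are made-up toy values.

```python
def contig_n50(contig_lengths):
    """Return the contig N50 for a list of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to half the total is the N50.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: five contigs totalling 10 Mb.
toy_lengths = [5_000_000, 3_000_000, 1_000_000, 500_000, 500_000]
toy_n50 = contig_n50(toy_lengths)
print(toy_n50)
```

In this toy case the single longest contig already holds half of the assembly, so it sets the N50 on its own. That is also why N50 says nothing about completeness or correctness.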

Completeness is often measured using BUSCO (Benchmarking Universal Single-Copy Orthologs) scores, which look for the presence or absence of highly conserved genes in an assembly. The aim is to have the highest percentage of genes identified in your assembly, with a BUSCO complete score above 95% considered good.

Correctness, the third and final C, is more challenging to measure. Correctness can be defined as the accuracy of each base pair in the assembly and is most often measured as concordance of an assembly to a gold standard reference. Of course, when sequencing a novel species there may not be a reference against which to measure. Furthermore, concordance is only a good measure for accuracy when the gold-standard itself is very high quality and when there is little biological divergence between the reference sample and assembly sample (Figure 1).

So, how does one properly measure the accuracy of a generated genome assembly? Well, we explored several methods you might find useful and broke them down by what type of orthogonal data is needed for each.

 

ASSESS FRAMESHIFTS
Data needed: Transcript Annotations

One measure of correctness is the number of frameshifting indels in coding genes. Frameshifts often disrupt the production of the protein encoded by the gene and are rare. Thus, most observed frameshifts are actually assembly errors.

This approach is similar to BUSCO, but rather than utilizing a small conserved set of genes, a larger set of genes is analyzed. This requires a set of transcripts from the same (or very closely related) sample, which are commonly generated as part of a genome annotation project. PacBio RNA Sequencing, using the Iso-Seq method, is a good strategy for genome annotation.

The primary advantage of this approach is that you may be able to use an annotation or RNA sequencing data that is already in existence. The primary disadvantages of this approach are that it assesses only a small percentage of the genome (often less than 1%, often some of the most conserved regions) and may underestimate accuracy since not all frameshifts are errors.
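The frameshift idea can be sketched in a few lines of Python. This is only an illustration: it assumes you have already aligned transcripts to the assembly and extracted the signed lengths of indels that fall inside coding sequence (positive for insertions, negative for deletions). The function name and the toy input are hypothetical.

```python
def count_frameshifts(coding_indel_lengths):
    """Count indels in coding sequence whose length is not a multiple of 3.

    An indel shifts the reading frame only when its length is not divisible
    by three; in-frame indels add or remove whole codons instead.
    """
    return sum(
        1
        for n in coding_indel_lengths
        if n != 0 and abs(n) % 3 != 0
    )


# Hypothetical indels observed within annotated coding regions.
toy_indels = [1, -2, 3, -6, 4]
toy_frameshifts = count_frameshifts(toy_indels)
print(toy_frameshifts)
```

Dividing this count by the number of genes examined gives a frameshift rate that can be compared between assemblies of the same sample.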

 

DELVE INTO HIGH-CONFIDENCE REGIONS
Data needed: Reference Genome & Short Reads

Sometimes a reference genome is available for the same species but for a different individual than the one being assembled. In such cases, it is useful to define “high-confidence regions” where the reference is a good match to the sample and then assess the assembly only within those high-confidence regions. The Genome in a Bottle Consortium has applied such an approach for human samples.

How to generate a high-confidence region benchmark from Kingan, et al.

To build high-confidence regions as Kingan, et al. did for human, rice, and Drosophila, short-read sequencing data is mapped to the reference and used to exclude low-confidence regions including those with abnormal coverage or in close proximity to variants. Within the resulting high-confidence regions, concordance is a good measure of assembly accuracy.

The advantages of this approach are that it provides a good measure of assembly accuracy and explicitly identifies errors as discordances between an assembly and the high-confidence regions. Discordances can then be examined to determine how to improve the assembly.

The disadvantages are that it requires an independent reference genome from which to start, as well as additional short-read data. Also, the accuracy estimate can be somewhat optimistic by excluding “difficult” regions from evaluation or somewhat pessimistic if true biological variants are not removed from the benchmark.
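As a rough illustration of the high-confidence idea, the Python sketch below masks out reference bases with abnormal short-read depth or close to a called variant, then measures concordance only on what remains. The depth thresholds, flank size and toy sequences are all assumptions chosen for the example, not the criteria used by Kingan, et al. The sketch also assumes a gap-free, base-by-base alignment of the assembly to the reference.

```python
def high_confidence_mask(depths, variant_positions, min_depth, max_depth, flank):
    """Return one boolean per reference base: True where the base is high-confidence."""
    n = len(depths)
    # Start by keeping only bases whose short-read depth looks normal.
    mask = [min_depth <= d <= max_depth for d in depths]
    # Then drop every base within `flank` bases of a called variant.
    for pos in variant_positions:
        for i in range(max(0, pos - flank), min(n, pos + flank + 1)):
            mask[i] = False
    return mask


def concordance(reference, assembly_aligned, mask):
    """Fraction of high-confidence bases where the assembly matches the reference."""
    kept = [(r, a) for r, a, keep in zip(reference, assembly_aligned, mask) if keep]
    if not kept:
        raise ValueError("no high-confidence bases to evaluate")
    return sum(r == a for r, a in kept) / len(kept)


# Toy data: one base with abnormally high depth (index 2), one variant (index 7).
toy_reference = "ACGTACGTAC"
toy_assembly = "ACGTTCGTAC"  # differs from the reference at index 4
toy_depths = [30, 30, 90, 30, 30, 30, 30, 30, 30, 30]
toy_mask = high_confidence_mask(toy_depths, [7], min_depth=10, max_depth=60, flank=1)
toy_concordance = concordance(toy_reference, toy_assembly, toy_mask)
print(sum(toy_mask), toy_concordance)
```

Widening the flank or tightening the depth window shrinks the benchmark, which is one way the estimate can drift toward the optimistic side.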

 

EXPLORE BAC COMPARISONS
Data needed: BAC Sequences

In cases where a reference is not available but another set of high-quality sequences, such as Bacterial Artificial Chromosomes (BACs), exists for the same sample, you can measure concordance between your assembled contigs and the BAC sequences. This method was used by Vollger, et al. when validating the accuracy of one of the first human assemblies generated using PacBio highly accurate long reads, known as HiFi reads.

 

COUNT ERRORS WITH SHORT READS
Data needed: Short Reads

It is possible to measure accuracy even for a species with no existing reference genome by comparing the k-mers in an assembly to k-mers from short reads from the same individual. One tool to do this is yak from Heng Li. Another is merqury, developed by Arang Rhie in Adam Phillippy’s group.

K-mer spectrum comparison to identify errors

The advantages of this approach are that it does not require a reference genome and does not ignore difficult regions of the assembly. It also provides a way to measure completeness by flipping the comparison and looking for k-mers present in the short reads that are missing in the assembly. Merqury has the additional ability to track the coordinates of errors: it outputs files that can be loaded as IGV tracks so the user can visualize misassemblies or other errors. Merqury has many additional functions like outputting spectra-cn plots and, for users with a trio, assessing contig phasing accuracy with statistics and plots.
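The k-mer comparison can be sketched with plain Python sets. This is a toy illustration, not the yak or Merqury implementation: it ignores reverse complements, k-mer multiplicity and sequencing errors in the reads, all of which real tools handle. The conversion from the unsupported k-mer fraction to a Phred-style quality value is modeled on the approach described for Merqury and should be treated as an assumption.

```python
import math


def kmers(seq, k):
    """Set of all k-mers in a sequence (forward strand only, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare_kmers(assembly, reads, k):
    """Return (assembly-only k-mers, read-only k-mers, Phred-style QV estimate)."""
    asm = kmers(assembly, k)
    if not asm:
        raise ValueError("assembly is shorter than k")
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    unsupported = asm - read_kmers   # likely assembly errors
    missing = read_kmers - asm       # flipped comparison: a completeness signal
    if not unsupported:
        qv = math.inf
    else:
        # Assumed model: a k-mer is supported only if all k of its bases are correct.
        per_base_accuracy = (1 - len(unsupported) / len(asm)) ** (1 / k)
        qv = -10 * math.log10(1 - per_base_accuracy)
    return unsupported, missing, qv


# Toy example: the assembly carries one substitution relative to the read.
toy_reads = ["AAAACCCCGGGGTTTT"]
toy_assembly = "AAAACCCCTGGGTTTT"
toy_unsupported, toy_missing, toy_qv = compare_kmers(toy_assembly, toy_reads, k=4)
print(sorted(toy_unsupported), sorted(toy_missing), round(toy_qv, 2))
```

A single wrong base produces k unsupported k-mers, which is why the fraction is taken to the power 1/k before it is converted to a per-base error rate.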

Example of a homozygous variant that indicates an error using short reads.

Similar to the k-mer approach above, short-read data can be used to count errors by aligning short reads to the assembly and identifying single nucleotide differences. An error rate can then be calculated by dividing the total count of SNVs by the number of bases in an assembly covered by at least 3 short reads. The short-read data can be from a closely related individual, although estimates of correctness are most accurate when the same individual is used as Koren, et al. did with an F1 cross of two breeds of cattle.

Like the k-mer method, this approach requires no reference genome or transcript dataset and does not ignore difficult regions of the assembly. Unlike the k-mer approach, potential errors in the assembly can be identified and characterized in order to improve the assembly method.
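The error-rate calculation described above is simple arithmetic, sketched here in Python. The per-base depths and SNV count are made-up toy values. In practice they would come from a short-read alignment and a variant caller, neither of which this sketch attempts to reproduce.

```python
def covered_bases(depths, min_depth=3):
    """Number of assembly bases covered by at least `min_depth` short reads."""
    return sum(1 for d in depths if d >= min_depth)


def snv_error_rate(snv_count, depths, min_depth=3):
    """SNVs divided by the number of adequately covered assembly bases."""
    denominator = covered_bases(depths, min_depth)
    if denominator == 0:
        raise ValueError("no bases meet the coverage threshold")
    return snv_count / denominator


# Toy per-base short-read depths along a 10-base contig, with one SNV called.
toy_depths = [0, 2, 3, 10, 30, 1, 5, 8, 3, 4]
toy_covered = covered_bases(toy_depths)
toy_rate = snv_error_rate(1, toy_depths)
print(toy_covered, toy_rate)
```

Restricting the denominator to covered bases avoids crediting the assembly for regions where the short reads could not have revealed an error in the first place.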

 

Looking to the Future

Exciting progress in long-read sequencing and genome assembly has made it standard to produce contiguous, complete genomes. In order to generate genomes that are not simply assembled, but are also effectively used for downstream biology, we must address the third dimension of quality: correctness. The techniques discussed above make it easy to measure correctness with a variety of different orthogonal data types. We expect these approaches will identify which sequencing workflows produce the most accurate genomes and will nudge the field towards an era of reference-grade de novo assemblies.

 

Learn more about HiFi sequencing data for your organism of interest or get in touch with a PacBio scientist to scope out your project.


Wednesday, March 4, 2020

Nice to See You, Telomere: Scientists Use SMRT Sequencing for Previously Intractable Regions of the Human Genome

Diagram depicting telomere shortening. Source: http://2014hs.igem.org/Team:TAS_Taipei/project/abstract

Telomeres and centromeres have long vexed genomic scientists. In the early days of genome sequencing, many researchers took it for granted that assembling these highly repetitive regions was essentially impossible.

That’s why a new preprint posted to bioRxiv is so exciting. Scientists from Weill Cornell Medicine and Colorado State University describe the use of PacBio long-read whole genome sequencing to analyze and assemble telomeres, characterizing the heterogeneity of these elements across three human genomes from the Genome in a Bottle collection (HG001, HG002, HG005).

“Haplotype Diversity and Sequence Heterogeneity of Human Telomeres” comes from lead authors Kirill Grigorev (@LankyCyril) and Jonathan Foox (@jfoox), senior author Chris Mason (@mason_lab), and collaborators. They took on this project to overcome existing challenges with assembling telomeres and to establish a better protocol that others could replicate.

“Given their length and repetitive nature, telomeric regions are not easily reconstructed from short read sequencing, making telomere sequence resolution a very costly and generally intractable problem,” the authors write. “We describe a framework for extracting telomeric reads from single-molecule sequencing experiments, describing their sequence variation and motifs, and for haplotype inference.”

Short reads, which are typically no more than a few hundred bases, can read DNA in telomeric regions, but during alignment they struggle to differentiate the highly repetitive regions and to represent them accurately without collapsing several repeats into one. Highly accurate long PacBio CCS reads, known as HiFi reads, produced by SMRT Sequencing can represent tens of thousands of base pairs in one long stretch. This greatly reduces the alignment challenge, facilitating the accurate assembly of even the most repetitive regions in the genome.

PacBio HiFi reads, generated using the circular consensus sequencing (CCS) mode, capture human telomere sequence at the end of chromosome arms of a Genome in a Bottle human subject, HG002. Highly accurate long reads resolve novel sequence variation repeat motifs within human telomeric haplotypes. Image courtesy of Chris Mason.

“We find that long telomeric stretches can be accurately captured with long-read sequencing,” the scientists report. In the preprint, they describe the ability to observe sequence heterogeneity, discover novel and known non-canonical motifs, and create motif composition maps. Their framework, known as edgeCase, was validated with PacBio sequencing data sets from the Genome in a Bottle consortium.

While the team’s results confirmed that TTAGGG, the canonical repeat associated with telomeric regions, is the dominant motif, there was “a surprising diversity of repeat variations” including known and novel variants. This previously untapped diversity was masked by “the necessary bias towards the canonical motif during the selection of short reads,” the scientists suggest. “Telomeric regions with higher content of non-canonical repeats are less likely to be identified through the use of short reads, and instead, long reads appear to be more suitable for this purpose,” they add.

The team concludes: “The identified variations in long range contexts enable clustering of SMRT reads into distinct haplotypes at ends of chromosomes, and thus provide a new means of diplotype mapping and reveal the existence and motif composition of such diplotypes on a multi-Kbp scale.”


Friday, February 28, 2020

A Rare Opportunity to Help Tackle Daughter’s Rare Disease

The rarest day on the calendar is February 29th — which makes it the perfect time to celebrate Rare Disease Day. On this day, we join millions of people around the world making time to honor the patients, caregivers, healthcare professionals and scientists who deal with rare diseases every day.

Zoe Harting was diagnosed with Type 1 SMA and was not expected to live past the age of 2, but is now reaching unprecedented milestones as an energetic 7-year-old, thanks to an experimental treatment.

And we didn’t have to look far to find someone affected.

Bioinformatician John Harting, of our Applications Development group at PacBio, came face-to-face with spinal muscular atrophy (SMA), the rare disease that is the most common genetic disease in infants, when his first child, Zoe, was three months old.

Everything seemed normal when she was born in October 2012, but John and his wife Eliza started to notice that Zoe didn’t have much strength or activity.

“We were new parents so we didn’t know any better,” John says. “We asked the doctor a couple times, but we kept getting answers like: ‘Some children develop slower,’ or ‘Maybe she has a bit of hypotonia that will go away.’ Basically, they were reluctant to look much deeper.”

During a holiday visit to the in-laws, however, John noticed how little his daughter was moving compared to her cousin, who was a few weeks younger. The couple returned to their doctor and demanded testing. A few weeks later, they got a call from a neurologist, suggesting they come in to meet with a doctor and social worker.

“It was pretty scary to get that call and to drive to the hospital expecting bad news,” John says.

And the news was indeed bad. Zoe was diagnosed with Type 1 SMA, the most severe form of the fatal degenerative neuromuscular disease.

“They told us, basically, that within two years she was going to pass, and we could just hold her and love her and let her go. That was all we could do. There were no treatments at the time.”

Fortunately, that was not the case. The couple switched pediatricians, and the new doctor happened to have attended a conference where she heard a presentation by Stanford pediatric neurologist John Day about a new potential SMA treatment, and a clinical trial recruiting candidates.

Zoe was accepted on the trial and became the first child in the world to receive the experimental drug, Nusinersen, now approved by the FDA and sold as Spinraza.

Zoe is now a 7-year-old first grader with a strong personality and growing independence, who loves to chase her classmates around the playground in her mini motorized wheelchair.

“She’s slowly started to achieve milestones that SMA Type 1 patients never did before,” John says.


 

Contributing to the Cure

John says he is proud to be working at a company that is helping to make further advances in rare disease treatment possible.

While any given rare disease affects a relatively small number of people, these diseases collectively affect some 400 million people around the world. About 7,000 rare diseases are currently recognized.

In the case of SMA, the disease is triggered by a gene mutation that is actually quite common: it is estimated that 1 in 40 people carry it. When both parents have the mutation, their child can develop the disorder, which causes muscles to atrophy because they don’t receive the right signals from the spinal cord.

Genetic screening and early intervention are crucial. But this is tricky. Standard tests that use PCR (polymerase chain reaction) and short-read technologies don’t always put mutations in the right context. For example, there may be “pseudogenes,” where big stretches of the genome are replicated and look almost identical, making it difficult to distinguish between true carrier genes and other kinds of unusual gene conversions.

“There can be false negatives. Someone could look like they are not a carrier, when in fact they have two copies of the gene, but they are both on one chromosome, instead of one on each chromosome,” John says.

PacBio sequencing is changing that, by allowing scientists to read parts of the genome that have been difficult to reach until now. John has also been collaborating with others to dive deeper into target genes and to develop assays for diagnostics and drug development.

“A lot of these rare diseases are genetic, and are in places that are difficult to sequence, where they overlap with more common diseases,” John says.

“There’s a lot of interesting stuff that goes on that’s almost invisible to some of the other technologies and we can help to learn more about them and contribute to some important medical discoveries.”

 

Helping to SOLVE Rare Diseases

How PacBio sequencing is helping in rare disease research.

Rare disease researchers are increasingly turning to PacBio long-read sequencing technology to study areas of the genome inaccessible by other means, or to unravel complex disease-causing variants, such as tandem repeats, structural variants, complex rearrangements, and transposable elements.

For example, long reads can be a straightforward way to detect repeat changes, because an adequately long read can encompass an entire expanded repeat as well as the flanking unique sequences.

In a review in the Journal of Human Genetics, Satomi Mitsuhashi and Naomichi Matsumoto from the Yokohama City University in Japan note that “long-read sequencing is especially highly recommended when repeat diseases or complex chromosomal rearrangements are suspected.”

The SOLVE-RD research program, a European-based consortium of more than 20 institutions, is also using the PacBio Sequel II System to sequence more than 500 whole human genomes with the aim of pinpointing disease-causing variants.

“Even with exome sequencing, as many as 50% of rare disease cases remain unsolved. The SOLVE-RD team believes that long-read SMRT Sequencing will be essential for discovering the causal elements that have proven elusive with previous approaches, and we anticipate that this research will ultimately make it easier for doctors to diagnose other patients with these rare diseases in the future,” said SOLVE-RD team member Alexander Hoischen, Associate Professor for Genomic Technologies and Immuno-Genomics at Radboud University Medical Center.

Hoischen gave examples of some recent discoveries in a number of conditions — from ALS and FTD, SCA10 and Parkinson’s disease, to Myotonic dystrophy, Bardet-Biedl syndrome, and Fragile X disorders — in a review paper in Frontiers in Genetics.

At PacBio, we are continually inspired by our users who focus on the study of rare diseases, and we are committed to supporting such research. Two recent SMRT Grants were awarded for projects devoted to advancing our understanding of spinocerebellar ataxia (Cleo van Diemen at the University Medical Center Groningen) and myotonic dystrophy (Stéphanie Tomé of the Centre de Recherche en Myologie at Sorbonne Université/INSERM in Paris).

We are also official supporters of Rare Disease Day, and we ask that you join us this Saturday by taking the opportunity to honor the entire rare disease community, including those who live with a disease and the many researchers striving to improve their situation. Participate on social media using #RareDiseaseDay, #ShowYourStripes, or #ShowYourRare. Or make it an IRL experience: find events to join in the U.S. or around the world to help show your support and raise awareness for those affected.

You can also support the Harting family’s efforts to purchase a van to transport Zoe more easily.

 


 


Friday, February 14, 2020

A Rose is a Rose: HiFi Reads Enable Sequencing of Complex Tetraploid Species

Photo of a rose, whose complicated genome is being decoded using PacBio HiFi long-read sequencing

Assembling the genomes of the tetraploid rose has been challenging, but PacBio HiFi reads are helping Dutch researchers overcome the hurdles.

The genome of the rose is almost as complicated as its connotations when given as a gift on Valentine’s Day or other special occasions.

Although the rose genome is relatively small, at 400-750 Mb across seven chromosomes, rose cells carry multiple sets of chromosomes beyond the basic set, and the number can vary widely between commercial varieties. Some are diploids, with two homologous copies of each chromosome (like humans, with one from the mother and one from the father), while others can have as many as five different sets (pentaploids). Most are tetraploids, with four sets of chromosomes.

To further complicate things, many roses are “segmental allotetraploids,” which means that part of the genome is behaving like an allotetraploid (with four chromosome sets from two distinct species, which occurs during hybridization) – and part of the genome is behaving like an autotetraploid (with four sets of homologous chromosomes).

Needless to say, parsing all of this out is challenging. But researchers from the Netherlands recently presented their solution, using HiFi reads generated by the Sequel II System.

In a workshop discussion at PAG XXVIII, Bart Nijland (@bart3601) of Genetwister Technologies (@genetwister), explained how his team set out to make a haplotype-aware assembly of Rosa x hybrida L. in order to capture its full range of genetic variation, rather than rely on more traditional assemblies which collapse the haplotypes into single sequences that could be missing critical information.

“For a highly heterozygous, highly complex, commercially important species like the rose, there is a huge benefit to making a haplotype-aware assembly,” Nijland said. “A lot of the existing technologies don’t perform very well in doing this. So we were very happy when PacBio released its HiFi protocol. Due to the high accuracy of the reads, we thought this could really help us in solving this challenge.”

The next challenge was isolating DNA from the leaf tissue of a tetraploid rose variety, which is notoriously difficult because of secondary metabolites. Once that was overcome and the sample was processed to create a HiFi SMRT library, speedy sequencing of four SMRT Cells 8M was performed on the Sequel II System at Radboud UMC. The result was more than two terabytes of raw polymerase data, with an average yield of more than 500 Gb per SMRT Cell.

“We did a k-mer analysis to investigate the heterozygosity of the sample. Due to the high accuracy of the reads, we could nicely see four distinct peaks, which you would expect in a heterozygous, tetraploid sample,” Nijland said. “And when mapping the HiFi reads, we could already distinguish four haplotypes. So we were very happy to see this.”

In order to get an even better picture of the variation between the diploid and tetraploid varieties, Nijland and colleagues, including Henri van de Geest (@geesthc) and Mark de Heer, performed a de novo assembly using FALCON and Canu.

“Our assembly is very much improved and we were able to separate many of the haplotypes,” Nijland said.

Short-read data mapped back to the Old Bush reference was unable to parse haplotypes, but HiFi data clearly showed four distinct haplotypes.

The next step is to improve the assemblies even further by using Bionano or Hi-C technologies, which Nijland hopes will help separate some of the alleles that were extremely similar because the rose is a segmental allotetraploid.

“We managed to assemble a heterozygous, polyploid genome, without the need for ultra high molecular weight DNA, which is required for a lot of other long-read sequencing,” Nijland said. “Also, the sequence coverage which is required in the assembly is lower, and because of the high accuracy, the computation of the assemblies is much less.”

“Most importantly, we’re getting a better representation and better overview of genomic content in the assembly. This provides a very valuable tool for molecular breeding efforts in rose.”

 

Catch up on other PAG presentations in a recent blog post and watch Nijland’s full PAG talk.


Wednesday, February 12, 2020

NARMS Scientists Track Antibiotic Resistance in Foodborne Bacteria Using SMRT Sequencing

Launched in 1996, NARMS is a U.S. public health surveillance system that tracks antimicrobial susceptibility of select foodborne enteric bacteria.

We hear a lot about the growing crisis of antibiotic resistance in human health, but it turns out this is just the most visible place it appears as it moves through our complex modern environment. For example, when intensive farming is used to feed large urban populations, antibiotic resistance can first emerge on farms and gain access to human communities through the food system.

One of the key groups on the front lines of monitoring antibiotic resistance from farm to fork in the United States is the National Antimicrobial Resistance Monitoring System (NARMS). NARMS was launched in 1996 as an interagency partnership among the USDA, the FDA, the CDC, and state and local health departments to protect public health by tracking changes in the antimicrobial susceptibility of bacteria in food animals, retail meat and ill people.

Nationwide, public health labs submit Salmonella, Campylobacter, Shigella, Escherichia coli O157, and Vibrio isolates from clinical specimens and outbreaks to the CDC for testing. In addition, 19 states collect samples of retail chicken, ground beef and pork chops every month for culturing, serotyping, antimicrobial susceptibility testing, and genome sequencing by the FDA. Finally, the USDA conducts similar tests on bacteria isolated from food animals at randomly sampled, nationally representative slaughter and processing plants throughout the country.

By combining information from all these sources, NARMS can detect emerging trends in resistance, understand the genetic mechanisms of resistance, link illnesses to specific sources or practices, educate consumers, and develop data-driven recommendations for improving antibiotic stewardship.

One of the tools NARMS uses for bacterial whole genome sequencing is PacBio long-read sequencing, prized for its ability to assemble not only chromosomes but also plasmids and other accessory genome elements that frequently carry drug resistance genes. Over the years, scientists at NARMS have used PacBio reference genomes to facilitate numerous comparative genomics analyses of Salmonella, Campylobacter and Enterococcus strains, examining how virulence and resistance to β-lactams, ciprofloxacin, linezolid and other families of antibiotics evolve at the molecular level.

Adding to this body of work, NARMS Director Patrick McDermott and collaborators recently reported applying SMRT Sequencing to 11 E. coli isolates collected from retail meats. One of the explicit goals of the study was to add more closed plasmids carrying quinolone resistance to their reference database, expanding our understanding of this emerging challenge to the treatment of Gram-negative infections. All the selected E. coli strains used in this study were resistant to ciprofloxacin and known to carry plasmid mediated quinolone resistance (PMQR) elements. The team generated “closed, circular chromosomes and plasmids from each isolate,” they write.

Figure 1: IncF plasmid containing multiple antimicrobial resistance genes, including qnrA1 (quinolone) and tetA (tetracycline). Resistance genes are depicted in red, with additional annotated genes depicted in blue.

One key finding of the study was that seven of the plasmids analyzed did not match any existing sequences in GenBank. The authors commented, “This demonstrates the importance of increased sequencing of plasmids even in well-studied bacteria such as E. coli, since completely new plasmids are still being discovered.” They also note that while the prevalence of PMQR genes in the US food supply is currently low and the E. coli strains from this study are unlikely to cause disease in humans, the identification of numerous novel plasmids suggests the potential for further spread to other strains or genera.

Furthermore, the authors emphasized that “This work shows the value of long-read sequencing in de novo characterization of [antimicrobial-resistant] plasmids.” More specifically, “Using only short-read sequencing data makes it difficult to accurately identify plasmids or fully characterize them.”

For example, closed plasmids are required to identify when multiple resistance genes or multiple copies of the same gene are co-located in one plasmid. This more complete information is important for uncovering the potential for co-selection of resistance. Another key finding from the paper is that while fluoroquinolones are not commonly used in food animal production, seven out of the 11 PMQR plasmids sequenced in this study also carried genes for resistance to tetracycline, “the highest selling antimicrobial for food animals in the United States.” Continued use of tetracycline in food animals could therefore drive co-selection for fluoroquinolone resistance in E. coli.

 

Learn more about the methods and workflow for bacterial whole genome sequencing.

 


Monday, February 10, 2020

PacBio Sequencing Contributes to New Japanese Reference Genome

People of Japanese descent just moved a little closer toward the promise of precision medicine thanks to a population-specific reference genome based on the de novo genome assembly of three Japanese individuals. A new preprint describing the work shows that SMRT Sequencing was instrumental in the achievement.

Scientists from Tohoku University, led by Jun Takayama (@jntkym), Kengo Kinoshita (@kk824), Masayuki Yamamoto, and Gen Tamiya, aimed to create an improved reference genome resource that would better represent the genetic background of a Japanese population than the current human reference genome. “Some ethnic ancestries are under-represented in the international human reference genome (e.g., GRCh37), especially Asian populations, due to a strong bias toward European and African ancestries in a single mosaic haploid genome consisting chiefly of a single donor,” they write.

To address that challenge, they sequenced the genomes of three Japanese individuals to more than 100-fold coverage with PacBio SMRT Sequencing. The contig N50 value for each genome was approximately 20 Mb. Bionano optical maps were used to perform hybrid scaffolding to boost contiguity even further. “These and other assembly statistics were better than or comparable to other published de novo assemblies,” the authors report.

 

Figure 1a from bioRxiv preprint – Jun Takayama et al.

Fig 1a. Construction of JG1: PCA plot showing that the three sample donors are within the Japanese population cluster.

 

Next, the team had to merge all three of these assemblies to “construct a reference-quality haploid genome sequence,” they write. “We integrated the genomes using the major allele for consensus, and anchored the scaffolds using sequence-tagged site markers from conventional genetic and radiation hybrid maps to reconstruct each chromosome sequence.” The meta-assembly was designed to avoid the inclusion of rare variants and unresolved sequences for broadest possible applicability.

Takayama et al. validated the utility of this new reference genome — known as JG1 — by analyzing its representation of common variants among Japanese people and its ability to home in on causal variants for rare disease from seven Japanese families. In all cases, the population-specific reference performed at least as well as or better than other assemblies in detecting relevant variation; for example, in the rare disease case, JG1 reduced the number of false-positive variant calls from an exome analysis.

JG1 “is highly contiguous, accurate, and carries the major allele in the majority of single nucleotide variant sites for a Japanese population,” the scientists report. “We expect that population-specific reference genome such as JG1 will prove to be practical and beneficial options for genome analyses of individuals originated from the population.”

 

PacBio long-read sequencing is being used to develop population-specific reference genomes as part of several international research efforts. Learn more about these projects and explore detailed assembly information in our interactive map.

 


Tuesday, February 4, 2020

‘Pathway for Discovery’: SMRT Grant Winner Aims to Address the Mysteries of Autism with HiFi Sequencing

Tychele Turner, Assistant Professor, Washington University in St. Louis School of Medicine

We are pleased to announce the winner of the 2019 Human Genetics SMRT Grant: Tychele Turner, an assistant professor who recently joined the Washington University in St. Louis School of Medicine.

Turner’s research focuses on neurodevelopmental disorders, particularly on finding answers to unsolved cases. Her project aims to sequence members of a family affected with autism, using long reads and the high accuracy of HiFi sequencing to try to identify a causal genetic variant. We spoke with her to learn more about this winning proposal.

Q: How did you get involved in studying neurodevelopmental disorders?

A: My interest in neurodevelopmental disorders goes back to my graduate school days. I worked in Aravinda Chakravarti’s lab, where I focused on studying autism, especially in families with multiple affected girls. In autism, there is a sex bias; about 80% of all cases are male. When you have a female with autism, that’s pretty rare, and when you have multiple affected females in a family, that’s even rarer. The prevailing thought is that it might just take a more severe mutation for a girl to become affected with autism. That’s where I started my research career.

Then I moved to Evan Eichler’s lab for my postdoc, where I did large-scale assessment of children with neurodevelopmental disorders — looking at thousands to tens of thousands of individuals using microarrays, whole-exome sequencing, and short-read whole-genome sequencing. We were able to find a lot of new genetic components, particularly from de novo mutations.

Q: Why is your lab focused on uncovering new genetic components associated with autism?

A: We can only explain about 30% of all cases today. It seems low because the heritability of autism is high, so we think there is more to discover. To find these things we haven’t been able to see before, I think part of the issue may be technology. That’s why I was really excited when the opportunity came for the PacBio grant. That’s just the kind of thing we might need to find the variation we can’t explore with the older technologies.

Q: How is the research community trying to get answers for the remaining 70% of cases?

A: One approach is adding more samples. As we sequence more and more people, we’re able to find more and more of those genes with statistical significance. I think the future is very bright on that front. But the other approach is using new technologies to find the types of variation that we’ve missed. We could implicate new genes and also go back to known genes and identify new mutations. I would call this completing the allelic series within the genes that reach significance. If we can get to that point, we can be very clear about all the contributing elements. It’s not fun to be limited by your technology.

Q: Why do you think HiFi sequencing could make a difference?

A: I think it’s really important because it will allow us to find structural variants — such as small deletions and duplications — that we never can see otherwise. I’ve worked a lot with whole genome sequencing from short-read data, but we’re limited with that technology. We can detect really big structural variants. If someone has a deletion that’s a megabase, we will see it. But if that person has a deletion that removes one exon of a gene, and that deletion is 200 base pairs, we have a really hard time finding that in our data. And if we do find it, we have a hard time pulling it out from the noise.

But with PacBio’s long reads, a 200 base pair deletion is no problem because you’ll see it within the actual read. You just map it to the genome, and you have your answer. That’s what I’m really excited about. It also lets you get into the GC-rich regions of the genome, which is important for repeat expansions like the one associated with fragile X syndrome.

Q: How do you foresee PacBio sequencing helping the family you are working with?

A: I’m working with John Constantino in the autism clinic here at WashU — he did a deep clinical workup on this family, which has two girls with autism who have a fairly severe phenotype. They have previously been tested with arrays and exome sequencing and so far, there have been no answers. We think the reason we haven’t been able to find a genetic event yet is probably a technology issue. Our plan is to do PacBio sequencing with the SMRT Grant and also to generate some data with complementary technologies. We’re going to go all in for this family.

Q: In your proposal, you described this approach as a “pathway for discovery.” What did you mean by that?

A: As a new lab, this is really exciting because we’re going to have the first opportunity to look at this kind of data and we think that it’s going to be important to use for other families in the future. In addition to getting an answer for this family, we can use it as a platform to show people how to solve these cases. I’m really interested in going back to the whole collection of families with autism where we don’t have an answer and figure out what’s happening. Some percentage of cases should be explained with this new approach. Having this grant will help us to do that.

 

We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the HudsonAlpha Genome Sequencing Center, for supporting the 2019 Human Genetics SMRT Grant Program. Explore the 2020 SMRT Grant Programs to apply to have your project funded.

Learn more about variant detection.


Thursday, January 30, 2020

At PAG 2020, HiFi Data ‘Transformational’ for Advancing Plant and Animal Research

What better way to start the year than a gathering of thousands of stellar scientists? We were excited, once again, to attend the Plant and Animal Genome (PAG) Conference in sunny San Diego and to showcase some of the achievements of our customers at our well-attended workshop.

For those who missed it – or just want to relive the excitement – here is an overview, and recordings of the presentations.

The workshop kicked off with our CSO Jonas Korlach looking back at the evolution of SMRT Sequencing over the last decade, and concluded with an update on the latest PacBio developments, including reduced analysis time with HiFi Reads and an ultra-low DNA input protocol, by Michelle Vierra (@the_mvierra), strategic marketing manager for plant and animal sciences.

 

Watch Korlach’s introductory remarks:

Watch Vierra’s full workshop talk, PacBio Update on Products and HiFi Applications

 

 

Expanding the Tree of Life

Graph showing 44 new lepidoptera PacBio genome assemblies from the Sanger Institute

The Sanger Institute’s Darwin Tree of Life Project has already created 44 new moth and butterfly (lepidoptera) genome assemblies using PacBio sequencing.

First up at the PAG XXVIII PacBio workshop was Mark Blaxter (@blaxterlab), project lead for the Sanger Institute’s Darwin Tree of Life – a position he described as his ‘dream job’. The project aims to sequence all 60,000 species believed to be on the British Isles over the next 12 years, starting with species representing 4,000 families.

“After that, we’ll move on to the genera and after that we’ll do the rest,” Blaxter said. “This requires us ramping up to do 5,000 genomes a year. If you divide that by the number of working days in a year, that’s 20 genomes a day. That’s five before coffee, another five before… It’s terrifying. But I think actually we can get there.”

The Sanger team has already generated data for 94 species, including 44 new moth and butterfly (Lepidoptera) PacBio assemblies. Combined with HiC data, they have been able to generate chromosomal, telomere-to-telomere assemblies from the HiFi reads.

“Having spent years sequencing other butterflies, this is truly transformational,” Blaxter said. “We hope we can spread this across the whole of the Tree of Life.”

Watch Blaxter’s full workshop talk: Endless Forms – Genomes from the Darwin Tree of Life Project

 

The Fungus Among… Plants

Petri dish cultures of fungi that live in plants

The diversity of fungal symbionts that live within plants, as presented by University of Arizona researcher Jana U’Ren at PAG.

In a talk that might just inspire you to take on mycology, Jana U’Ren (@you_wren) of the University of Arizona discussed the fungi that live inside of plants and her studies of their biology and evolution.

U’Ren’s studies focus on symbiotic fungi found in the photosynthetic tissue of plant leaves. A single leaf can harbor dozens to hundreds of species of fungi. They live asymptomatically within their host species, and are grouped together functionally as endophytes.

Prior sequencing efforts of these endophytes were limited to ~300 base pair fragments of hypervariable regions. While that was useful for community analysis to understand where the species overlapped, it couldn’t be used for phylogenetic analyses.

“What we’re trying to do now is to answer both Where and Who these (fungi) are using PacBio sequencing.”

At the Arizona Genomics Institute, U’Ren sequenced ribosomal DNA amplicons from 25 different species of plants from boreal regions, resulting in more than a million high-quality reads: a treasure trove of data that she is currently sifting through.

“It was a beautiful dataset. We have very high-quality data that has been validated with all the culturing that we’ve been doing over the last 12 years,” U’Ren said. “We recovered a high richness of fungal OTU (operational taxonomic unit), which was what we were looking for, and what we found was this higher phylogenetic diversity.”

 

Watch U’Ren’s full workshop talk: Phylogenetic Insights into the Endophyte Symbiosis using PacBio Ribosomal DNA Sequencing

 

 

A Rose is a Rose

Photo of a rose, whose complicated genome is being decoded using PacBio HiFi long-read sequencing

Assembling the genomes of the tetraploid rose has been challenging, but PacBio HiFi reads are helping Dutch researchers overcome the hurdles.

The genome of the rose is almost as complicated as its connotations when given as a gift on Valentine’s Day or other special occasions.

Many roses are “segmental allotetraploids,” which means that part of the genome is behaving like an allotetraploid (with four chromosome sets from two distinct species, which occurs during hybridization) – and part of the genome is behaving like an autotetraploid (with four sets of homologous chromosomes).

Needless to say, parsing all of this out is challenging. Bart Nijland (@bart3601) of Genetwister Technologies explained how his team set out to make a haplotype-aware assembly of Rosa x hybrida L. in order to capture its full range of genetic variation, rather than rely on more traditional assemblies which collapse the haplotypes into single sequences that could be missing critical information.

“A lot of the existing technologies don’t perform very well in doing this. So we were very happy when PacBio released its HiFi protocol. Due to the high accuracy of the reads, we thought this could really help us in solving this challenge,” Nijland said.

A k-mer analysis of their sequenced samples revealed four distinct peaks, exactly what they were expecting in their heterozygous, tetraploid samples. Further de novo assembly of diploid and tetraploid varieties by Nijland and colleagues, including Henri van de Geest (@geesthc) and Mark de Heer, provided an even better picture of the variation between them.
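As a rough illustration of the kind of k-mer analysis described above, the sketch below counts k-mers across a set of reads and tabulates how many distinct k-mers occur at each multiplicity. In a heterozygous tetraploid, k-mers carried by one, two, three, or four of the haplotype copies would be expected to pile up at roughly 1x, 2x, 3x, and 4x the per-haplotype coverage, which is where the four peaks come from. The function names, the toy reads, and the value of k are assumptions made for illustration; this is not the pipeline Nijland's team used, and real tools also merge each k-mer with its reverse complement.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

# Hypothetical toy reads: two identical copies plus one with a single variant.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
spectrum = kmer_spectrum(count_kmers(reads, k=4))

for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

On real data the same histogram would be plotted rather than printed, and the peaks read off by eye or fitted with a model.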

“This provides a very valuable tool for molecular breeding efforts in rose,” Nijland said.

 

Watch Nijland’s full workshop talk: The Impact of Highly Accurate PacBio Sequence Data on the Assembly of a Tetraploid Rose

 

Going Ape Over Iso-Seq Analysis

Graphic representation of the completeness of several great ape genomes, which are becoming clearer thanks to PacBio long-read DNA and RNA sequencing

The full genetic picture of great apes is becoming clearer thanks to PacBio sequencing data, according to Zev Kronenberg.

The work of Zev Kronenberg (@zevkronenberg) and team made headlines, and the cover of Science, when they reported a high-resolution comparative analysis of great ape genomes. During the workshop, he shared how transcriptome analysis via the Iso-Seq method led to further discoveries.

Kronenberg, then a post-doc in the lab of Evan Eichler and now a senior bioinformatics engineer at PacBio, used PacBio’s RNA sequencing method, Iso-Seq, to annotate the great ape genomes his team created, detangle several complicated loci, and enrich our biological understanding of the differences between us and our closest relatives.

“We spent a lot of effort not only ensuring that the genome assembly turned out well, but we made sure that the de novo genome annotation was done correctly and that we were able to trust the genes that we find.”

Mapping the transcriptome data of the great apes against human transcriptome data, Kronenberg and his colleagues looked for areas where they differed. He showed several examples, including a human-specific 60 kb intronic deletion that, with a bit of digging, a graduate student was able to associate with a region linked in other studies to human diet.

“Without the Iso-Seq data, that probably would have been the end of the story. But with Iso-Seq data, we were able to identify how this non-coding variant could potentially have a phenotypic effect.”

The project proved the power of combining genome and transcriptome data, Kronenberg said.

“No story is really complete with just a genome. The Iso-Seq data was absolutely central to us discovering really interesting biological candidates.”

The higher capacity and speed of the Sequel II System has made even more possible, Kronenberg added.

“We sequenced I think well over 100 SMRT Cells for only the Iso-Seq data. Today, a single SMRT Cell would do that whole project. And that, to me, is mind blowing.”

 

Watch Kronenberg’s full workshop talk: Characterizing Genetic Differences between Great Apes using Iso-Seq Data

 

HiFi Data Assemble!

Staff Scientist Elizabeth Tseng presenting her poster at PAG.

Several other presentations throughout the conference demonstrated how highly accurate HiFi reads on the Sequel II System are improving results, including “HiCanu: Resolving repeats and haplotypes” by Sergey Koren (@sergekoren) of NHGRI, slides of which are available here. In addition to HiCanu, two other genome assemblers built for HiFi data also made their debut: Nighthawk from Zev Kronenberg and Hifiasm from Heng Li (@lh3lh3).

Our four poster presentations from the PAG conference are available to view:

Learn more about whole genome sequencing, Iso-Seq analysis, and HiFi reads for plant and animal research.

 


Monday, January 27, 2020

When Snakes Strike: SMRT Sequencing Reveals Hidden “Venom-ome”

The team from AgriGenome and MedGenome helped assemble the genome and transcriptome of the lethal Indian Cobra (Naja naja) using PacBio long-read sequencing

Snake milking, horse blood harvesting and brewing — antivenom production is still more medieval art than modern science. But a new high-quality snake genome may finally pull it into the 21st century.

As recently reported in Nature Genetics, a team of scientists led by Somasekar Seshagiri, a former staff scientist at Genentech and now president of the nonprofit SciGenom Research Foundation (@SGRF_Science) in India, assembled the genome and transcriptome of the lethal Indian Cobra (Naja naja) using PacBio long-read sequencing and other genomic technologies.

They also created a “venom-ome,” a catalog of venom-gland-specific toxin genes they hope can be used for the development of synthetic antivenom of defined composition using recombinant technologies.

The new cobra genome is one of only a few snake genomes ever published. Previous assemblies were generated primarily using short-read sequencing, resulting in highly fragmented assemblies, “thus limiting their utility for creating a complete catalog of venom-relevant toxin genes,” the authors noted. Compared with the king cobra genome, the Indian cobra genome contains far fewer scaffolds (1,897 versus 296,399), and 929-fold better contiguity.

“This high-quality genome allowed us to study various aspects of snake venom biology, including venom gene genomic organization, genetic variability, evolution and expression of key venom genes,” the authors wrote.

A team of scientists from AgriGenome (@agrigenome), a PacBio certified service provider, was instrumental in generating long-read PacBio whole genome and venom gland Iso-Seq data. Their bioinformatics team helped build a functional annotation pipeline that leveraged 101,761 Iso-Seq transcript isoforms to identify and correctly annotate 139 toxin genes out of the 12,346 genes expressed in the venom gland, the ‘venom-ome’. Of the 139 toxin genes, 19 were expressed primarily in the venom gland.

Targeting these core toxins — which are responsible for a wide range of symptoms in humans, including heart-function problems, paralysis, nausea, blurred vision, internal bleeding and 100,000 deaths per year worldwide — could lead to the development of a safe and effective humanized antivenom, as well as drugs to treat hypertension, pain and other disorders, the authors suggest.

“The genome and the associated predicted proteome will also serve as a powerful platform for evolutionary studies of venomous organisms,” the authors wrote.

 

Learn more about the methods and workflow for PacBio whole genome sequencing.


Thursday, January 23, 2020

Project to Rapidly Sequence Maize Pangenome Delivers Publicly Available Resource

Matt Hufford, associate professor at Iowa State University, helped produce a 26-line maize pangenome assembly collection

Maize researchers have been rejoicing over a New Year’s gift delivered by a group of 33 scientists: A 26-line “pangenome” reference collection.

The multi-institutional consortium of researchers used the Sequel System and Bionano Genomics optical mapping to create the assemblies and high-confidence annotations. They released the results on January 9 and in several presentations at the Plant and Animal Genome XXVIII Conference, less than two years after the ambitious project was funded by a $2.8 million National Science Foundation grant.

The collection includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines — the most extensively researched maize lines that represent a broad cross section of modern maize diversity — as well as an additional line containing abnormal chromosome 10.

Scientists can download the project’s raw whole genome sequencing data, RNA sequencing data, optical map data, gene annotations and gene models at MaizeGDB. The site also features browsing and data visualization tools.

Led by faculty investigators R. Kelly Dawe (@corncolors), Distinguished Research Professor at the University of Georgia, Matt Hufford (@mbhufford), associate professor at Iowa State University, and Doreen Ware, a computational biologist at USDA and Cold Spring Harbor Laboratory, the NAM Consortium also included scientists from Corteva Agriscience, who are conducting their own large-scale sequencing effort of the company’s maize lines.

“People have been using these particular lines for years, so everybody has been really excited to get these new references as a resource,” Hufford said. “The assemblies that have come out are better than anything else that’s out in maize.”

Maize has been extremely challenging to sequence because the vast majority of its 2.3 Gb genome — a staggering 85 percent — is made up of highly repetitive transposable elements. It is also amazingly diverse. A study comparing genome segments associated with kernel color from two inbred lines revealed that 12 percent of the gene content was not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit more than 98 percent sequence similarity.

The 26 varieties were prepped at the Arizona Genomics Institute, sequenced at the University of Georgia, Oregon State University, and Brigham Young University, and assembled by the NAM Consortium using PacBio long reads. Scaffolds were validated by Bionano optical mapping, and ordered and oriented using linkage and pan-genome marker data. RNA-seq data from multiple tissues were used to annotate each genome using a pipeline that included BRAKER, Mikado and PASA.

“We spent a lot of time on gene model annotation, validation and benchmarking against B73 (the first reference genome annotations for maize, created by Ware’s lab in 2009, and updated in 2017) and other maize genes that have been manually curated by the community,” Hufford said.

Now comes the fun part: Peering into all the data and seeing what secrets it will reveal.

“For the last few months, we have started to see the cool biology emerging,” Hufford said. “What we are seeing is a lot of structural variation linked to phenotypic traits we haven’t been able to explain before.”

In addition to answering questions about basic biology and agronomic variation, the data is shedding light on the evolution of the different maize lines.

“We’re learning about the tempo of gene loss following a genome doubling event several million years ago. It appears to be ongoing, and still in flux,” Hufford said.

Next steps for the consortium include additional functional annotations for the NAM gene models, such as transposable elements, SNPs and insertions, as well as methylome and ATAC-Seq data.

“These data will help the maize community assess the role of variation in the determination of agronomic traits,” Hufford said.

Hufford will also be using SMRT Sequencing on the Sequel II System for two other large assembly projects: one for teosinte, a wild relative of maize, and one for other grass species.

“I think it’s really going to help with some of these complex varieties,” he said.

 

Learn more about the methods and workflow for PacBio whole genome sequencing.

 


Monday, January 13, 2020

Direct Phased Genome Assembly Using Nighthawk on HiFi Reads

By Zev Kronenberg, Senior Engineer of Bioinformatics at PacBio

Since the introduction of HiFi reads, the community has embraced these long and highly accurate reads for human genome assembly and paralog resolution [1-5]. At PacBio, the assembly team (Figure 1) is working to build on the accuracy of HiFi data for direct phasing during assembly.

Figure 1. The PacBio assembly team. From left to right, James Drake, Zev Kronenberg (@ZevKronenberg), Derek Barnett (@DerekWBarnett), Chris Dunn, and Ivan Sović (@IvanSovic)

In diploid organisms, phasing an assembly means separating the maternally and paternally inherited copies of each chromosome, known as haplotypes. Each phased contig, or haplotig, is made up of reads from the same parental chromosome (Figure 2). Phased genomes give better quality than collapsed genomes; they provide allelic information, which can be important for studying human diseases, crop improvement, evolution, and more.

Figure 2. Phased de novo assembly. A collapsed haploid assembly meshes contigs from different haplotypes (unphased assembly), while a partially phased assembly may still switch between the two haplotypes in its primary contigs. A fully phased assembly would cleanly separate the two haplotigs.

 

FALCON-Unzip is a diploid-aware genome assembler that has been used to assemble and phase many PacBio genomes [6]. It first creates a collapsed assembly, then uses heterozygous single nucleotide variants to partition the reads by haplotype and reassembles them into haplotigs. The assembly outputs are primary contigs with associated haplotigs (Figure 3).

Figure 3. FALCON-Unzip phasing and haplotig assembly steps. In the first stage, primary contigs and associated contigs are produced, and reads are aligned to the primary contigs and phased. The phase is then re-introduced to the assembly graph, followed by re-assembly.
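To make the idea of partitioning reads by heterozygous SNVs concrete, here is a minimal sketch. It assumes each read has already been reduced to the alleles it shows at biallelic heterozygous sites (0 for the reference allele, 1 for the alternate), and it assigns reads greedily to one of two haplotypes. The function name `partition_reads`, the input layout, and the greedy strategy are illustrative assumptions, not FALCON-Unzip's actual algorithm or API.

```python
def partition_reads(read_alleles):
    """Greedily split reads into haplotype 0 or 1.

    read_alleles maps read id -> {site position: allele (0 or 1)}.
    Reads are processed in the order given; hap0 records the allele that
    haplotype 0 carries at each site seen so far (haplotype 1 carries
    the other allele, since sites are assumed biallelic and heterozygous).
    """
    hap0 = {}
    assignment = {}
    for read_id, alleles in read_alleles.items():
        agree = sum(1 for s, a in alleles.items() if s in hap0 and hap0[s] == a)
        disagree = sum(1 for s, a in alleles.items() if s in hap0 and hap0[s] != a)
        hap = 0 if agree >= disagree else 1
        assignment[read_id] = hap
        # Extend the haplotype-0 allele map with any sites this read adds.
        for s, a in alleles.items():
            hap0.setdefault(s, a if hap == 0 else 1 - a)
    return assignment

# Hypothetical reads covering four heterozygous sites.
read_alleles = {
    "r1": {100: 0, 200: 1},
    "r2": {200: 0, 300: 1},
    "r3": {100: 0, 200: 1, 300: 0},
    "r4": {300: 1, 400: 0},
}
assignment = partition_reads(read_alleles)
print(assignment)
```

The haplotype labels are arbitrary (the first read simply seeds haplotype 0), and a greedy pass like this has no way to recover from an early mistake, which is one reason production phasing tools use more robust formulations.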

 

While FALCON-Unzip has consistently given our users excellent results, it was built for long reads with higher error rates and does not take advantage of the high accuracy of the HiFi reads. In 2019, FALCON-Unzip was adapted for HiFi data, producing high-quality results [7]. However, the current implementation still requires iterative assembly, and does not use indels for phasing. Therefore, we have started working on a new graph cleaner called Nighthawk that simplifies the assembly graph by removing cross-haplotype alignment overlaps, which can significantly speed up and improve assembly. While still a work in progress, the preliminary results are promising.

Nighthawk: A smart, efficient assembly graph cleaner

Nighthawk uses that classical bioinformatics data structure, the De Bruijn graph, to identify genetic variants (substitutions, insertions, and deletions) and remove cross-haplotype overlaps in the assembly string graph.

Most long-read genome assemblers follow the overlap-layout-consensus (OLC) workflow. The overlap stage begins with a pairwise alignment of all reads (Figure 4A). For each read, a pile of alignments to all other reads is generated. The goal of Nighthawk is to detect and remove cross-haplotype overlaps, that is, alignments between reads that come from different haplotypes. It also needs to remove other false alignments that come from paralogs, repeats, etc.

Given a pile of reads, Nighthawk builds a read-colored k-mer De Bruijn graph [8], where each node represents a k-mer; node colors denote a unique set of reads (Figure 4B). For each read overlap, Nighthawk calculates a read similarity score (RSS). The RSS is the number of shared variants between two reads. A positive RSS indicates that the reads are in phase with one another, while a negative RSS suggests the read overlap is cross-haplotype and should be removed (Figure 4C). Nighthawk removes overlaps with a negative RSS. The remaining overlaps are then passed on for the layout and consensus stage of assembly (Figure 4D).
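The sketch below shows one way a read-colored k-mer graph and a signed similarity score could fit together. It is a toy under stated assumptions, not Nighthawk's implementation: the function names are hypothetical, all reads are assumed to span the same window, k is tiny (the Drosophila graph in Figure 5 uses k=23), and the scoring rule (plus one for each variant k-mer both reads carry, minus one for each carried by only one of them) is an illustrative stand-in for the real RSS.

```python
from collections import defaultdict

def build_colored_graph(reads, k):
    """Return (colors, edges) for a pile of reads.

    colors maps each k-mer to the set of read ids that contain it.
    edges maps each k-mer to the set of k-mers that directly follow it.
    """
    colors = defaultdict(set)
    edges = defaultdict(set)
    for read_id, seq in reads.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            colors[kmer].add(read_id)
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
    return colors, edges

def bubble_heads(edges):
    """k-mers with more than one successor, where a bubble opens."""
    return {kmer for kmer, nxt in edges.items() if len(nxt) > 1}

def toy_rss(colors, read_a, read_b, all_reads):
    """Toy signed similarity score (an assumption, not Nighthawk's formula)."""
    score = 0
    for carriers in colors.values():
        if carriers == all_reads:
            continue  # present in every read: carries no phase signal
        has_a, has_b = read_a in carriers, read_b in carriers
        if has_a and has_b:
            score += 1
        elif has_a or has_b:
            score -= 1
    return score

# Hypothetical pile: r1 and r2 share a haplotype, r3 differs by one base.
reads = {"r1": "ACGTACGGA", "r2": "ACGTACGGA", "r3": "ACGTTCGGA"}
colors, edges = build_colored_graph(reads, k=4)
heads = bubble_heads(edges)
in_phase = toy_rss(colors, "r1", "r2", set(reads))
cross_phase = toy_rss(colors, "r1", "r3", set(reads))
print(heads, in_phase, cross_phase)
```

In this toy pile the single-base difference opens one bubble, the same-haplotype pair scores positive, and the cross-haplotype pair scores negative, so only the latter overlap would be dropped before layout.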

It is amazing to see how clean a HiFi-based De Bruijn graph is (Figure 5). This is often a work of art in itself! After running Nighthawk, the overlaps can then be passed into string graph assemblers such as FALCON for assembly.

Figure 4. The Nighthawk workflow. Nighthawk builds a colored De Bruijn graph from read overlaps. Overlaps are scored by shared variants between two reads. Overlaps with negative RSS indicate cross-phase overlaps and are removed. The resulting overlaps are passed to a string graph assembler (such as FALCON) for phased assembly.

 

Figure 5. A HiFi De Bruijn graph for a pile of reads from Drosophila genome sequencing. Each dot represents a k-mer (k=23), the edges denote neighboring k-mers. The larger red dots mark the head of heterozygous bubbles.

 

Testing Nighthawk on a HiFi data set

We evaluated how well Nighthawk’s RSS could distinguish in-phase and cross-phase overlaps against three ground truth sets (Table 1). In all three data sets, Nighthawk’s RSS was able to distinguish in-phase read overlaps (true positives) from cross-phase read overlaps (true negatives) while having very few false positives and false negatives.

But what effect does Nighthawk’s graph cleaning have on the assembled genome? Our team patched Nighthawk into FALCON and assembled a heterozygous (0.6%) F1 Drosophila HiFi data set. The haploid genome size is 140 Mb, so a perfectly assembled diploid genome would consist of 280 Mb in total across primary and associated contigs.

Our Nighthawk-FALCON assembly produced 247.1 Mb of primary contigs and 14.9 Mb of associated contigs, creating a diploid genome that’s a total of 262 Mb (93.9%). The phasing accuracy, as measured by parental k-mers, was much better using Nighthawk for both primary and associated contigs compared to other methods.

 

Toward a truly phased assembly

We have shown that HiFi data alone can be used to effectively phase a Drosophila genome. Our new tool, Nighthawk, is an assembly graph cleaner that uses the accuracy of HiFi reads for variation detection. The phasing of the primary and associated contigs improves compared to FALCON when Nighthawk is used to filter out cross-phase alignment overlaps.

Nighthawk is still a work in progress, and many challenges remain. One such challenge is the use of alignment identity as a filter to identify cross-phase overlaps. Setting the right identity threshold is a Goldilocks problem: a filter that’s too stringent would fragment the assembly, while a filter that’s too relaxed would not remove all the false overlaps. Another challenge is complex graph structures that may arise from repeat structures, homozygosity, lack of overlap coverage, etc.
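The threshold trade-off can be seen with a few made-up numbers. The sketch below filters a hypothetical set of overlaps by alignment identity and reports, for each threshold, how many true (same-haplotype) overlaps are lost and how many false (cross-haplotype) overlaps survive. The overlap records, field names, and identity values are invented for illustration and do not come from Nighthawk or from the data sets described here.

```python
def filter_overlaps(overlaps, min_identity):
    """Keep overlaps whose alignment identity is at least min_identity."""
    return [o for o in overlaps if o["identity"] >= min_identity]

def summarize(overlaps, min_identity):
    """Count true overlaps lost and false overlaps kept at a threshold."""
    kept = filter_overlaps(overlaps, min_identity)
    true_total = sum(1 for o in overlaps if o["same_haplotype"])
    true_kept = sum(1 for o in kept if o["same_haplotype"])
    return {
        "true_lost": true_total - true_kept,
        "false_kept": len(kept) - true_kept,
    }

# Hypothetical overlaps with known ground truth.
overlaps = [
    {"identity": 0.9999, "same_haplotype": True},
    {"identity": 0.9992, "same_haplotype": True},
    {"identity": 0.9985, "same_haplotype": True},   # in phase, but with read errors
    {"identity": 0.9990, "same_haplotype": False},  # cross-phase, few variants in window
    {"identity": 0.9940, "same_haplotype": False},
]

relaxed = summarize(overlaps, 0.9900)
middle = summarize(overlaps, 0.9988)
strict = summarize(overlaps, 0.9995)
print(relaxed, middle, strict)
```

With these toy values, no single cutoff keeps every true overlap while removing every false one, which is the Goldilocks problem in miniature.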

Nighthawk is only the first piece in the overlap-layout-consensus assembly process. Our team is continuing to modify string-graph algorithms to recognize the graph structures Nighthawk generates. We are excited about the new possibilities HiFi data brings and believe that fast, direct phased assemblies will be feasible in the not-too-distant future.

Acknowledgments

The PacBio assembly team would like to thank Tobias Marschall (@tobiasmarschal) for the inspiration to use De Bruijn graphs for variant calling (NCBI Hackathon 2019) and Mark Chaisson (@mjpchaisson) for technical guidance on avoiding common pitfalls.

References

[1] Wenger et al., “Accurate Circular Consensus Long-Read Sequencing Improves Variant Detection and Assembly of a Human Genome”, Nature Biotechnology (2019)

[2] Vollger et al., “Improved Assembly and Variant Detection of a Haploid Human Genome Using Single-Molecule, High-Fidelity Long Reads”, Annals of Human Genetics (2019)

[3] Vollger et al., “Long-Read Sequence and Assembly of Segmental Duplications”, Nature Methods (2019)

[4] Garg et al., “Efficient Chromosome-Scale Haplotype-Resolved Assembly of Human Genomes”, bioRxiv (2019)

[5] Porubsky et al., “A Fully Phased Accurate Assembly of an Individual Human Genome”, bioRxiv (2019)

[6] Chin et al., “Phased Diploid Genome Assembly with Single-Molecule Real-Time Sequencing”, Nature Methods (2016)

[7] Kronenberg et al., “High-quality Human Genomes Achieved through HiFi Sequence Data and FALCON-Unzip Assembly”, ASHG Poster (2019)

[8] Garg et al., “A Graph-Based Approach to Diploid Genome Assembly”, Bioinformatics (2018)

[9] Patterson et al., “WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads.” In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014)

[10] Koren et al., “De Novo Assembly of Haplotype-Resolved Genomes with Trio Binning”, Nature Biotechnology (2018)


Wednesday, January 8, 2020

SMRT Grant Winner: Hunting for Answers in Spinocerebellar Ataxia

Cleo van Diemen, University Medical Center Groningen

A hearty congratulations to Cleo van Diemen at the University Medical Center Groningen for winning the 2019 Neuroscience SMRT Grant!

Van Diemen’s impressive proposal involves using PacBio long-read sequencing to find new genetic mechanisms associated with spinocerebellar ataxia (SCA). While some 70% of SCA patients can get clear diagnostic and prognostic information because they carry a variant in one of the ~37 genes known to be associated with this condition, 30% of patients have no such clarity. In this project, van Diemen and her colleagues will use their SMRT Grant award to generate highly accurate long reads for two SCA patients with unknown disease etiology.

As team leader of the research & development unit of the genome diagnostics section of the genetics department, van Diemen aims to introduce new technologies to help her colleagues achieve their research and diagnostic goals. In this case, she is working with a scientist focused on SCA patients to find a way to diagnose previously unsolvable cases.

So far, existing approaches have included standard linkage analysis, SNP arrays to look for some known structural variants, exome sequencing, and gene expression analysis. Now, van Diemen hopes that adding structural variant detection with SMRT Sequencing will provide some new answers. Repeat expansions are among the possible culprits. “Repeat genes have been identified in a lot of ataxias,” van Diemen says. With SMRT Sequencing, it will finally be possible “to do this genome-wide approach for new repeat genes.”

Structural variation is another potential source of causal mechanisms for the unexplained SCA cases. “There is some evidence that structural variants may play a role in ataxias,” van Diemen says. But SNP arrays lack the ability to discover new variants or to detect complex situations, such as inversions. And short-read sequencing often misses these large elements. “With long-read sequencing, it’s easier to identify them,” she adds.


Ultimately, the goal is to give all SCA patients the DNA-based information that will help them manage their condition. “There are some differences in the phenotypic spectrum, so knowing the genetic basis can help patients understand what they will face in the future and also makes it possible to consider genetic testing for family counseling,” van Diemen says. “That’s the clinical importance of having a genetic diagnosis.”

This SMRT Grant represents van Diemen and her team’s first use of PacBio sequencing. She believes it will be “a good starting point” that will help them understand how to apply long-read sequencing for larger-scale studies in the future. “We are looking forward to it,” she says. “It’s a great opportunity.”

 

We’re excited to support this research and look forward to seeing the results. Thank you to our co-sponsor and Certified Service Provider, the Center for Genomic Research at the University of Liverpool, for supporting the 2019 Neuroscience SMRT Grant Program.

Learn more about upcoming SMRT Grant Programs for a chance to win free sequencing.


Friday, January 3, 2020

Novel Workflow Produces Fully Phased Human Genome Assemblies Without Trio Sequencing

A new preprint from lead authors David Porubsky and Peter Ebert, senior authors Evan Eichler and Tobias Marschall (@tobiasmarschal), and collaborators reports a method for generating fully phased, de novo human genome assemblies without parental data. The approach combines PacBio HiFi reads (>99% accuracy, 10-20 kb) with the short-read, single-cell Strand-seq technique. The authors provide a proof of principle by assembling the genome of a Puerto Rican female from the 1000 Genomes Project.

The work extends a recent publication from many of the same authors in which HiFi reads were used to produce an accurate and contiguous assembly of the human haploid genome, CHM13. To help assemble a phased diploid genome, the newer work adds Strand-seq, “a single-cell sequencing method able to preserve structural contiguity of individual homologs in every single cell.” The authors used Strand-seq to group HiFi reads by chromosome, order and orient contigs, and phase variants over long genomic distances. “Taken together, these features make Strand-seq the method of choice to be combined with high-accuracy long-read sequencing platforms to physically phase and assemble diploid genomes.”

The team generated 33.4-fold HiFi read coverage of the selected sample using the Sequel II System. They called single nucleotide variants in the HiFi reads with DeepVariant and phased variants using Strand-seq and HiFi reads. That “resulted in chromosome-length haplotypes with >95% … of all these heterozygous variants placed into a single haplotype block,” the scientists report. “With such global and complete haplotypes we assigned ~81% of the original PacBio HiFi reads to either parental haplotype 1 (H1) or haplotype 2 (H2).”

The team then used two tools, Canu and Peregrine, to assemble the haplotype-separated reads. A small number of chimeric contigs were corrected with Strand-seq data and the SaaRclust algorithm. The final contig N50s of the fully phased assemblies were 25.8 Mb and 28.9 Mb for the two haplotypes. The assemblies were highly accurate, with base-pair quality scores higher than QV40; nearly all gene-disrupting indels in the sequence were found to be true biological events, not assembly artifacts. By titrating HiFi read coverage, the authors found that around 15-fold coverage of each haplotype is sufficient to produce an accurate, contiguous assembly.
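For readers less familiar with the two metrics quoted above, here is a minimal sketch of how contig N50 and a Phred-scaled quality value (QV) are conventionally defined. It is not the authors' pipeline; the function names and the toy contig lengths are hypothetical and chosen only for illustration.

```python
def contig_n50(lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def phred_to_error_probability(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Hypothetical toy assembly (arbitrary units), not data from the preprint.
toy_contigs = [50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))            # the 50 and 40 contigs cover 90 of 150
print(phred_to_error_probability(40))     # QV40 is about one error per 10,000 bases
```

Under this definition, an N50 of 25.8 Mb means that half of the assembled sequence sits in contigs at least 25.8 Mb long, and a quality above QV40 means fewer than roughly one erroneous base per 10,000.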

“Our assembly strategies allow us to transition from ‘collapsed’ human assemblies of ~3 Gbp to fully phased assemblies of ~6 Gbp where all genetic variants, including [structural variants], are fully phased at the haplotype level,” the scientists report. In addition to the importance of using this method for assembling individual genomes, the authors note, “Fully phased, reference-free genomes are also the first step in constructing comprehensive human pangenome references that aim to reflect the full range of human genome variation.”


Friday, December 27, 2019

SMRT Sequencing Highlights – Top Publications of 2019

With the release of the award-winning Sequel II System, 2019 was an exciting year for the SMRT Sequencing community. We were inspired by our users’ significant contributions to science across a wide range of disciplines. As the year draws to a close, we have taken this opportunity to reflect on the many achievements made by members of our community, from newly sequenced plant and animal species to human disease breakthroughs.

 

“It has been another phenomenal year for science. The introduction of the Sequel II System will accelerate discovery even more, and I can’t wait to see what 2020 will hold.”

Jonas Korlach, Chief Scientific Officer

 

Human Biomedical Research

The year brought incredible insights into human genetics. Some researchers homed in on single mutations, while others zoomed out to explore variation on a population scale. PacBio technology was also selected for new large-scale sequencing projects, including the NHGRI Human Genome Reference Program and the All of Us program. Here are some of our favorite publications from the year:

 

  • The cause of progressive myoclonic epilepsy in one family, which had eluded detection by standard whole-exome sequencing, was revealed with PacBio whole-genome sequencing, as reported in Journal of Human Genetics and on our blog.
  • New insights into specific human populations were revealed in several studies, including Melanesians, as reported in Science, and Tibetans, as reported in National Science Review.
  • Double mutations in the PIK3CA oncogene were found to influence targeted therapy, as highlighted in Science and on our blog.
  • The importance of comprehensive variant detection was featured in several papers. University of Washington researchers Mitchell R. Vollger and Evan Eichler reported that “HiFi may be the most effective standalone technology for de novo assembly of human genomes” in their Annals of Human Genetics paper (read our blog), while members of the Human Genome Structural Variation Consortium reported “the most comprehensive assessment of SVs in human genomes to date” in Nature Communications. University of Michigan researchers Steve S. Ho and Ryan E. Mills shared their review entitled “Structural variation in the sequencing era.”
  • A PLoS One paper by Mayo Clinic researchers demonstrated the use of No-Amp targeted sequencing to interrogate the sequence structure of expanded repeats in Fuchs Endothelial Corneal Dystrophy.
  • The utility of the PacBio Iso-Seq method for studying disease risk genes was showcased in a Frontiers in Genetics paper by PacBio and Duke University researchers studying transcripts across synucleinopathies.

Plant & Animal Sciences

Commoner’s law of ecology states that “everything is connected to everything else,” and this was highlighted in several studies that showed the interdependence of microbes, plants, insects, and other animals. International consortia such as the Vertebrate Genomes Project, the Earth Biogenome Project, and the Sanger Institute’s 25 Genomes Project released many new reference genomes, which will bolster our understanding of individual species as well as their interactions with their ecosystem cohabitants. Here are some of our favorite publications from the year:

 

  • Korean scientists provided a great example of mutualistic interactions in their Nature Communications paper examining the relationships between Streptomyces bacteria, strawberry plants, and pollinating bees.
  • A USDA project to sequence the spotted lanternfly showcased the power of SMRT Sequencing to rapidly generate high-quality genomes from the DNA of single insects to fight invasive species.
  • The latest Nature publication from the Cantu Lab delved into a largely unexplored feature of plant genomes — structural variants — in a study of the population genetics in grapevine domestication.
  • Pathologists interested in uncovering the secrets of plant immunity used PacBio targeted sequencing to create inventories of NLR genes, which are candidates for engineering new pathogen resistance (read our blog).
  • For shrimp, which have notoriously hard genomes to sequence, an isoform-level transcriptome reference generated with the Iso-Seq method was reported in Fish and Shellfish Immunology and summarized on our blog.

Microbiology & Infectious Disease

From C. difficile to symbiotic defense systems, we were treated to new insights in the realm of microbiology. We also learned about a new way to use an old method to provide unprecedented taxonomic resolution at species and strain level and gained insight into intra-bacterial defense. Here are some of our favorite publications from the year:

 

  • Not only has the Mount Sinai Pathogen Surveillance Program adopted SMRT Sequencing for continuous monitoring and disease control, but the accumulated PacBio data has also inspired new research, including a paper published in Nature Microbiology on the discovery of a conserved orphan methyltransferase that drives C. difficile infection persistence (read our blog).
  • A team of researchers at the Jackson Laboratory published a study in Nature Communications, also featured on our blog, that used HiFi sequencing to unlock the full potential of 16S rRNA sequencing, providing taxonomic resolution of the human gut microbiome at species and strain level.
  • PacBio reference genomes enabled a groundbreaking study published in Nature of intra-bacterial defense genes in the human gut microbiome by researchers at the University of Washington.
  • As published in Science, long reads were also used to reconstruct a tripartite symbiotic factory for a marine toxin, involving bacteria, algae and a sea slug.

Did we miss one of your favorite publications of 2019? Tweet your favorites to us @PacBio, using #PoweredbyPacBio. And check out our searchable publications database for more than 1300 examples of outstanding SMRT Science from 2019.

 
