This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
Whether you are seeking to characterize microbial diversity in the gut, or distinguish between pathogenic and commensal bacteria on the skin, full-length 16S rRNA sequencing using PacBio systems is a valuable tool for metagenomics studies, according to microbiology researchers in a recent webinar.
One of the biggest advantages, according to PacBio’s Ashby, is the ability to do gene prediction directly on high-quality HiFi reads, with no assembly required. This differs from shotgun assembly of short-read data, in which anywhere from 20% to 70% of the data will not map to the resulting assembly.
“With HiFi sequencing, every single read yields 7-9 completely intact genes,” Ashby said. “This lets you get information even from species that have very low representation in the data, because you don’t have to have enough coverage to do assembly to improve the contiguity nor to do error correction, as is the case of other long-read technologies.”
The high accuracy of HiFi reads also means you can use existing NGS tools and pipelines, without any modification, she added.
The 16S rRNA sequencing protocol uses an asymmetric barcoded system to enable very high levels of multiplexing, and an all-in-one kit available from Shoreline Biome makes extraction, amplification, and analysis even easier. Ashby said users can multiplex (at 96 or 192x) to get up to 3.6 million full-length 16S sequences. For shotgun sequencing, users can expect up to 2.4 million reads, each about 10 kb in length.
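As a rough back-of-the-envelope check (the 3.6 million figure comes from the webinar; the even split across samples is an idealization, since real runs vary with barcode balance and loading), per-sample read depth at each multiplexing level works out to:

```python
def reads_per_sample(total_reads: int, n_samples: int) -> int:
    """Idealized even split of a run's reads across barcoded samples."""
    return total_reads // n_samples

# ~3.6 million full-length 16S reads per run, split across barcodes:
print(reads_per_sample(3_600_000, 96))   # 37500 reads per sample
print(reads_per_sample(3_600_000, 192))  # 18750 reads per sample
```

Even at 192-plex, tens of thousands of full-length 16S reads per sample is ample depth for most community-profiling studies.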
Of the 450,000 pre-term babies delivered in the United States each year, about 10% of those born before 32 weeks develop necrotizing enterocolitis (NEC), a severe infection that arises when bacteria normally confined to the intestine escape through an impaired intestinal barrier. The resulting cascade of inflammation can lead to multi-organ failure and death in about half of afflicted babies.
In hopes of better understanding the development of the condition and the “leaky gut” that contributes to it, research associate Bing Ma, from the Institute for Genome Sciences at the University of Maryland, set out to characterize gut microbiota in preemies using 16S rRNA sequencing.
She started with short-read sequencing and found evidence that increased microbial diversity correlated with decreased intestinal permeability; higher Clostridiales abundance, in particular, was associated with an improved intestinal barrier.
“These were very exciting results,” Ma said. “However, there were limitations, and outstanding questions remained. The taxonomic resolution in these short regions of 16S was not optimal.”
To address these shortcomings in a follow-up study with an expanded cohort, Ma added full-length 16S PacBio sequencing, which picked up Bifidobacterium species that were missed by the first study. This proved important, as both Bifidobacterium and Clostridiales were found to be linked to intestinal permeability. Ma was able to obtain species-level resolution for 88% of the reads; for most of the remaining sequences, resolution was limited largely by gaps in 16S rRNA databases. This high level of resolution allowed Ma to map the subspecies dynamics of B. longum and B. breve.
Ma then obtained long-read metagenomics shotgun data and was able to obtain a number of closed, circularized metagenome-assembled genomes (MAGs), including one of B. breve. The MAG of B. breve revealed a carbohydrate utilization pathway involved in the metabolism of oligosaccharides in human breast milk to short chain fatty acids. Breastfeeding and lower use of antibiotics seemed to contribute to the abundance of the bacteria, which in turn led to improved outcomes for the babies.
“Through metagenome sequencing and genome assembly, we get some mechanistic understanding of the role of the gut microbiome in intestinal barrier maturation in early preterm neonates, and that will help us for future development of rationally formulated live biotherapeutics,” Ma said.
Plenty of insights from polymorphisms
How much can you learn about the microbiome simply from polymorphisms revealed by 16S rRNA sequencing? Quite a lot, according to Jackson Laboratory PI and Human Microbiome Project leader George Weinstock (@geowei).
Weinstock first became convinced of the utility of tracking 16S polymorphisms during a 2013 study of Propionibacterium acnes. Many of the bacteria on our skin are strains of P. acnes, and Weinstock wanted to see whether he could differentiate pathogenic from commensal strains using ribotypes.
He isolated hundreds of P. acnes strains and conducted full-length Sanger sequencing of more than 31,000 16S rRNA gene clones. Many of the strains differed by only one or two nucleotides, but he was able to distinguish between them, identifying five found almost exclusively on acne lesions and one found almost exclusively on healthy skin.
“Strains can have functional differences, and sometimes the haplotypes are very tightly linked to the polymorphisms in the 16S genes, so by just looking at the 16S gene sequences, you can infer something functionally about the strains,” Weinstock said.
More recently, he used PacBio sequencing to see whether circular consensus sequencing could pick up these types of polymorphisms in E. coli, which has seven well-characterized 16S rRNA genes. He sequenced a mock community of 36 strains and compared the results to both a reference sequence and a shotgun sequence of a pure E. coli isolate, with favorable results. To test clinical utility, he also compared the sequences to that of a highly pathogenic strain (E. coli O157 Sakai).
“We could distinguish these two strains from each other even though there’s only a few mutations, looking at these polymorphisms,” Weinstock said.
In order to confidently call such polymorphisms, however, you need high accuracy, and PacBio sequencing has become the preferred method in the Weinstock lab, he added. Like Ma, Weinstock has found that full-length sequencing identifies species that partial 16S sequencing misses.
“The full-length sequences are definitely the gold standard, no doubt about it,” Weinstock said. “It really is worth putting the effort into full-length sequences.”
Watch the webinar:
The study of genomics has revolutionized our understanding of science, but the field of transcriptomics grew out of the need to explore the functional impacts of genetic variation. While different tissues in an organism may share the same genomic DNA, they can differ greatly in which regions are transcribed into RNA and in their patterns of RNA processing. Reviewing the history of transcriptomics makes the advantages of RNA sequencing with a full-length transcript approach clearer.
Reaching for the Transcriptome
Even before genome sequencing became commonplace, scientists were able to measure gene expression activity using hybridization approaches like Northern blots, or later, microarrays. While these techniques provided a rigorous method for RNA measurement, they were limited in that they could only measure known transcripts. Additionally, many of these methods made the subtle differences between transcript isoforms difficult to decipher.
As next-generation sequencing (NGS) grew, researchers had the ability to interrogate the activity of all expressed genes including those that were previously unknown. This was a huge leap forward in the ability to characterize the transcriptome in an organism. The approach, known as RNA-seq, quickly replaced microarrays in many labs due to the enormous amount of information it could generate.
Understanding gene activity — not just the genes encoded by the genome, but the ones that are turned on at a specific time or in a specific cell or tissue — is critical for elucidating biological consequences. But even as RNA-seq expanded, scientists grew concerned about the accuracy of results when NGS short reads had to be pieced back together computationally to create whole transcripts. This is especially difficult in complex eukaryotes where one gene can generate many isoforms that differ in their transcription start site and alternative RNA processing.
In a Mendelspod interview, Stanford genomics expert Mike Snyder (@SnyderShot) described the RNA-seq process this way: “We take RNA, we blow it up into little fragments, and then we try and assemble them back together to see what the transcriptome looked like in the first place. … You can’t always figure out which parts of the puzzle belong together.”
A similar concept was presented by Ian Korf (@IanKorf) in a Nature Methods article, where he equated RNA-seq to reassembling magazine articles that had been put through a shredder. Suffice it to say, there have been concerns about the need for extensive computational inference when using an RNA-seq approach.
Capturing the Whole Transcriptome with the Advantages of Long Reads
What scientists really needed went beyond RNA-seq; for the most accurate view of biology possible, they had to have a way to capture isoforms in their entirety. The introduction of long-read sequencing answered this need.
With Single Molecule, Real-Time (SMRT) Sequencing, researchers can generate highly accurate long reads, or HiFi reads, that are tens of kilobases long. That’s enough to capture most transcripts in a single sequencing read, with no downstream bioinformatic assembly required. This became known as the Iso-Seq method because it enabled full-length isoform sequencing, which can be used to explore a transcriptome in its entirety or in a targeted fashion to examine individual genes.
As scientists began using SMRT Sequencing for transcriptome projects, a common theme emerged: long-read studies, even of the most well characterized organisms, routinely showed far more transcript diversity than previously observed. It quickly became clear that short-read NGS platforms greatly underestimated the diversity of expressed isoforms. Key published examples of the Iso-Seq method revealing previously unknown isoform diversity across different types of organisms include:
- Revealing the Transcriptomic Complexity of Switchgrass by PacBio Long-read Sequencing
- Full-length Transcriptome Analysis of Litopenaeus vannamei Reveals Transcript Variants Involved in the Innate Immune System
- Single-Molecule Real-Time (SMRT) Full-Length RNA-Sequencing Reveals Novel and Distinct mRNA Isoforms in Human Bone Marrow Cell Subpopulations
Beyond Alternative Splicing
In addition to alternative splicing, the Iso-Seq method is important for understanding alternative polyadenylation, genome annotation, fusion transcripts, isoform variant phasing, and long noncoding RNAs. Here are just a few examples of how the Iso-Seq method allowed for new transcriptome studies:
Alternative Polyadenylation
Connect alternative polyadenylation motifs to specific isoforms to determine their role in agronomically important traits.
Wang, B., et al. (2018) A comparative transcriptional landscape of maize and sorghum obtained by single-molecule sequencing. Genome Research, 28(6), 921-932.
Genome Annotation
Gain a complete view of the porcine transcriptome, including the identification of genes missing in the assembly.
Warr, A., et al. (2020) An improved pig reference genome sequence to enable pig genetics and genomics research. GigaScience, 9, 1-14.
Fusion Transcripts
Discover novel fusion transcripts in clinically relevant genes to understand disease mechanisms.
Tian, L., et al. (2019) Long-read sequencing unveils IGH-DUX4 translocation into the silenced IGH allele in B-cell acute lymphoblastic leukemia. Nature Communications, 10(1), 2789.
Isoform Variant Phasing
Detect and phase mutations in disease genes for clinically relevant insights.
Vasan, N., et al. (2019) Double PIK3CA mutations in cis increase oncogenicity and sensitivity to PI3Kα inhibitors. Science, 366(6466), 714–723.
Long Noncoding RNAs
Identify long noncoding RNAs for a more comprehensive view of transcriptional and translational regulation.
Teng, K., et al., (2019) PacBio single-molecule long-read sequencing shed new light on the complexity of the Carex breviculmis transcriptome. BMC Genomics, 20(1), 789.
Full-length transcript sequencing offers many benefits to scientists looking to interrogate transcriptomes, whether they are exploring the human transcriptome to better understand health and disease or studying plants and animals to advance agriculture and conservation efforts.
Explore other posts in the Sequencing 101 series:
“We are now embarking on an era where all genetic variation in an individual will be completely discovered,” write Glennis Logsdon (@glennis_logsdon), Mitchell Vollger (@mrvollger), and Evan Eichler in a recent Nature Reviews Genetics paper. “Hundreds and ultimately thousands of new human reference genomes will be produced.” A decade ago that would have sounded impossible, but today this bold proclamation is widely accepted in the genomics community — a telling sign of the remarkable innovation that has driven genome sequencing in recent years.
In their review, the University of Washington scientists give credit for much of these accomplishments to advancements in long-read sequencing. “Sequencing technology is the ‘microscope’ by which geneticists study genetic variation,” they write, “and it is clear that long-read technologies have provided us with a new ‘lens and objective’ for understanding DNA and RNA variation, structure and organization.”
That new lens has allowed researchers to fill in many of the blind spots left by short-read sequencing, which is limited to read lengths of just a few hundred bases. These “are too short to detect more than 70% of human genome structural variation (that is, differences that involve 50 bp or more), with intermediate-size structural variation (less than 2 kb) especially under-represented,” the authors note. Long reads generated using PacBio sequencing, on the other hand, can span tens of kilobases.
Short-read sequencing platforms also struggle to get through repetitive regions or regions with extreme GC content. “For example, even PCR-free, short-read genomic libraries show up to twofold reductions in sequence coverage when the GC composition exceeds 45%, limiting the ability to discover genetic variation in some of the most functionally important regions of our genome,” the scientists report. Such regions include first exons, centromeres, telomeres, and segmental duplications.
Approaches to extend the capabilities of short reads — such as linked reads, synthetic long reads, and Hi-C — “are generally inferior to strict long-read sequencing approaches” for many applications, the authors write.
The review provides a great education on the applications of long-read sequencing, such as detecting structural variation, enabling diploid and even telomere-to-telomere human assemblies, and characterizing the transcriptome. The authors also explain the various long-read data types, including PacBio HiFi reads, “the first data type that is both long (greater than 10 kb in length) and highly accurate (greater than 99%).” With HiFi reads, the scientists add, it is not necessary to use short-read data for error correction.
The accuracy of HiFi reads, combined with the throughput of the Sequel II System, provides a cost-effective option for variant discovery in population-scale or family-based sequencing, the scientists note. Even relatively low 10- to 15-fold HiFi coverage is useful for finding meaningful variation. With diploid assemblies, long-read sequencing “will revolutionize genomics by revealing the full spectrum of human genetic variation, resolving some of the missing heritability and leading to the discovery of novel mechanisms of disease,” Logsdon et al. write.
“The wealth of additional information afforded by single-molecule, long-read sequencing compared with short-read sequencing promises a more comprehensive understanding of genetic, epigenetic and transcriptomic variation and its relationship to human phenotype,” the scientists conclude.
It’s not unusual for progeny to outperform their parents, and it’s often the goal in plant breeding. But tracing the molecular basis of such heterosis can be difficult, especially in diploid species with high genetic diversity and allele-specific expression like maize.
Cold Spring Harbor scientists have tackled the challenge using the PacBio Iso-Seq method and a new tool, IsoPhase.
As reported in Communications Biology, Bo Wang, Doreen Ware, and colleagues performed an isoform-level phasing study in maize using the temperate line B73 and the tropical line Ki11, as well as their reciprocal crosses (B73 × Ki11; Ki11 × B73), which exhibit dramatic differences in height, root number, and biomass from their parents.
The Cold Spring Harbor team phased 6,907 genes in the two reciprocal hybrids and were able to identify parental origin as well as novel isoforms in the hybrid lines. They also measured differences in haplotype-specific expression.
“Full-length, single-molecule sequencing provides an unprecedented allele-specific view of the haploid transcriptome,” the authors wrote.
“Haplotype phasing using long reads allowed us to accurately calculate allele-specific transcript and gene expression, as well as identify imprinted genes and investigate the cis/trans-regulatory effects.”
Because alleles from the same gene can generate heterozygous transcripts with distinct sequences, full analysis of allele-specific expression (ASE) is necessary to achieve a thorough understanding of transcriptome profiles. Previous attempts using short-read RNA-seq have provided expression information, but have not been able to provide full-length haplotype information.
The Cold Spring Harbor team used the Sequel platform to produce a single-molecule full-length cDNA dataset for the two maize parental lines and their reciprocal hybrid lines from root, embryo, and endosperm.
Barcoded SMRTbell libraries produced 4,898,979 HiFi reads, yielding 250,168 full-length, high-quality consensus transcript sequences. After mapping to the maize RefGen_v4 genome assembly and assessing for redundancy, the team ended up with 3,344 novel transcripts.
For phasing of these transcripts, the team applied the new IsoPhase tool, which uses the full-length nature of the reads and SNP calling to phase reads.
To determine which allele belonged to B73 or Ki11, they took advantage of the fact that all B73 reads must only express one allele, whereas all Ki11 reads must only express the other. Once the parental alleles were identified, they obtained the allelic counts for the F1 hybrids.
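IsoPhase is the team’s published tool; the sketch below is only an illustration of the allele-assignment logic just described, with every function name and data layout invented for this post. Each read is reduced to its bases at the phased SNP positions; because the inbred parents are homozygous, their reads define the two haplotypes, and F1 hybrid reads are then tallied by the haplotype they match:

```python
from collections import Counter

def consensus_haplotype(reads, snp_positions):
    """Consensus base at each SNP across reads from one homozygous parent.
    `reads` is a list of dicts mapping SNP position -> observed base."""
    return tuple(
        Counter(read[pos] for read in reads).most_common(1)[0][0]
        for pos in snp_positions
    )

def allelic_counts(f1_reads, hap_b73, hap_ki11, snp_positions):
    """Tally F1 hybrid reads by the parental haplotype they match."""
    counts = {"B73": 0, "Ki11": 0, "ambiguous": 0}
    for read in f1_reads:
        observed = tuple(read[pos] for pos in snp_positions)
        if observed == hap_b73:
            counts["B73"] += 1
        elif observed == hap_ki11:
            counts["Ki11"] += 1
        else:
            counts["ambiguous"] += 1  # sequencing error or unexpected variant
    return counts

# Toy example: two SNPs distinguish the parental alleles.
snps = (101, 250)
b73_reads  = [{101: "A", 250: "G"}, {101: "A", 250: "G"}]
ki11_reads = [{101: "C", 250: "T"}, {101: "C", 250: "T"}]
hap_b73  = consensus_haplotype(b73_reads, snps)   # ('A', 'G')
hap_ki11 = consensus_haplotype(ki11_reads, snps)  # ('C', 'T')
f1 = [{101: "A", 250: "G"}, {101: "C", 250: "T"}, {101: "A", 250: "T"}]
print(allelic_counts(f1, hap_b73, hap_ki11, snps))
# {'B73': 1, 'Ki11': 1, 'ambiguous': 1}
```

Because each HiFi read spans the full transcript, all of a read’s SNPs are observed together, which is what makes this direct haplotype matching possible without statistical phasing.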
“Sequencing of full-length haplotype-specific isoforms enabled accurate assessment of allelic imbalance, which could be used to study the molecular mechanisms underlying genetic or epigenetic causative variants and associate expression polymorphisms with plant heterosis,” the authors wrote.
The approach does not require parental information (although parental data could be used to assign maternal and paternal alleles) and can be used on exclusively long-read data, they added.
“To our knowledge, this is the first full-length isoform phasing study in maize, or in any plant, and thus provides important information for haplotype phasing to other organisms, including polyploid species,” the authors wrote.
Starting a sequencing project can be daunting. First of all, there are several types of sequencing technologies, each based on unique processes. At PacBio, we use a technology called Single Molecule, Real-Time (SMRT) Sequencing.
Learn how SMRT Sequencing works in this short video:
Although each sequencing project is unique, there are five main steps to go from DNA to discovery with SMRT Sequencing:
Step 1: Sample Prep
Similar to cooking, the best results start with the best ingredients. The ideal sequencing input is high molecular weight DNA. There are plenty of kits on the market that can help with this, and PacBio also maintains a collection of DNA extraction protocols to aid these efforts.
Expert sample wrangler Olga Pettersson (@OlgaVPettersson) of SciLifeLab at Uppsala University, also advises: “Aim for getting molecules as long as you can, as pure as you can, as fresh as you can.”
When in doubt, you can always outsource the task to experts at other labs or sequencing centers, such as our Certified Service Providers.
Step 2: Library Prep
Library preparation for all of the major next generation sequencing (NGS) platforms requires the ligation of specific adapter oligos to fragments of the DNA to be sequenced. The DNA has to be fragmented to the optimal length determined by the sequencing technology you are using.
PacBio uses a SMRTbell library format, in which DNA fragments are capped on both sides with ligated hairpin adapters, where the sequencing primers attach. This creates a circular template for the polymerase to navigate. These can be created for libraries of varying insert lengths — from 250 bp to greater than 25,000 bp. Samples can also be barcoded and multiplexed to increase throughput. Learn more about kits for fast and easy library preparation.
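Because the SMRTbell template is circular, the polymerase can traverse the same insert repeatedly, which is what later enables consensus accuracy. As a rough sketch (the 45 bp adapter length and the pass definition here are illustrative assumptions, not kit specifications), the number of passes scales inversely with insert length:

```python
def approx_passes(polymerase_read_bp: int, insert_bp: int, adapter_bp: int = 45) -> int:
    """Approximate full passes around a circular SMRTbell template.
    Each pass covers one strand of the insert plus one hairpin adapter;
    45 bp is an illustrative adapter length, not a kit specification."""
    return polymerase_read_bp // (insert_bp + adapter_bp)

# A long polymerase read over a short insert yields many passes:
print(approx_passes(150_000, 15_000))  # 9 passes over a 15 kb insert
print(approx_passes(150_000, 1_000))   # 143 passes over a 1 kb amplicon
```

This is why shorter inserts yield more passes, and hence higher consensus accuracy, from the same polymerase read.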
Step 3: Sequencing
Once your library is prepared, the PacBio sequencing system, such as the Sequel II System, takes over. At the heart of SMRT Sequencing is the SMRT Cell, which contains millions of tiny wells called zero-mode waveguides (ZMWs). Single molecules of DNA are immobilized in these wells, and as the polymerase incorporates each nucleotide, light is emitted, and nucleotide incorporation is measured in real time. The reactions are recorded in a format that can then be analyzed using on-instrument tools, as well as additional software.
You can explore this interactive to learn about the Sequel II System and SMRT Sequencing Applications including whole genome sequencing for de novo assembly, comprehensive variant detection, full-length RNA sequencing, metagenomic sequencing and more.
With the current sequencing performance, a single SMRT Cell can be used to generate a reference-quality assembly of a 2 Gb genome, characterize a whole transcriptome and identify alternative splicing, or determine the composition of up to 96 microbiome samples. Find out what else you can do with a single SMRT Cell.
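To put “a 2 Gb genome from one SMRT Cell” in perspective, fold coverage is simply total sequenced bases divided by genome size; the yield figure below is purely illustrative, not a quoted instrument specification:

```python
def fold_coverage(yield_bp: float, genome_bp: float) -> float:
    """Depth of coverage = total sequenced bases / genome size."""
    return yield_bp / genome_bp

# Illustrative only: ~160 Gb of raw yield over a 2 Gb genome.
print(fold_coverage(160e9, 2e9))  # 80.0-fold coverage
```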
Step 4: Data Analysis
As mentioned, initial analysis occurs within the instrument itself to provide the sequence output. Secondary analysis can then be performed with the SMRT Link user interface and SMRT Analysis, which feature a suite of analytical applications and visualization tools.
Unique to SMRT Sequencing is the ability to sequence the same DNA molecule multiple times, generating highly accurate long reads, or HiFi reads. For users like Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology, this is quite beneficial.
“With HiFi reads, we can take the reads and do something with them right away. We don’t have to go through an enormous amount of downstream computation and processing to get to the point of having some sequence that we can evaluate,” he stated.
Depending on what you want to do with your data, there are also a range of tertiary analysis tools you can use, many of which have been developed by users. You can delve into example datasets to get to know SMRT Sequencing data and check out PacBio DevNet for easy access to open-source community-developed analysis tools and other resources to help further your analysis and data interpretation.
Step 5: Understanding Biology
The final and most exciting part of any good sequencing project is taking your study from data to discovery by using it to improve our understanding of biology. Members of the SMRT community have already greatly expanded knowledge across a huge range of fields. Check out our blog and SMRT resources library to find many examples of how SMRT Sequencing has been applied to biological questions, from confirming the causative variant of a disease to determining the sex of the asparagus plant and identifying which microbes are best for cheese making.
For an additional introduction to SMRT Sequencing, including a Q&A with expert users, check out this video.
If you are interested in using SMRT Sequencing in your research and would like a free project consultation, please connect with a PacBio scientist.
Explore other posts in the Sequencing 101 series:
Tackling larger and larger genomes has been an attractive pursuit for many scientists as sequencing technologies improve at rapid rates. But what about the other end of the spectrum — the tiny organisms that comprise much of the diversity of life?
An obvious obstacle to decoding the DNA of small organisms such as insects, nematodes and other arthropods is collecting enough of it to actually sequence (usually multiple micrograms worth). Until recently, the solution was to pool DNA from many of these tiny creatures to create a representative sample, and extrapolate the biology of the individual constituents from there.
But this is far from ideal: pooled data do not reflect the true genome of any individual within the species, and differences between wild specimens and those raised in the lab (often through inbreeding) muddy the picture further. Nor does pooling resolve haplotypes. Plus, it can take a long time to breed and collect enough specimens.
Recently, PacBio announced a way to create high-quality de novo genome assemblies from just a couple hundred nanograms of starting genomic DNA. The gapless, megabase-scale contiguity of these assemblies offers many advantages, such as providing insights into promoters, enhancers, repeat elements, large-scale structural variation relative to other species, and many other aspects of functional and comparative genomics.
This ‘low DNA input’ workflow was embraced by many, and its first use, a collaboration with scientists at the UK’s Wellcome Sanger Institute, resulted in the assembly of an Anopheles coluzzii mosquito genome from the unamplified DNA of a single female insect.
Erin Bernberg (@ErinBernberg) of the University of Delaware Sequencing and Genotyping Center, a PacBio certified service provider, took sequencing to another extreme when she extracted DNA from 1.5 cm ice worms and successfully helped Scott Hotaling (@MtnScience) at Washington State University delve into the genomics of the annelids.
“Low-input works,” Bernberg said during a webinar about the work. “You don’t need as much DNA as everyone is worrying about with PacBio. You can generate good data with it.”
Since then, the workflow has been adapted to the high-throughput Sequel II System and adopted by the Darwin Tree of Life project, which aims to sequence all 60,000 species in the UK, many of which are too small for standard long-read sequencing. Mark Blaxter, Tree of Life Genomics Programme Lead, reported 44 new high-quality Lepidoptera genomes completed with the low DNA input workflow at the Plant and Animal Genomes Conference in early 2020.
And Sanger scientist Chris Laumer (@tendersombrero) shared his work on optimised protocols for sequencing individual meiofaunal organisms using long-range PCR library amplification, HiFi sequencing, and linked-read scaffolding, in a SMRT Leiden presentation.
How low can you go?
Scientists at the Max Planck Institute in Germany are testing whether it is possible to reduce DNA sample sizes even further – to as little as 5 ng of genomic DNA.
Franziska Beran, group leader at the MPI for Chemical Ecology, Bruno Huettel, Head of the Max Planck Genome-centre Cologne (MP-GC), and Christian Woehle, bioinformatician at MP-GC, are using an amplification-based ultra low DNA input workflow to investigate the genome of the horseradish flea beetle (Phyllotreta armoraciae).
The tiny beetle has a fascinating defense strategy known as sequestration, in which the beetles not only overcome the pungent chemical compounds of the horseradish plant, but exploit those chemicals to make themselves unpalatable for their own enemies.
The team has managed to extract valuable genomic data from just 5 ng of genomic DNA from single beetles. They also worked with specimens that had been maintained in ethanol, a standard long-term specimen storage solution, suggesting that it may now be possible to sequence specimens from historical collections.
“With this new solution from PacBio, we can now touch specimens which were not possible previously to apply long-read sequencing,” Huettel said. “In the past, we could pool several individuals, but this would produce a mess of data which cannot be resolved into single unique contigs, as we would sequence several haplotypes.”
Together with colleagues at the LOEWE Centre for Translational Biodiversity Genomics (TBG) in Senckenberg, Frankfurt, the MP-GC is also using the ultra low DNA input workflow to delve into the genomics of springtails — minute arthropods that are distant relatives of insects.
Present in the oldest known terrestrial ecosystems (~410 Ma), springtails have since diversified into a wide range of ecological niches, from tree canopies to ice-cold regions and deep cave systems. They are among the most abundant organisms in soil and thus a major component of soil function.
Scientists are collecting DNA from two types of springtails: Sminthurides aquaticus and Desoria tigrina. The former is a palearctic species living on floating leaves and wet rocks in ponds, and it is one of the few springtail species displaying a strong sexual dimorphism. The latter belongs to a complex of species common in compost, and thus a commensal of agricultural and gardening activities.
The springtails were sequenced as part of the MetaInvert project, which establishes a database of genomes for several hundred soil invertebrate species, including springtails, oribatid mites, nematodes, potworms, myriapods, and several other groups. These genomes are used to reveal evolutionary relationships, special adaptations and host-microbiome associations.
“The gathering of high-quality genomes is a needed resource to understand the evolutionary history and modalities of terrestrialization of arthropods and identifying the genomic traits that allowed the successful diversification of springtails,” said Clément Schneider of the TBG. “Among the applied prospects is the discovery of new natural products and better management of our soils through a better understanding of the functions of their inhabitants.”
Hear more from PacBio experts and users in the recent webinar, No Organism Too Small: Build High-Quality Genome Assemblies of Small Organisms with HiFi Sequencing.
Read more about the low DNA input workflows:
It’s well known that finding the genetic cause of rare diseases can be complex — that’s why so many remain unsolved.
But researchers are beginning to get a grasp on just how complex these conditions can be, thanks to the heightened power of PacBio.
PacBio principal scientist Aaron Wenger kicked off a recent webinar with a quote from University of Washington scientist Evan Eichler, who said “there are three key aspects to genetic disease associations: comprehensive variant discovery, accurate allele-frequency determination, and an understanding of the pattern of normal variation and its effect on expression.”
Wenger explained how HiFi reads can address each of these. He noted that while solve rates have improved slightly with each new sequencing technology, more than 50% of cases still remain unsolved. But he added that he is optimistic that the Sequel II System and HiFi reads will be the next transformative technology to increase solve rates, sharing several examples to support his claim.
Kristen Sund (@kristen_sund), a researcher at Cincinnati Children’s Hospital, then shared two specific examples where Single Molecule, Real-Time (SMRT) Sequencing enabled her to untangle complicated structural rearrangements that had eluded earlier detection using other technologies, including short-read sequencing.
“We set out to understand how new tools like the Sequel may help find an answer for some of our undiagnosed patients,” Sund said. “The goal was to use low-pass long-read SMRT Sequencing to obtain base pair resolution of the breakpoints and identify a gene that was responsible for the phenotype.”
Both cases involved patients with neurological disease who had a known chromosomal rearrangement but no known causative gene for their phenotype. Both had undergone previous genetic testing with results that were either “normal” or non-diagnostic.
The first case, a 9-year-old with short stature and hypotonia, came to the researchers’ attention when she arrived at the hospital for treatment of nocturnal epilepsy. The girl also had developmental and speech delays, and some self-stimulatory behaviors.
Sund’s testing revealed 31,324 structural variants in the girl’s genome. Notably, one of these structural variants included changes in a protein (MBD5) that regulates gene transcription, and has been linked to developmental delay, speech impairment, seizures, sleep disturbances, and autistic-like behaviors, in a condition known as MBD5-Associated Neurodevelopmental Disorder (MAND).
“I thought we had an answer for this patient, but I still had questions,” Sund said.
So she dug deeper, and discovered the case was far more complex than anyone expected, with not just one straightforward causative variant, but a complicated interplay of inversions, insertions, and deletions. Sund said she is still working through some additional complexity of mapping the exact breakpoint, but she was glad to be able to provide some answers to the girl’s parents.
“This family has been looking for an explanation for their daughter’s epilepsy for nine years. In addition to having an answer for their daughter and resources about the condition and its prognosis, they will also be able to use this information to connect with other families who have a similar diagnosis,” Sund said.
The second case was similarly complex – a movement disorder in a 17-year-old with chorea, myoclonus, anxiety and hypothyroidism, who also had a family history of early death due to lung cancer. In this case, Sund investigated whether a pathogenic variant of the NKX2-1 gene might have caused the condition. Paying particular attention to the breakpoints of chromosomal rearrangements, she found many surprises.
“Although this is a small sample size, it does make you wonder whether most – or all – simple rearrangements are actually many chromothripsis events with multiple breaks and rearrangements,” Sund said.
She noted that structural variant analysis should not be limited to coding regions, and concluded that there can be real clinical utility in using long-read technologies for diagnosis, prognosis and treatment of rare disease patients.
“This application will be incredibly meaningful to families who have been searching for a diagnosis,” Sund said.
Sund’s sequencing was carried out at the University of Minnesota Genomics Center, a PacBio Certified Service Provider, as part of the winning 2018 Structural Variation SMRT Grant project.
NGS Operations Manager Archana Deshpande said: “We have used the Sequel System across a range of applications, including whole genome sequencing for plants, Iso-Seq analysis of gene expression, multiplexed microbial whole-genome sequencing, and the detection of structural variants. Our new Sequel II System promises even more and we are extremely pleased with the amount and quality of data our preliminary runs have generated.”
Watch the webinar:
See additional information and examples of the use of SMRT Sequencing in rare disease research:
As the flurry of research around the SARS-CoV-2 virus continues at an unprecedented pace, scientists are beginning to tackle some of the more complex immunological responses with the help of Single Molecule, Real-Time (SMRT) sequencing.
Hundreds of people tuned in live to a special May 7 webinar, “Understanding SARS-CoV-2 and host immune response to COVID-19 with PacBio sequencing.”
Meredith Ashby, Director of Microbial Genomics at PacBio, described some of the resources being generated by both PacBio and our users in order to help labs who are using SMRT Sequencing technology to investigate SARS-CoV-2 and COVID-19.
These include two microbial sequencing protocols — the Mt. Sinai 1.5 kb and 2 kb amplicon protocol and the Eden 2.5 kb protocol — as well as guidelines for target-based capture using IDT probes and M13 tag barcoding.
PacBio scientists have been verifying these protocols and rebalancing primers to improve evenness of coverage, Ashby said. Results and additional resources are being continuously added to a resource page on our website, she added. Since the webinar, PacBio scientists have finished optimizing the Eden protocol for SMRT Sequencing. Modified primers that enable barcoding have been validated. In addition, the modified workflow now has two multiplex PCR reactions, as opposed to the 14 individual PCR reactions in the original version. Finally, the primer concentrations were rebalanced to reduce coverage variance to approximately 8-fold, enabling higher multiplexing of samples.
Personal profiling: HLA Typing
To attack the enormous range of pathogens our bodies are exposed to, our immune systems maintain an agile arsenal. Among its weapons are the human leukocyte antigens, or HLA. These interact with our T-cell receptors to activate and modulate our immune responses, and the HLA loci are some of the most polymorphic regions in the human genome.
There’s a huge diversity of HLA alleles across the population, and alleles are also mixed and matched within each individual. Melissa Laird-Smith (@lissagoingviral), assistant director of technology development at the Icahn School of Medicine at Mount Sinai in New York City, discussed why it is so important to understand personal HLA profiles, and how complicated it is to do so.
She focused on two particular classes of HLA alleles: Class I, expressed by most nucleated cells, which presents antigens to CD8+ T cells and can trigger cytotoxic T cell activity to clear any pathogen considered foreign, or ‘non self’; and Class II, expressed by macrophages, dendritic cells and B cells, which presents antigens to CD4+ cells, triggering ‘helper’ responses such as cytokine production and antibody production by B cells.
This activity relies on very specific binding, however, with HLA allele puzzle pieces only fitting into certain matching antigens.
“If an HLA allele-matched antigen combination is not available, T cells cannot recognize and activate a response,” Smith said. “So this idea that specific HLA alleles are important for mounting the appropriate immune response is critically important when you’re thinking about how HLA modulation can control response to viral infection.”
In terms of COVID-19, the range of responses observed in individual patients suggests these immune interactions play a critical role, and that elucidating it will be crucial to determining what it will take to protect us via a vaccine, she said.
She then shared details of the high-resolution HLA typing protocol the Mt. Sinai team developed. Combining PacBio sequencing with the commercially available and validated HLA NGSGo kit offered by GenDx, the Mt. Sinai protocol generates full-length genes, ranging from 3 to 6 kb, from 200 ng of high-quality genomic DNA.
“The advantage of long-read sequencing is that you go around a smaller amplicon multiple times, which allows you to collapse these reads and generate a highly accurate, intramolecular consensus, or HiFi read,” Smith said.
“When you map these HiFi reads to the full-length sequence of the HLA molecules, you get coverage that allows you to phase the two alleles at each position within an individual without imputation or bioinformatic reconstruction,” she added.
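The pass-collapsing idea Smith describes can be illustrated with a deliberately simple majority vote. This is only a sketch (the function name `ccs_consensus` is hypothetical): real circular consensus algorithms align the passes to each other and model insertions and deletions, whereas this toy assumes the passes are already aligned and differ only by substitutions.

```python
from collections import Counter

def ccs_consensus(passes):
    """Toy intramolecular consensus: given several noisy passes (equal-length
    strings) read from the same circularized molecule, call the per-position
    majority base. Assumes pre-aligned passes with substitution errors only."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

# Three passes, each with a single error at a different position,
# still recover the true molecule by majority vote.
print(ccs_consensus(["ACGTACGT", "ACGAACGT", "ACGTACCT"]))  # ACGTACGT
```

Because each pass comes from the same molecule, errors land at random positions and the vote converges quickly with only a handful of passes.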
Watch Laird-Smith’s full presentation:
Analyzing antibodies: IGH haplotype diversity
Immunoglobulins are also critical components of the immune system that are highly variable between individuals.
Sequencing has enabled scientists to interrogate variation in the expressed antibodies within human populations, but what hasn’t been as readily explored is how haplotype diversity within the antibody gene regions themselves contributes to the story, said Corey Watson (@cwatson29), assistant professor at the University of Louisville School of Medicine.
“A lack of genomic resources has hindered our ability to understand the role of IG genetic variation in phenotypic data and disease,” Watson said. “If we can understand some of these genetic determinants to the repertoire, then we can better define — or maybe even predict — what a repertoire might look like in a given person, and understand better how these signatures associate with functional responses.”
The immunoglobulin heavy chain (IGH) gene loci, located on chromosome 14, are extremely complex and poorly characterized at both the genomic and population level.
In addition to functional genes, these areas also include pseudogenes. Across the entire V, D and J region, there are about 130-150 genes within a given haplotype, making the locus incredibly repetitive. In fact, more than 50% of the sequence within this 1 Mb region falls within a segmental duplication. There are also many structural and copy number variants. These can be quite large, ranging from 9 to 60 kb, and can contain multiple genes. This means that any given haplotype can vary by tens of genes when compared to another.
“This is a challenge when thinking about doing genomics, particularly when thinking about how we utilize existing reference assemblies,” Watson said. “In the case of the two reference assemblies that we have currently, both of these missed genes, or don’t include genes that are known to occur in the human population.”
Yet, despite the fact that we now know that these regions are complex and that there is variation in the population, we have a poor set of tools with which to interrogate them, Watson said. This can have far-reaching impacts, he added.
Microarray technologies, for example, often do a poor job of tagging variation within the IGH loci — in one study, less than 50% of variants were tagged effectively.
“The implications of this, you could imagine, especially for immune system disorders, would be great, given the number of GWASes that have been conducted over the years,” Watson said.
To address the problem, Watson’s lab has:
- Done benchmarking in haploid cell lines with orthogonal BAC assembly data
- Used this data to develop a probe set for robust target capture of the IGH loci for PacBio HiFi sequencing
- Developed a pipeline for IGH locus assembly, variant calling and annotation, IGenotyper
- Begun conducting targeted IGH sequencing at scale for comprehensive high-throughput genotyping
“One reference simply is not enough,” Watson said. “Our goal is to create a pipeline that can generate haplotype-specific assemblies leveraging PacBio HiFi reads. Just by leveraging the longer PacBio reads, we are able to gain access to parts of the locus that previously were not being interrogated very well or accurately.”
Watch Watson’s full presentation:
Watch the full webinar:
You may also be interested in:
For scientists who utilize DNA sequencing in their research but are not experts in the underlying technology, it can be difficult to determine the accuracy of sequencing results — and even harder to compare accuracy across sequencing platforms. Furthermore, accuracy differs not only between technologies but also across genomic regions as some stretches of the genome are inherently more difficult to read.
It is critically important to understand accuracy in DNA sequencing to distinguish important biological information from sequencing errors.
What are the Types of Sequencing Accuracy?
There are two key types of accuracy in DNA sequencing technologies: read accuracy and consensus accuracy. Read accuracy is the inherent error rate of individual measurements (reads) from a DNA sequencing technology. Typical read accuracy ranges from ~90% for traditional long reads to >99% for short reads and HiFi reads.
Consensus accuracy, on the other hand, is determined by combining information from multiple reads in a data set, which eliminates any random errors in individual reads. Deeper coverage, meaning more reads from which to build a consensus, generally increases the accuracy of results. However, there are still limitations to calling consensus from multiple reads. Consensus calculation is a complicated and computationally expensive process, and it cannot overcome systematic errors. If a sequencing platform consistently makes the same mistake, then it will not be erased by generating more sequencing coverage.
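To see why deeper coverage helps only with random errors, consider a toy binomial model (an assumption-laden sketch of the idea, not how any production consensus caller works): if errors are independent, the chance that a majority of reads are wrong at the same base falls off sharply with coverage.

```python
from math import comb

def consensus_error(read_error, coverage):
    """Probability that a simple per-base majority vote is wrong, assuming
    errors are independent (random, not systematic) and, pessimistically,
    that all erroneous reads agree on the same wrong base.
    Use odd coverage to avoid tie-breaking."""
    return sum(
        comb(coverage, k) * read_error**k * (1 - read_error)**(coverage - k)
        for k in range(coverage // 2 + 1, coverage + 1)
    )

# Random errors wash out quickly with depth:
print(consensus_error(0.10, 1))   # a single 90%-accurate read
print(consensus_error(0.10, 15))  # 15-fold coverage, orders of magnitude lower
```

The model's own blind spot mirrors the point above: if every read makes the same systematic mistake, the majority vote is wrong at any depth, so no amount of extra coverage fixes it.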
To sidestep this problem, it is common to “polish” long reads that have systematic errors with high accuracy short reads. However, because of their read length, short reads cannot always map to the long reads unambiguously, limiting their ability to improve accuracy. In general, consensus is improved – and vastly simplified – by starting with highly accurate reads with no systematic biases.
How Does Accuracy Impact the Utility of Sequencing Data?
It is commonly known that certain genomic regions are more difficult for sequencers to get through than others. Centromeres and telomeres are notoriously tough because of the highly repetitive sequence they contain. Regions that are AT-rich or GC-rich are similarly difficult because they respond poorly to the amplification protocols required by some platforms. Palindromic sequences or hairpin structures are difficult to denature, making such regions challenging for sequencing tools that include a denaturation step.
Many scientists avoid these problems by opting for a single-molecule sequencing method that does not require amplification or denaturation, such as PacBio’s SMRT Sequencing technology. Because SMRT Sequencing can process even difficult regions, performing uniformly regardless of sequence context, it generates accurate results even in regions that would flummox other platforms. Selecting a platform without systematic bias, like the Sequel II System, is important to producing the most accurate sequence data.
The accuracy of a genome assembly goes beyond the accuracy of each individual base. Even perfect reads can contribute to poor accuracy if they are not ordered and oriented correctly in the assembly. This question of where to place the read is called mappability.
Reads containing only a piece of a large structural element, or consisting of highly repetitive sequences, can be very difficult to align, mapping ambiguously to many different locations in a reference. This is where short reads really struggle; because of their size, there is a greater chance that they will not contain enough unique sequence data to anchor them properly in a genome. Since HiFi reads stretch across many kilobases of DNA, they almost always contain unique flanking sequences that can be used to map them accurately in an assembly.
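A crude way to see why longer reads anchor better is to check whether a read contains any k-mer that is unique in the reference. This is an illustrative proxy only (`has_unique_anchor` is a hypothetical helper; real aligners use indexed, mismatch-tolerant search):

```python
from collections import Counter

def has_unique_anchor(read, reference, k=11):
    """True if the read contains at least one k-mer that occurs exactly once
    in the reference, i.e. a sequence that can place it unambiguously."""
    ref_counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    return any(ref_counts.get(read[i:i + k]) == 1 for i in range(len(read) - k + 1))

# A reference with two identical repeat arms around a unique insert:
ref = "ACGT" * 4 + "GGATCCA" + "ACGT" * 4
# A read confined to the repeat maps ambiguously; a longer read that
# reaches the unique flanking sequence does not.
print(has_unique_anchor("ACGTACGTACGT", ref, k=5))   # False
print(has_unique_anchor("ACGTACGTGGATC", ref, k=5))  # True
```

The second read is only one base longer per k-mer window into the unique flank, yet that is enough to anchor it, which is exactly the advantage multi-kilobase HiFi reads enjoy at genome scale.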
When exploring diploid or polyploid genomes, phasing means separating the different copies of each chromosome (e.g. maternal and paternal for diploid), known as haplotypes. With sufficient accuracy, the identity of nucleotides at each position in the genome can be compared with a reference sequence to identify SNVs, with a heterozygous locus indicating a difference in sequence between a homologous chromosome pair. This is where the inherent low accuracy of traditional error-prone long reads becomes a limitation – a high error rate makes it impossible to decide whether a disagreement between a reference and a data set is a variant or a sequencing error.
Another approach to obtaining phase information is to also sequence the parents of the individual whose genome you need phased. However, in many wild species the parents aren’t available, and a highly accurate long-read sequencing approach, like HiFi sequencing, is simpler. There are also computational methods (learn about Nighthawk) and approaches that use population haplotype frequency information to infer phasing.
Overall, phased genomes or variant calls are higher quality than haplotype-collapsed versions because they provide allelic information, which can be important for studying human diseases, crop improvement, evolution, and more. HiFi reads, with accuracy high enough to detect SNVs and read lengths long enough to link those SNVs across many kilobases, generate larger phased haplotype blocks.
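The read-based phasing described above can be sketched with a deliberately naive greedy partition. This is a toy (the function name `phase_reads` is hypothetical, and it assumes error-free alleles and reads that overlap the growing haplotype); real phasers such as WhatsHap handle sequencing noise and ambiguity properly.

```python
def phase_reads(reads):
    """Greedy toy phaser. Each read is a dict {SNV position: allele}.
    The first read seeds haplotype 1; every later read joins whichever
    haplotype it agrees with at shared heterozygous positions."""
    hap1, hap2 = dict(reads[0]), {}
    for read in reads[1:]:
        agree = sum(1 for pos, allele in read.items() if hap1.get(pos) == allele)
        disagree = sum(1 for pos, allele in read.items()
                       if pos in hap1 and hap1[pos] != allele)
        target = hap1 if agree >= disagree else hap2
        target.update(read)
    return hap1, hap2

# Four reads over three heterozygous SNVs resolve into two haplotypes:
hap1, hap2 = phase_reads([
    {0: "A", 1: "C"}, {1: "C", 2: "G"},
    {0: "T", 1: "G"}, {1: "G", 2: "C"},
])
print(hap1)  # {0: 'A', 1: 'C', 2: 'G'}
print(hap2)  # {0: 'T', 1: 'G', 2: 'C'}
```

The key requirement is that each read spans at least two heterozygous positions shared with other reads, which is why long, accurate reads yield longer phase blocks.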
As scientists analyze more and more genomic data, the role of sequence accuracy will likely only become more important. HiFi reads offer the benefits of high accuracy equivalent to short-read sequencing data, but with the length necessary for complex genome assemblies and phasing of variants across large swaths of the genome.
Explore other posts in the Sequencing 101 series:
The strides scientists have made in rare disease research lately are truly impressive. For an overview of recent progress, we encourage you to check out a new article in The Pathologist from our own Luke Hickey (@Luke_Hickey), Senior Director of Strategic Marketing. It describes how scientists have used long-read sequencing to find the genetic explanations for elusive rare diseases.
“Never before have our laboratory techniques been so successful at identifying rare diseases and elucidating their underlying biological causes,” Hickey writes. “The knowledge we obtain today opens the door to new treatments, giving hope to people who suffer from these rare disorders.”
The educational article offers a look at how various types of sequencing have been important for solving rare diseases, including an overview of the types of variants that can be accessed by short-read platforms (typically limited to single nucleotide variants and small indels) or by long-read SMRT Sequencing (all structural variants, even very long elements such as repeat expansions).
Despite recent progress, at least half of known rare disease cases have not been solved with short-read sequencing or older techniques. That’s why so many scientists are now embracing SMRT Sequencing as a new application in the field, with early successes showing the tremendous promise of this approach. “In just the past few years, researchers have used SMRT whole genome sequencing to solve previously intractable rare diseases — and other significant efforts are now underway,” Hickey writes.
The article recaps some of those successes, such as Euan Ashley’s (@eaunashley) Carney complex work and Naomichi Matsumoto’s pathogenic deletion in cases of myoclonic epilepsy. It also provides a look at some large-scale efforts applying PacBio sequencing to rare disease studies, including SOLVE-RD and CSER.
“Large-scale programs like these should contribute a significant amount of new knowledge about the genetic mechanisms underlying rare disease, filling in many of the gaps in our understanding today,” Hickey concludes. As SMRT Sequencing “helps to explain more rare diseases and increase overall diagnostic yield, it should have a profound effect on our ability to diagnose, understand, and ultimately improve treatment for rare disease cases.”
To hear the latest in rare disease research, register to watch our on-demand webinar: Increasing Solve Rates for Rare and Mendelian Diseases with Long-read Sequencing.
It’s been more than a year since we introduced HiFi sequencing to generate highly accurate long reads. In that time, we’ve seen many PacBio users make HiFi sequencing their go-to setting because it’s simple, reliable, and cost-effective. For scientists who have yet to generate their own HiFi data, we thought it might be helpful to publish a few data sets for exploration and analysis.
In a new preprint, we have released HiFi data sets for five samples: mouse, frog, maize, strawberry, and a mock metagenome community. We like to think there’s a data set for everyone here, whatever your research area of interest! Working with any of these HiFi read collections should offer a great introduction to this sequencing mode and show you why we often hear how easy it is to analyze HiFi data compared to traditional long reads.
Consistent with previous reports, the HiFi data generated for these five organisms yielded excellent accuracy, with average read qualities ranging from 99.84% to 99.97%. With that kind of accuracy we look forward to seeing what interesting biology our collaborators find within this data.
Table: the five HiFi data sets generated on the Sequel II System, listing for each organism its SRA accession, HiFi data yield (Gb), average read length (kb), and average read quality.
In addition to letting scientists get a fresh look at HiFi data, we hope this release will encourage development of new applications and software for the benefit of the entire sequencing community. New and improved tools for assembling polyploid genomes or calling variants in non-model organisms are just a couple of areas we hope to see grow.
For those of you who want to use existing software to explore these datasets, here are some tools that we find useful for working with HiFi reads:
- Assembly: FALCON, hifiasm, HiCanu (Check out this in-depth overview of HiFi assemblers from @Magdoll)
- Variant detection: DeepVariant for SNVs and small indels (<50 bp) and pbsv for SVs (≥50 bp)
- Metagenomics: Canu for assembly, FragGeneScan for gene prediction, and MEGAN for taxonomic and functional profiling
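The 50 bp size convention in the list above (small variants to DeepVariant, structural variants to pbsv) can be made concrete with a tiny router. This is only a sketch; `classify_variant` is a hypothetical helper, and real callers work from alignments rather than bare allele strings.

```python
def classify_variant(ref, alt):
    """Route a called variant by size, mirroring the 50 bp convention:
    SNVs and indels under 50 bp fall in a small-variant caller's domain,
    while events of 50 bp or more are structural variants."""
    size_change = abs(len(alt) - len(ref))
    if size_change == 0:
        return "SNV" if len(ref) == 1 else "MNV"
    return "indel" if size_change < 50 else "SV"

print(classify_variant("A", "G"))            # SNV
print(classify_variant("ACGT", "A"))         # indel (3 bp deletion)
print(classify_variant("A", "A" + "T" * 60)) # SV (60 bp insertion)
```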
For this data release, we’d like to thank all of the collaborators who helped to generate and present these results: Jane Landolin (@jlandolin), Nicholas Maurer, David Kudrna, Michael Hardigan, Cynthia Steiner, Steven Knapp (@knapp1955), Doreen Ware, and Beth Shapiro (@bonesandbugs).
And congratulations to the PacBio team members who led the charge on this effort: Ting Hon, Kristin Mars, Greg Young (@PacbioGreg), Yu-Chih Tsai, Joseph Karalius (@JoeyKaralius), Paul Peluso, and David Rank.
What is a Pangenome?
Unless you have an identical twin, no other person has a genome that is identical to yours. The same is true for other animal, plant, and microbial species that reproduce sexually: the genomes of individuals are unique. Less well known, but equally true, is that individual members of a species do not always share even the exact same genes. Nevertheless, scientists mostly use a single reference genome to represent an entire species: one human genome, one maize genome, one Staphylococcus aureus genome.
The Coining of the “Pangenome”
Around 2005, geneticists started to explore the concept of the pangenome, originally defined as the entire set of genes possessed by all members of a particular species and then extended to refer to a collection of all the DNA sequences that occur in a species.
It started with bacteria, as many things do. Genomic activity like recombination, mobile genetic elements, and horizontal gene transfer were clearly contributing to individual diversity across the bacterial domain. Some scientists discovered dozens, if not hundreds, of unknown genes when they sequenced new strains.
In 2007, MIT microbiologist Sallie Chisholm (@ChisholmLab_MIT) set out to determine the extent of genetic variation in the marine cyanobacterium Prochlorococcus. Each strain contains approximately 2,000 genes, and Chisholm estimated that a pangenome for Prochlorococcus would be around 6,000 genes, based on an initial set of 12 genome sequences. Eight years later, with 45 strains sequenced, she revised that estimate up to at least 80,000 genes—around four times the number of genes in the human genome—with the core genome for the species comprising only about 1,000 genes, or less than 2 percent of the total gene pool.
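The core-versus-pangenome arithmetic in this example reduces to simple set operations over per-strain gene sets. Here is a minimal sketch with toy data (the function name `pangenome_summary` and the strain names are hypothetical):

```python
def pangenome_summary(strain_genes):
    """Pangenome = union of genes across all strains; core genome = genes
    present in every strain. `strain_genes` maps strain name -> gene set."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    return set().union(*gene_sets), set.intersection(*gene_sets)

# Three toy strains: five genes in total, only one shared by all.
pan, core = pangenome_summary({
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD"},
    "strain3": {"geneA", "geneE"},
})
print(len(pan), sorted(core))  # the pangenome dwarfs the core
```

As with Prochlorococcus, each newly sequenced strain can only grow the pangenome (the union) while the core (the intersection) can only shrink.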
“That’s a lot of information shaping that collective,” Chisholm told The Scientist. “[The pangenome view] changes the way you think about what an organism is.”
Why is it Important to Capture the Full Range of Genetic Diversity?
Those looking to create vaccines need to understand the genomic variation and versatility of disease-causing microbes, especially if they are hoping to develop universal vaccines that could provide protection against more than one strain in a species.
Those studying adaptation to climate change would benefit from a comparison of genes absent or in abundance within species found in different geographic locations and/or environmental conditions. In crop plants, differences in variable genes could have implications on disease resistance, metabolite production, and stress responses.
And with differences in gene number increasingly being associated with disorders including autism, Parkinson’s and Alzheimer’s diseases there are strong medical justifications for taking a more variation-centric view of the human species. Variants cannot be identified within regions completely missing from the reference sequence, many of which have been found to be more common than previously thought.
What is Being Done to Generate Pangenomes?
To answer this question, we sat down with a few scientists to talk about the era of the pangenome and what’s to come.
One particularly important crop that has haunted geneticists and breeders for years is maize. It is challenging to sequence because the vast majority of its 2.3 Gb genome, a staggering 85 percent, is made up of highly repetitive transposable elements. Maize is also incredibly diverse in its DNA makeup. As an example, a study comparing genome segments from two inbred lines revealed that half of the sequence and one-third of the gene content was not shared – that’s much more diversity within the species than between humans and chimpanzees, which exhibit around 94 percent sequence similarity.
“The whole notion of a single reference genome for crop plants is an antiquated concept borne out of necessity from the technological limitations of the past. Now with the capability to rapidly generate high-quality references for even the largest crop genomes, we can readily access the full complement of sequence diversity and structural variation within a crop,” says Kevin Fengler, Comparative Genomics Lead at Corteva Agriscience.
So the field was delighted when a collective of 33 scientists released a 26-line maize pangenome reference collection earlier this year. The collection was created using PacBio sequencing, and includes comprehensive, high-quality assemblies of 26 inbreds known as the NAM founder lines. These include the most extensively researched maize lines that represent a broad cross section of modern maize diversity, as well as an additional line containing an abnormal chromosome 10.
It turns out it’s not just maize biology that can be informed by pangenomes. “The high level of diversity in maize is well known, but we see a lot of diversity and structural variation underlying traits of interest in all the crop plants we work on. Creating the first reference genome for a crop genome is a great first step, but things get really interesting as you begin to add more genomes and a more comprehensive view emerges,” adds Fengler.
As for our own species, the current reference genome (GRCh38) – an update of the genome produced by the international Human Genome Project in 2000 and based mostly on DNA from one person – has been added to and annotated through the years, but is still an incomplete sequence and woefully inadequate as a representation of human diversity and genetic variation. Scientists estimate that up to 40 megabases of sequence, including protein-coding regions, are absent from the reference genome.
Several studies using PacBio long reads have reported an average of ~20,000 structural variants (SVs) per human genome, most of which fall within repetitive elements and segmental duplications. Furthermore, the current reference does not represent the diploid structure of human genomes. Rather, it is an arbitrary linear combination of different haplotypes, or a mosaic of multiple individuals.
Several groups have undertaken efforts to ensure certain populations are better represented in genomic databases, from Sweden to Tibet to Japan. Check out an interactive map of human genomes generated with PacBio sequencing.
When asked the value a pangenome could bring to human research, Fritz Sedlazeck (@sedlazeck), Assistant Professor at Baylor College of Medicine, said, “the pangenome has the potential to represent the diversity of the human population or any species. This eases the re-identification of complex alleles or even haplotypes.”
And it seems the National Human Genome Research Institute agrees, recently committing $30 million towards the creation of a new human pangenome based on high-quality sequencing of 350 individuals from across the human population, to capture all genomic variation observed in human populations.
“One human genome cannot represent all of humanity. The human pangenome reference will be a key step forward for biomedical research and personalized medicine. Not only will we have 350 genomes representing human diversity, they will be vastly higher quality than previous genome sequences,” said David Haussler, director of the University of California Santa Cruz Genomics Institute, which is leading the project.
How to Generate a Pangenome?
So, what are the most important things to keep in mind when creating a pangenome reference?
First, Fengler says that being able to be confident in your results is really important. “Ideally, all of the references in the pangenome collection will be built with a similar recipe to enable direct comparisons without artifacts from different technologies.” This points to the need for a reliable technology that can be used to generate equivalent quality genomes for many samples with little variability.
Second, the data must be high quality. When asked the importance of long reads to pangenome efforts, Sedlazeck said, “they will be important to distinguish between different alleles/paths in the graph and to characterize novel mutations. Thus, being able to cope with graphs that encode a much higher number of variations to better represent the population.” Along those lines, Fengler adds, “the approach for assembly needs to be robust and accurate such that mis-assembly and sequence errors are not interpreted as structural variation and sequence diversity.”
Lastly, cost and speed have to be taken into account. With the high accuracy of HiFi sequencing, only 10- to 15-fold coverage per haplotype is needed for a high-quality resulting genome assembly, and the analysis time can be cut in half.
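As a back-of-envelope planning aid (our own simplification, not an official PacBio formula), the 10- to 15-fold-per-haplotype guideline translates into total sequencing yield like this:

```python
def required_yield_gb(haploid_size_gb, coverage_per_haplotype, ploidy=2):
    """Rough sequencing yield needed: haploid genome size times per-haplotype
    coverage times ploidy. A planning estimate only; it ignores yield lost
    to filtering and uneven coverage."""
    return haploid_size_gb * coverage_per_haplotype * ploidy

# A 2.3 Gb diploid maize genome at 15-fold coverage per haplotype:
print(required_yield_gb(2.3, 15))  # 69.0 (Gb)
```

Scaling the same estimate across dozens of lines is what makes per-sample cost and turnaround the deciding factors for pangenome projects.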
“Now researchers no longer need to wait for actionable sequence data,” says Fengler. “For maize, we can generate a high-quality reference genome the same day that the sequencing finishes.”
What’s Next in Pangenomes?
As pangenome collections grow, scientists have to tackle questions around how to represent a pangenome. “Which variations should be included into a pangenome? Is it all of them? Then you lose specificity in regions. Is it only the common variations? Then you have a problem with disease-causing variations and other complex regions like HLA,” asks Sedlazeck, highlighting the continued work that needs to be done.
In addition, tackling things like annotation, visualization, and relationship management are on Fengler’s mind. “A variety of new pangenome analysis and visualization tools are needed to fully realize the value of having a pangenome collection for each crop.”
And then we have to move into functional and translational analysis. Scientists need to be able to take their newfound understanding of variation at the genome level and see how it impacts phenotypes, and whether the variation can be introduced artificially to influence agronomic traits, for instance.
One thing is for sure, the pangenome era is upon us, and whether you need a pangenome to understand important traits or you build tools to interpret those traits, there will be plenty to work on in the coming years!
Explore other posts in this series:
Will the next big cancer breakthrough be in immunotherapy? Therapeutic modification of the tumor microenvironment or microbiome? Or early detection and screening?
Whatever the result, long-read sequencing technology can play a pivotal part in the discovery process, according to Meredith Ashby, PacBio’s director of Market Strategy for Microbial Genomics, Cancer and Immunology.
In a recent article for Lab Compare, Ashby highlighted some of the ways Single Molecule, Real-Time (SMRT) Sequencing has given researchers a deeper understanding of tumors at the genomic and transcriptomic level.
By spanning very large structural variants in single reads, SMRT Sequencing can provide clarification in cases where variants may be acting in concert to affect treatment response. Ashby describes recent work where scientists at Memorial Sloan Kettering Cancer Center and other institutes used SMRT Sequencing to explore why certain patients are “super responders” to alpelisib, a targeted PI3Kα inhibitor. The ability to phase all variants along PIK3CA transcripts revealed that patients with distal mutations in cis showed remarkably improved response to therapy, as compared to those who had only a single mutation, or whose mutations were present in trans.
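The cis/trans distinction described above can be sketched with a toy example (hypothetical positions and reads, not the published pipeline): two mutations are in cis if they co-occur on the same full-length reads, and in trans if each molecule carries only one of them.

```python
from collections import Counter

def classify_phase(reads, pos_a, pos_b, alt_a, alt_b):
    """Classify two variants as cis or trans from full-length reads.

    reads: sequences that all span both positions.
    Returns 'cis' if the two alternate alleles mostly co-occur on the
    same molecules, 'trans' if they mostly appear on different ones.
    """
    counts = Counter()
    for r in reads:
        counts[(r[pos_a] == alt_a, r[pos_b] == alt_b)] += 1
    cis = counts[(True, True)]
    trans = counts[(True, False)] + counts[(False, True)]
    return "cis" if cis > trans else "trans"

# Toy reads: both alternate alleles ('G' at 1, 'T' at 3) ride on the same molecules
reads = ["AGCTA", "AGCTA", "AACGA", "AACGA", "AGCTA"]
print(classify_phase(reads, 1, 3, "G", "T"))  # cis
```

Real analyses work from aligned reads and filter for sequencing error, but the core logic of single-molecule phasing is this co-occurrence test.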
Long reads can also help determine cancer risk status involving ‘hard to sequence’ genes which have highly homologous, inactive pseudogenes. For example, long reads can distinguish SNVs, indels and larger rearrangements in PMS2 from those in the inactive pseudogene, PMS2CL.
Finally, generating full-length transcripts via the Iso-Seq method can disambiguate isoforms to provide insights into cancer biology or serve as better biomarkers. Long reads can reveal cryptic exons, retained introns, and other splicing changes that are often cancer-specific and therefore may be missing from the gene models typically used to aid short-read transcript assembly, Ashby noted. Ashby cited an example where targeted SMRT Sequencing of androgen receptor (AR) isoforms revealed that the structure of AR-V9 was previously mischaracterized, and that the corrected isoform information could be used to improve the prediction of drug resistance in prostate cancer.
“While a decade-plus of short-read data has produced truly exciting information for the cancer research community, there is much more to be learned simply by looking through a different lens,” she said.
The COVID-19 pandemic has brought a sudden urgency to virus research and led many of us to dig more deeply into all the tools available for characterizing viral genomes, from RT-PCR to DNA sequencing. For all their outsized impact on human health, viruses have remarkably small and simple genomes, some just a few thousand bases in length, and most lacking any repetitive structures. With such tidy genomes, you may wonder, why would scientists want to sequence them with a long-read technology like PacBio HiFi reads?
While it is true that most viral genomes do not require long reads for assembly, viruses exist as populations within infected hosts, and long reads are a powerful tool for fully characterizing these populations. Depending on the mutation rate of the virus, the population structure can have one dominant variant with only a very small proportion of rare variants, or it can be comprised of a highly diverse set of closely-related variants, called a quasispecies.
Highly accurate, single-molecule sequencing allows researchers to fully characterize all the variants within a viral population, as opposed to just the dominant variant. Oftentimes this more detailed view of a complex population of viruses reveals important aspects of biology, including how a viral infection evolves over time or in response to therapeutics.
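Conceptually, because each HiFi read is a single molecule, the composition of a viral population can be estimated by tallying distinct full-length read haplotypes. A minimal sketch (toy sequences; real workflows align, trim, and error-filter first):

```python
from collections import Counter

def population_composition(reads):
    """Estimate variant frequencies by tallying identical full-length reads.

    Toy version: assumes reads are already trimmed to the same region,
    so identical molecules yield identical strings.
    """
    counts = Counter(reads)
    total = len(reads)
    return {seq: count / total for seq, count in counts.most_common()}

reads = ["ACGT"] * 8 + ["ACTT"] * 2   # a dominant variant plus a 20% minor variant
print(population_composition(reads))  # {'ACGT': 0.8, 'ACTT': 0.2}
```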
Here are some examples of how scientists have used SMRT Sequencing to explore viral genomes, and the highlights of their work.
Influenza: While the influenza virus continues to evade efforts to produce a universal vaccine, HiFi sequencing has given scientists a clearer picture of the dynamics of flu virus evolution. In one study, researchers sequenced multiple samples from a single patient with a two-year recurrent infection and revealed in great detail how the virus adapted in response to flu treatments. In another population-scale study tracking a flu pandemic in Hong Kong, PacBio sequencing revealed that transmission was enabled in part by minor strain variants.
In addition, scientists have used long reads to analyze large deletions in the influenza genome that characterize viruses incapable of replicating. Most recently, a study combining PacBio long reads and single-cell sequencing gave a comprehensively detailed view of how influenza mutations evolved over the course of a typical infection, revealing both point mutations and indels.
Hepatitis C Virus: Among the most significant challenges in HCV treatment is drug resistance. To better understand how resistance arises, scientists have deployed SMRT Sequencing to study HCV evolution in individuals who failed to respond to antiviral therapy. By using HiFi sequencing to obtain single reads encompassing entire clones, they were able to detail how multidrug-resistant variants arose from low-abundance, drug-resistant clones present at baseline.
In another example, researchers generated long-read data to produce full-length HCV envelope sequences, which allowed them to track the transmission path for a sexually transmitted cluster of HCV infections. They also reported how viral genetic diversity changed over the course of an infection, appearing low during the acute stage but increasing over time.
HIV: An ongoing area of research in HIV is how the virus evolves and persists in patients on long-term antiretroviral therapy (ART). With SMRT Sequencing, scientists have generated full-length viral genome sequences from proviruses to study what proportion of the latent reservoir is replication competent, and what types of mutations are favored in this reservoir under ART. A separate effort involved analyzing proviral sequences in the brain and other tissues to understand HIV-associated dementia. HiFi reads allowed the researchers to create a detailed phylogenetic tree of all the variants within an individual, and revealed that variants in the brain were distinct in important ways and absent from other parts of the body.
SARS-CoV-2: Recently, researchers at Mt. Sinai published a study using genetic drift in the SARS-CoV-2 virus to determine when and how the virus arrived in New York City. They found that the virus had arrived from Europe and the West Coast multiple times, though not from China, and it arrived significantly earlier than recognized. Other researchers have developed multiple long-amplicon protocols for targeted sequencing of the novel coronavirus, which may enable unique insights into the biology of SARS-CoV-2 as the pandemic, and our understanding of it, evolves.
To learn more about how to sequence viral genomes, explore our COVID-19 sequencing tools and resources or review our sample prep and analysis workflows for resolving viral populations.
Learn how you can use HiFi reads to enable your viral research, from understanding viral genomes to the host immune response.
Explore other posts in this series:
California redwoods: Not only are they giants in height and age (up to 379 feet high, 29 feet round, and thousands of years old), but the famous towering trees are also derived from a massive 27 Gb genome.
Seeking a sequencing challenge for the Sequel II System, we picked the California redwood, or Sequoia sempervirens as it’s known to scientists. There also happened to be several fine specimens at nearby Stanford University.
A small crew of PacBio scientists — Emily Hatas (@EmilyHatas), Greg Young (@PacbioGreg), and Michelle Vierra (@the_mvierra) — headed to campus to acquire samples equipped with ice, scissors, and a kitchen scale. DNA was isolated (using the Circulomics Plant Nuclei kit), a HiFi library was created, and sequencing got underway. In just seven days, the team achieved 22-fold coverage of the genome (606 Gb of HiFi data). Six days after that, Greg Concepcion (@phototrophic) generated a partially haplotype-resolved genome assembly almost twice the expected genome size with a contig N50 of 1.92 Mb.
“The results were amazing,” Vierra said. “We are very pleased to see the improvements that this genome assembly represents over other recent conifer genomes.”
The massive genome was put together in just 17 days — 4 days of sample prep, 7 days of sequencing, and 6 days for assembly, and was detailed in a Medium post by Vierra.
But the team wasn’t done yet. As a general recommendation, 10- to 15-fold coverage in HiFi reads is the ideal range to yield a genome that measures up favorably on the 3 C’s of genome quality (contiguity, completeness, and correctness).
For genomes of the California redwood’s size, it may not be economical or feasible to obtain that much coverage in a limited timeframe, which is why the team opted to see what a reasonable ~20-fold coverage could generate. However, with a little more time on their hands, and enough HiFi library to go around, the team embarked on more sequencing to bring the total to 875 Gb of HiFi data representing 33-fold coverage of the genome.
Not surprisingly, more of a good thing (HiFi reads) made an even better genome assembly. The contiguity improved significantly with a contig N50 of 3.8 Mb, and completeness increased, with a BUSCO score of almost 61% complete.
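Contig N50, the statistic quoted here, is the length at which contigs of that size or longer account for at least half of the total assembly. A standard computation looks like this (a generic sketch, not assembler code):

```python
def n50(contig_lengths):
    """Return the N50: the smallest contig length such that contigs at
    least this long sum to >= half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([10, 8, 6, 4, 2]))  # 8  (10 + 8 = 18 >= 30 / 2)
```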
Overall, what would have been considered a herculean effort not that many years ago was accomplished in only a few weeks by a handful of personnel in their spare time. It’s our hope that with the increasing adoption of PacBio HiFi reads we will continue to see massive improvements in the assembly of all genomes, including large and increasingly complex polyploid plants.
The additional data for the redwood genome has been made publicly available along with the updated genome assembly and can be found here. We hope it will be a useful tool for conifer researchers everywhere!
Comparison of California redwood genome assembly results: 22-fold HiFi, 33-fold HiFi, and a hybrid ONT assembly of redwood. Using a transcript set of Abies alba from Neale et al., 4,958 transcripts mapped to the 22-fold HiFi assembly, 4,970 to the 33-fold HiFi assembly, and 4,760 to the ONT assembly. The 33-fold HiFi assembly was run on 80 cores with an updated version of Hifiasm (0.3.0).
Herculean efforts are being made by scientists around the world to respond quickly to the COVID-19 crisis in a race to understand the virus causing the pandemic and develop diagnostics, vaccines, and therapeutics. But many research questions remain. How can long-read SMRT Sequencing technology help fill the gaps?
PacBio microbiology expert Meredith Ashby highlighted several opportunities to support coronavirus research in a recent webinar as part of a day-long virtual conference hosted by LabRoots.
Sequencing the viral genome
Understanding the basic biology of the virus is essential, and the more detailed our investigation, the better.
Highly accurate, long-read sequencing of viral genomes has for years been used by virus researchers to access information that is challenging to resolve with short-read data. Mutation phasing and the detection of rare variants are both greatly simplified when entire viral genes or genomes can be sequenced in a single molecule, Ashby said.
This phasing information allows researchers to understand if there are mutational hotspots or regions of stability, both of which are important to understanding how to design a vaccine or therapeutic. Better phasing also allows researchers to understand how and how fast the virus evolves, either within one person during the course of an infection, or over time during the course of an outbreak or epidemic. In the context of the current coronavirus pandemic, Ashby said, having high-quality sequencing information can help track whether the virus evolves in such a way that test quality is compromised, as well as monitor whether drug resistance arises once therapeutics are developed.
To illustrate these points, Ashby discussed several examples of researchers using PacBio technology to gain unique insights into the biology of other viruses, including HIV and influenza. In one example from HIV research, scientists were able to determine that two distinct strains of the env gene had arisen in a single patient, and that one strain was compartmentalized in brain tissues. In another example from influenza research, combining PacBio and single-cell sequencing gave a comprehensively detailed view of how viral mutations evolved over the course of an infection, revealing both point mutations and indels.
Why do some people get so sick while others do not?
The answer to this question likely lies within the complex workings of our immune systems, which are still not completely understood. Ashby discussed how many of the regions of the genome that are involved in the immune response are highly polymorphic, and in the past have been quite challenging to sequence and to phase. These include the immunoglobulin heavy-chain (IGH) locus and the human leukocyte antigen (HLA) genes. In addition, long reads also offer several advantages for sequencing B cell receptor (BCR) repertoires.
BCR sequencing can help us identify broadly neutralizing antibodies in recovered patients, which could then be used as interim treatments ahead of the development of a vaccine. Ashby described how using a long-read approach allows sequencing of the entire BCR transcript, including those regions outside the CDR3 domain where mutations can arise during somatic hypermutation.
In addition, full-length sequencing allows for a simplified PCR reaction, which can avoid known issues with primer bias in the commonly used highly multiplexed short-amplicon PCR strategy. Finally, Ashby discussed how one common motivation for sequencing the immune repertoire, inferring the germline IGH locus, can instead be accomplished by direct sequencing with PacBio long reads.
The ability to delve deeply into germline sequences and examine allelic differences from person to person could shed light on how the genetic background of patients may influence disease susceptibility, progression and outcomes.
Watch the webinar for examples of relevant research and more detailed application uses:
Interested in applying SMRT Sequencing in your COVID-19 research? Protocols, assay development recommendations, primer sources, workflows and checklists are available on the continually updated COVID-19 Sequencing Tools and Resources page.
Need to find a sequencing center? Connect with us and we’ll put you in contact with the closest service provider that is operational during the pandemic.
Our team is proud to announce that PacBio has been working closely with customers to help in the fight against the COVID-19 pandemic. Scientists in commercial, academic, and government research teams are using highly accurate SMRT Sequencing data to resolve variants of the SARS-CoV-2 virus that exist within one individual or across a population of patients, which is critical to developing and maintaining effective diagnostics, vaccines, and therapeutics.
Many of these efforts are powered by our HiFi reads, which are both long and highly accurate. Such reads are well-suited for applications like viral sequencing, which requires the ability to distinguish variants that may differ by only a handful of single nucleotide variants (SNVs) within a viral gene or across an entire viral genome.
Here’s a look at how some research teams are deploying PacBio sequencing for their coronavirus investigations:
- LabCorp, which is actively supporting the response to COVID-19 in the United States and globally, will work closely with PacBio to sequence a large number of SARS-CoV-2 viruses from de-identified positive samples. LabCorp’s scientific teams will use this information to shed light on virus evolution, mutations found in different geographic regions, and implications for disease severity and outcomes, helping to support more informed patient treatment decisions.
“As we strive to rapidly learn as much as possible about the biology of this novel coronavirus to help deal with the current pandemic and also to look ahead to future outbreaks, SMRT Sequencing will give us an accurate, high-resolution view of the pathogen.”
Marcia Eisenberg, PhD, Chief Scientific Officer of LabCorp Diagnostics
- Scientists at the Vaccine Research Center at the National Institute of Allergy and Infectious Diseases are planning to use the Sequel II System to study virus population diversity and minor variants in samples collected from infected individuals. The information could ultimately be used to support the design of effective vaccines and antibody-based therapies.
- At the University of California, San Diego, scientists are using SMRT Sequencing data to analyze SARS-CoV-2 samples. They will utilize targeted sequencing data to study the viral genome as well as shotgun metagenomics to characterize the microbiome of nasal tissues responding to a COVID-19 infection.
“We anticipate that the insights we will gain from HiFi sequencing on the Sequel II System will contribute significantly to our knowledge about SARS-CoV-2 and how it operates in people.”
Rob Knight, PhD, Director of the Center for Microbiome Innovation at UC San Diego
- At the Research Center Borstel, a member of the German Leibniz Association, scientists who focus on lung diseases will be sequencing SARS-CoV-2 samples and other lung pathogens collected from routine diagnostic samples to foster genomic diagnostic applications and study their spread and evolution.
- Researchers at the Vanderbilt Vaccine Center have used PacBio sequencing technology to study the human B-cell response to the SARS-CoV-2 virus, with the goal of identifying therapeutics or protective antibodies from patient samples.
“We are proud to support the rapidly expanding group of our customers who are engaged in this essential work and believe that the unique nature of SMRT Sequencing will allow them to delve into virus biology and host response research in a way that directly supports the development of much needed diagnostic tests, vaccines and medicines for managing COVID-19,” said Jonas Korlach, PhD, Chief Scientific Officer, PacBio.
Visit our COVID-19 Sequencing Resource Center to review the latest protocols, primer sets, and relevant publications.
We’re pleased to release a short video describing PacBio Sequencing and our latest platform, the Sequel II System. If you’ve ever wondered how Single Molecule, Real-Time (SMRT) Sequencing works, what the Sequel II System is, and what applications are available, this video is a great place to start.
We are excited to share the capabilities of our Sequel II System as it makes SMRT Sequencing affordable for scientists in any lab and provides comprehensive views of genomes, transcriptomes, or epigenomes. The Sequel II System also produces highly accurate long reads, known as HiFi reads, to deliver the highest quality sequencing data.
The three-minute introductory video outlines the key advantages of SMRT Sequencing:
- Long reads allow you to readily assemble complete genomes and sequence full-length transcripts
- High accuracy provides over 99.99% accurate sequencing results
- Uniform coverage enables sequencing through regions inaccessible to other technologies
- Single-molecule resolution lets you capture sequence data with over 99% single-molecule accuracy
- Epigenetics that can be explored through direct detection of base modifications during sequencing
Watch the video for more information about the science behind SMRT Sequencing, from the difference between circular consensus sequencing (CCS), used to generate HiFi reads, and continuous long read (CLR) sequencing mode, to how that difference matters for applications such as whole genome sequencing for de novo assembly, variant detection, and RNA sequencing.
If this introductory video piqued your interest, you can learn more about the Sequel II System, the advantages of SMRT Sequencing, or explore our resource library to learn how scientists worldwide are using SMRT Sequencing to advance their science.
Explore other posts in this series:
In a new preprint, scientists from the National Human Genome Research Institute, the University of Washington, and other institutions describe HiCanu, a modified version of the Canu assembler designed specifically for PacBio HiFi reads. The team put the new assembler through its paces, reporting that it significantly outperformed traditional assembly methods — even getting through centromeres, segmental duplications, and other notoriously difficult regions.
As lead authors Sergey Nurk (@sergeynurk) and Brian P. Walenz, corresponding authors Sergey Koren (@sergekoren) and Adam Phillippy (@aphillippy), and collaborators report, “HiFi is a major leap forward in terms of long-read read accuracy.” They add, “As the accuracy of other long-read technologies have not exceeded 95%, the median accuracy of current HiFi reads can exceed 99.9% (>Q30), making them a promising data type for separating highly similar repeat instances and alleles.”
HiCanu applies homopolymer compression, overlap-based error correction, and tandem repeat masking to eliminate the few remaining errors in HiFi reads, resulting in 97% of reads matching perfectly to a curated reference sequence. This near-perfect accuracy helps to distinguish high-identity genomic repeats, as differences in HiFi reads can be trusted to be biological and not sequencing errors.
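Homopolymer compression, the first of those steps, simply collapses each run of identical bases to a single base, so that homopolymer-length errors (the dominant residual error mode in HiFi reads) no longer produce spurious read-to-read differences. A minimal version:

```python
from itertools import groupby

def homopolymer_compress(seq):
    """Collapse runs of identical bases: 'AAACGGT' -> 'ACGT'."""
    return "".join(base for base, _ in groupby(seq))

# Two reads differing only in a homopolymer length become identical:
print(homopolymer_compress("GATTTACA"))  # GATACA
print(homopolymer_compress("GATTACA"))   # GATACA
```

HiCanu's actual implementation also tracks the original coordinates so the final consensus can restore the uncompressed sequence; this sketch shows only the compression itself.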
The new assembler generated draft assemblies of Drosophila and several human genomes. The HiCanu assemblies were all highly contiguous and extremely accurate. “On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity,” the scientists write. The reported difference in accuracy is especially large: the HiCanu assembly has 831× fewer errors than the assembly of ultra-long Oxford Nanopore reads.
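The QV figures quoted here follow the Phred convention, QV = −10 · log10(error rate), so 99.999% per-base accuracy (an error rate of 1e-5) corresponds to QV50. A quick conversion:

```python
import math

def accuracy_to_qv(accuracy):
    """Phred-scaled quality: QV = -10 * log10(1 - accuracy)."""
    return -10 * math.log10(1 - accuracy)

print(round(accuracy_to_qv(0.999)))    # 30  (HiFi read-level accuracy, >Q30)
print(round(accuracy_to_qv(0.99999)))  # 50  (QV50 consensus accuracy)
```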
The team zoomed in on certain regions known to be challenging — including centromeres, segmental duplications, and the MHC locus. For CHM13, the scientists report, “This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions.”
HiCanu also deftly handles haplotype phasing, with the authors stating that “HiCanu consistently recovers both haplotypes for the six canonical MHC typing genes in the human genome.”
The authors report several other advantages of HiCanu. First, assemblies generated by HiCanu do not require polishing. In fact, the authors “discourage polishing HiCanu HiFi assemblies, because… polishing pipelines may map reads back to the wrong repeat copies and actually introduce errors.” Second, HiCanu is computationally efficient: “The number of CPU hours required for assembly of a human genome is under 4,000, which could be completed on any modern cloud platform in less than a day for a few hundred dollars,” the team reports. “This is 30-fold less than recent Oxford Nanopore assemblies that required more than 100,000 CPU [hours].”
“We have demonstrated that HiCanu is capable of generating the most accurate and complete human genome assemblies to date,” the scientists write, pointing out that HiCanu could also be applied to non-human genomes, including metagenomic samples. “These results represent a significant advance towards the complete assembly of human genomes.”
Welcome to the Sequencing 101 blog series – where we will provide introductions to sequencing technology, genomics, and much more!
If you’re not immersed in the field of DNA sequencing, it can be challenging to keep up with the rapid evolution among all the platforms and technologies on the market. Let’s start with a quick overview of how these different technologies came about — and how each is used today.
First Generation Sequencing – Starting the Era of Genomics
DNA sequencing as we know it originated in the late 1970s, when Frederick Sanger at the MRC Centre in Cambridge developed a gel-based method that combined a DNA polymerase with a mixture of standard and chain-terminating nucleotides, known as ddNTPs. Mixing dNTPs with ddNTPs causes random early termination during primer extension. Four reactions are run, each with the chain-terminating version of only one base (A, T, G or C). When visualized with gel electrophoresis, one reaction per lane, the fragments are sorted by length, allowing the DNA sequence to be read off base by base. This technique was revolutionary at the time, enabling sequencing of 500-1,000 bp fragments. However, since the original method used radioactive ddNTPs and X-rays, it was less than ideal for widespread use.
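The chain-termination logic can be illustrated with a toy simulation (purely didactic, ignoring chemistry): each reaction yields fragments ending at every occurrence of its terminator base, and sorting all the fragments by length reads the sequence back out.

```python
def sanger_read(template):
    """Toy Sanger model: one terminated fragment ends at each position;
    sorting fragments by length recovers the template sequence."""
    # Each (length, terminating base) pair stands in for a band on the gel
    fragments = [(i + 1, base) for i, base in enumerate(template)]
    # Reading bands from shortest to longest gives the base calls in order
    return "".join(base for _, base in sorted(fragments))

print(sanger_read("GATTACA"))  # GATTACA
```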
By the 1980s, Sanger’s original method had been automated by scientists at Caltech and commercialized by Applied Biosystems. Radioactive ddNTPs were replaced with dye-labeled nucleotides, and large slab gels were replaced with fine, polymer-filled capillaries. Scientists could now simply feed prepared DNA into a machine and view the results of fluorescence-based reactions on an electropherogram. This technology, which was continuously improved over the years, served as the bedrock of the Human Genome Project. Today, automated Sanger sequencing is still in use, primarily in clinical labs where it is acceptable to have low throughput, higher per-sample costs, and sequencing reads 500-1,000 bp in length.
But even after the Human Genome Project, the cost of automated Sanger sequencing — also known as capillary electrophoresis — remained too high to enable the kind of large-scale sequencing projects envisioned by scientists. By the mid-2000s, remarkable efforts had been made to bring down the costs of sequencing. Driven largely by grants from the National Human Genome Research Institute (NHGRI), labs around the world tested out new methods for higher-throughput sequencing, using concepts as diverse as electronics, physics, and magnetics.
Second Generation Sequencing – Short Reads Become Fast and Efficient
One key player in the advent of next-generation sequencing (NGS) was a UK-based company called Solexa, which was later acquired by Illumina. The key innovation of the Illumina platform was ‘bridge amplification’, which allows the formation of dense clusters of amplified fragments across a silicon chip. Amplification of the original single molecule into a large cluster of many copies is what makes it possible to detect a fluorescent signal as dNTPs are added one at a time, as sequencing proceeds by synthesis. Over time, the number of clusters that could be read simultaneously grew tremendously, and Illumina instruments became the first commercially available massively parallel sequencing technology. Other tools developed around the same time, such as the Ion Torrent platform, became part of the NGS landscape as well. NGS platforms are the dominant type of sequencing technology used today. Their extreme capacity allows for sequencing at very low cost. They are limited, however, in read length; NGS platforms typically produce reads of ~50-500 bp in length. This makes them an excellent fit for resequencing projects, SNP calling, and targeted sequencing of very short amplicons.
Third Generation Sequencing – The Rise of Long Reads
However, short reads are not suitable for all sequencing projects. Another approach that was supported by the so-called $1,000 genome grants from NHGRI was Single Molecule, Real-Time (SMRT) Sequencing from PacBio. This technique uses miniaturized wells, known as zero-mode waveguides, in which a single polymerase incorporates labeled nucleotides and light emission is measured in real time. A different single-molecule approach to long-read sequencing, using pore-forming proteins and electrical detection, was adopted by Oxford Nanopore Technologies (ONT).
Watch this short video to learn how SMRT Sequencing works.
SMRT Sequencing has a number of advantages. Most notable, perhaps, is its ability to produce long reads — tens of thousands of bases long in a single read. These long reads make it possible to span large structural variants and challenging repetitive regions that confound short-read sequencers because their short snippets cannot be differentiated from each other during assembly. Another advantage is low GC bias, which allows PacBio Systems to sequence through extreme GC or AT regions that cannot be amplified during cluster generation on short-read platforms. A third advantage is the ability to detect DNA methylation while sequencing, since no amplification is done on the instrument.
As scientists began to work with SMRT Sequencing — sometimes known as third-generation sequencing — they realized that it had particular value for applications including de novo genome sequencing, phasing, detection of structural variants, epigenetic characterization, and sequencing of the transcriptome without the need for assembly. Technology improvements over time increased the throughput and accuracy of SMRT Sequencing platforms, bringing their costs in line with NGS platforms for many types of projects. Now, SMRT Sequencing has industry-leading accuracy thanks to its HiFi sequencing, and it is being used around the world to produce reference-grade genomes for microbes, plants, animals, and people.
Explore other posts in this series: