This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
In a new Science publication, researchers from the University of Washington and other institutions report detailed analyses revealing the adaptive importance of copy number variants (CNVs) acquired from Denisovan and Neanderthal ancestors, the closest relatives of modern humans, in the modern-day Melanesian population. The team used PacBio long-read sequencing to study these complex stretches of DNA and the Iso-Seq method to generate full-length transcript data.
“Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes” comes from lead author PingHsun Hsieh (@phhBenson), senior author Evan Eichler, and collaborators. For the project, they focused on the Melanesians, an oceanic population that is known to have more Denisovan and Neanderthal ancestry than other groups. This made an excellent foundation for studying the role of CNVs in adaptation and archaic introgression.
“Relatively little is known about the extent to which CNVs contribute to the genetic basis of local adaptation and, more importantly, whether CNVs introgressed from other hominins may have been targets of adaptive selection,” the authors write.
As part of this project, scientists focused on “two of the largest and most complex” CNVs found in the Melanesian genome — a 5 kb duplication and a 73.5 kb duplication, both on chromosome 16 — for a deeper investigation. “Both events are largely restricted to Melanesians and the Denisovan archaic genome F2 and are thought to be involved in a single >225-kb complex duplication (DUP16p12) introgressed from the Denisovan genome,” they report. “This region has been difficult to correctly sequence and assemble, and only recently has the sequence structure of the ancestral locus (>1.1 Mb) been correctly resolved.”
To better understand the original duplication, the team generated 75-fold whole-genome coverage of a Melanesian individual using SMRT Sequencing. This allowed them to narrow down the insertion location to a 200 kb region that is enriched in segmental duplication that “predisposes the region to recurrent structural rearrangements associated with autism and developmental delay,” Hsieh et al. write.
By applying the Segmental Duplication Assembler, a methodology recently published in Nature Methods, they obtained a 1.8 Mb contig that includes the correctly assembled Melanesian duplication. “Notably, the sequence-resolved assembly shows that the actual length of DUP16p12 duplication polymorphism is ~383 kb, which is longer than previously thought,” the authors report. “Sequence and phylogenetic analyses suggest that the variant originated from a series of complex structural changes involving duplication, deletion, and inversion events ~0.5 to 2.5 million years ago within the Denisovan ancestral lineage.” That duplication was inserted into the Denisovan genome within the last 200,000 to 500,000 years and subsequently introgressed into the ancestors of Melanesians between 60,000 and 170,000 years ago, the authors conclude.
The team performed Iso-Seq with hybridization capture probes toward this region to produce full-length gene models and better characterize the functional effects of CNVs in the Melanesian genome. Based on their results — including a comparison to gene models from other humans and the chimpanzee — the scientists found that the 383 kb duplication is likely adaptive. “This helps to explain why this polymorphism has become nearly fixed within the Melanesian populations (>80%) despite its large size, which is typically regarded as selectively disadvantageous,” they note. “Notably, the Melanesian-specific gene NPIPB shows ~3% amino acid divergence and evidence of positive selection despite its recent origin.” The scientists predict that the proximity of this duplication to a genomic region associated with autism (chr16p11.2) will have an impact on the frequency of autism-associated rearrangements in the Melanesian population.
Based on these results and other data confirming Neanderthal-origin CNVs in the Melanesian genome, the scientists were able to “reconstruct the structure and complex evolutionary history of these polymorphisms and show that both encode positively selected genes absent from most human populations,” they write. “This study highlights the substantial large-scale genetic variation that remains to be characterized in the human population and the need for development of additional reference genomes that better capture the diversity of our species and complete our understanding of human genes.”
We caught up with Hsieh at ASHG 2019, where he was presenting a poster on this research. He summarized the project by stating, “The high-quality, long-read sequencing data opens up an unprecedented venue to study variants in complex genomic regions. The ability to access these new variants helps us advance our understanding of the biology and evolution of our own species.”
A recent bioRxiv preprint reports efforts to sequence the genome of a Tibetan individual and detect the genetic underpinning of adaptive traits associated with tolerating high altitude. The authors used SMRT Sequencing to achieve extremely high contiguity and accuracy, and incorporated scaffolding and other complementary technologies to build a robust assembly.
The results are reported in the preprint, “De novo assembly of a Tibetan genome and identification of novel structural variants associated with high altitude adaptation.” Lead author Ouzhuluobu, senior author Bing Su, and collaborators discuss their evaluation of the new genome assembly as well as key findings from it. They chose to focus on a Tibetan person because of the population’s unique and long-term residence in “one of the most extreme environments on earth”: the Tibetan Plateau, at an average elevation exceeding 4.5 kilometers.
The team’s genome assembly, named “ZF1”, is the first for a Tibetan individual. Using the assembly, the scientists identified 6,500 structural variants that were not detected in two other long-read Asian genome assemblies. “[Genes near] ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway,” the authors report. In addition, they found a “163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese.” They note, “This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans.”
Previous studies had suggested that the Tibetan population may have more genomic content from archaic hominid species, such as the Denisovans, than other modern populations. “To take advantage of the de novo ZF1 assembly, we performed a genome-wide search of archaic sharing non-reference sequences (NRSs) and compared the results with the two de novo assembled Asian genomes (AK1 and HX1),” the authors report. “We found a total length of 39.6 Mb and 45.9 Mb sequences shared with those of Altai Neanderthal and Denisovan, corresponding to 1.32% and 1.53% of the entire ZF1 genome respectively. These archaic proportions are much higher than that in AK1 (0.82% and 0.70%) or HX1 (0.98% and 0.85%).” One of the archaic shared regions is a 662 bp insertion associated with improved lung function.
“The high-quality genome allows us to better understand the sequences showing population-level or individual-level specificity where they are different or even absent from the human reference genome,” the scientists write. “Our study demonstrates the value of constructing a high-resolution reference genome of representative populations (e.g. native highlanders) for understanding the genetic basis of human adaptation to extreme environments as well as for future clinical applications in hypoxia-related illness.”
A new review article nicely sums up the utility of long-read sequencing for solving rare diseases that cannot be explained by other methods. The paper, published in the Journal of Human Genetics, comes from authors Satomi Mitsuhashi and Naomichi Matsumoto at Yokohama City University in Japan.
The scientists note that long-read sequencing serves as a good complementary approach for cases that are not solved with short-read sequencing alone. “The approximate current diagnostic rate is <50% using [short-read whole exome and genome sequencing], and there remain many rare genetic diseases with unknown cause,” Mitsuhashi and Matsumoto write. “There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome [or are types of variants] that are difficult to sequence using conventional technologies.”
Many recent projects have used long-read sequencing technologies to discover pathogenic variants associated with rare disease. “The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases,” the scientists report.
The review discusses several particular types of disease-causing variants, including tandem repeats, structural variants, complex rearrangements, and transposable elements. In addition to citing studies that have used long-read sequencing to search for pathogenic variants, the scientists also consider why long reads make a difference for each situation. With tandem repeats, for example, they note that “long tandem repeats are difficult to analyze by Sanger sequencing,” and “long reads are a straightforward way to detect repeat changes because an adequately long read can encompass an entire expanded repeat as well as flanking unique sequences.”
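To illustrate why a spanning read makes repeat sizing straightforward, here is a minimal Python sketch. The flanking sequences, the repeat unit, the reads, and the function name `count_repeat_units` are all hypothetical and chosen only for illustration; they do not come from the review, and real analyses rely on alignment-based tools that tolerate sequencing errors.

```python
def count_repeat_units(read, left_flank, right_flank, unit):
    """Count copies of `unit` between two unique flanks in a single read.

    Returns None when the read does not contain both flanks, i.e. when it
    does not span the whole repeat and the size cannot be determined.
    Assumes error-free sequence (an illustrative simplification).
    """
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None
    repeat_region = read[start:end]
    # Count back-to-back copies of the unit from the start of the region.
    count = 0
    pos = 0
    while repeat_region.startswith(unit, pos):
        count += 1
        pos += len(unit)
    return count


# Hypothetical example: a CAG repeat expanded to 40 copies.
left, right = "ACGTTGCA", "TTGACCGT"
long_read = "GGG" + left + "CAG" * 40 + right + "AAA"
short_read = ("CAG" * 40)[:100]  # lies entirely inside the repeat

print(count_repeat_units(long_read, left, right, "CAG"))
print(count_repeat_units(short_read, left, right, "CAG"))
```

The long read contains both unique flanks, so the repeat can be counted directly. The short read sits entirely inside the repeat, so the function returns `None` because there is nothing to anchor the count.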
Mitsuhashi and Matsumoto also review studies in which researchers made use of the PacBio No-Amp targeted sequencing application to target a region of the genome using CRISPR instead of PCR amplification. In the studies, scientists “found this approach accurate” and obtained high-coverage HiFi reads of the targeted region.
Going forward, the authors suggest a workflow for solving rare disease cases: begin with short-read exome or whole genome sequencing for small variants, and if that does not yield an answer, move on to long-read sequencing for larger variants. “Long-read sequencing is especially highly recommended when repeat diseases or complex chromosomal rearrangements are suspected,” they conclude.
Matsumoto will be presenting his team’s research at our ASHG 2019 workshop on Wednesday, October 16, 2019. Register today to reserve your seat or to get the recording.
Learn more about Variant Detection with SMRT Sequencing.
When MRSA hits your hospital, what do you do?
If you’re located in Europe or other places where infection rates are still relatively low, you can take a seek-and-destroy approach, isolating an affected patient and working out in concentric circles to identify contacts and potential transmissions.
If you’re in New York City, however, the strategy is not so simple. Hospital-associated infections with methicillin-resistant Staphylococcus aureus are endemic in the Big Apple, and this has required a fresh approach to treat and prevent the costly bacterial menace.
At Mount Sinai Hospital, the strategy now involves SMRT Sequencing. Established in 2013, the Mount Sinai Pathogen Surveillance Program has sequenced more than 2,000 genomes, cataloging around 43,000 isolates from 22,000 patients. While its original role was in reactive outbreak investigation, it is now also used as a tool for proactive, continuous infection surveillance for common hospital pathogens.
As previously reported in this blog post and this webinar, adding SMRT Sequencing to routine surveillance of MRSA and C. difficile throughout the hospital has provided a more comprehensive view of drug resistance and revealed new pathogenic strains and unexpected transmission paths.
We recently caught up with Harm van Bakel, Assistant Professor of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, to learn more about how the program has evolved and some of the results published in this paper.
Why did you choose SMRT Sequencing?
MRSA is highly clonal. Traditional molecular strain typing methods, such as pulsed-field gel electrophoresis (PFGE), S. aureus protein A (spa) typing, and multilocus sequence typing (MLST), can facilitate rapid screening, but their resolution is limited, and they are unable to capture genetic changes that lead to alteration or loss of typing elements.
As a result, short-read whole genome sequencing has emerged as the gold standard for studying lineage evolution and nosocomial outbreaks. By comparing sequences of your clone of interest to a reference genome, you are able to profile changes in the core genome. However, you may still be missing crucial information contained in non-conserved ‘accessory’ genome elements, which harbor a lot of virulence and drug resistance determinants, and evolve more rapidly.
These accessory elements, which include endogenous prophages, mobile genetic elements, and plasmids, are very repetitive in nature, so long-read sequencing that is able to resolve these was needed to help us determine what’s going on in both the core and accessory genome elements. It also gives us additional information to tease apart the evolution of an outbreak.
In the paper, we gave an example of a persistent outbreak in the hospital’s neonatal intensive care unit, which was eventually traced to adult hospital wards, with ventilators as a potential vector. Long-read sequencing enabled comparative genome and gene expression analyses of the outbreak clone to hospital background strains, in which we identified genetic and epigenetic changes, including acquisition of accessory genome elements that may have contributed to the persistence of the outbreak clone.
What lessons have you learned from continuous surveillance?
Logistically, we have learned that integration and automation are key. In the beginning, we were a little naive. But we quickly realized that you can’t just analyze the genomes in a vacuum. Just identifying infection relatedness between two patients isn’t enough either. You have to understand how certain strains may have spread between patients. This requires detailed information about the patient’s condition, where they have been in the hospital, which staff and equipment they have come in contact with. We had to develop bioinformatics tools to layer genetic information on top of patient and hospital epidemiologic records to create a single, integrated map. We’ve found we need an entire support system in addition to the sequencing in order to make it work. We’re fortunate that our health system has the centralized testing and medical records systems to help us operate in an efficient manner.
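As a rough illustration of what layering genetic information on top of hospital records can look like, the following Python sketch flags patient pairs whose isolates are genetically close and who also stayed in the same ward at the same time. Everything here is an assumption made for illustration: the data layout, the function names, the made-up patients, and the 10-SNP threshold are hypothetical and are not taken from the Mount Sinai program's actual tools.

```python
from itertools import combinations


def overlapping_days(stay_a, stay_b):
    """Days two (ward, start_day, end_day) stays share in the same ward.

    Returns 0 if the wards differ or the date ranges do not overlap.
    """
    ward_a, start_a, end_a = stay_a
    ward_b, start_b, end_b = stay_b
    if ward_a != ward_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def candidate_transmissions(snp_distance, stays, max_snps=10):
    """List patient pairs that are genetically close AND shared a ward.

    `max_snps` is an arbitrary illustrative threshold, not a clinical one.
    """
    links = []
    for a, b in combinations(sorted(stays), 2):
        dist = snp_distance.get((a, b), snp_distance.get((b, a)))
        if dist is None or dist > max_snps:
            continue
        shared = [
            (sa[0], overlapping_days(sa, sb))
            for sa in stays[a]
            for sb in stays[b]
            if overlapping_days(sa, sb) > 0
        ]
        if shared:
            links.append((a, b, dist, shared))
    return links


# Made-up records: ward stays as (ward, first_day, last_day).
stays = {
    "P1": [("NICU", 1, 10)],
    "P2": [("NICU", 8, 15), ("Ward5", 16, 20)],
    "P3": [("Ward5", 18, 25)],
}
# Made-up pairwise SNP distances between the patients' isolates.
snp_distance = {("P1", "P2"): 3, ("P2", "P3"): 250, ("P1", "P3"): 248}

links = candidate_transmissions(snp_distance, stays)
print(links)
```

In this toy data, P2 and P3 overlapped in the same ward but carry distant strains, so they are not flagged. Only the pair that is both genetically close and epidemiologically linked is reported.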
Scientifically, we have learned that there’s a lot more happening under the surface than we were aware of — in regards to strain types, evolution, frequency, virulence and transmission. We continue to see under-the-radar outbreaks that we wouldn’t be aware of without a sequencing-based surveillance program. And we’ve learned that MRSA can be colonized for a long time and re-emerge weeks, if not months later, when infected patients return to the hospital.
The continuous surveillance has led to a much better understanding of how pathogens circulate throughout the hospital. It allows us to be more proactive. In some cases, it has led to interventions in certain wards. In the case of the NICU outbreak, if we had implemented continuous sequencing before it occurred, we may have been able to intervene much sooner.
We want to continue to improve the program so that it’s fast and cost effective enough to inform infection prevention. Not only does this benefit patient care, but it can help avoid larger outbreaks, ward closures, and other costs associated with investigating and reacting to infections.
We also want to build a more comprehensive view of all pathogens, so we have started to track viral pathogens too, including influenza.
And we have expanded the program beyond just Mount Sinai Hospital, into the entire Mount Sinai Health System, including other hospitals and community care facilities that cover most of the Manhattan area. We hope this will help us understand just how pathogens are moving throughout the system as patients travel from facility to facility, and to detect new pathogens emerging from the community.
We can’t make headway into reducing the MRSA endemic unless we understand it better. We need to better map the route of transmission of the pathogen between people, and in the environment. This is only possible through hospital-wide – and ideally region-wide – surveillance.
The DNA sequencing community lost one of its founding fathers last month with the death of Jo Messing, director of the Waksman Institute at Rutgers University.
Dr. Messing, who died at the age of 73, developed shotgun sequencing and the M13 sequencing vector used for cloning in the 1980s. Because he declined to patent this work, it was freely available and quickly became the foundation for a burgeoning molecular genetics field.
Dr. Messing’s scientific acumen and commitment to innovation in DNA sequencing remained a guiding force for the community throughout his life. In a PNAS paper published just a few days after his death, he and his colleagues presented some truly fascinating discoveries about the evolutionary and immune function of sequence repeats in the genome of Spirodela polyrhiza, an aquatic plant.
His dedication to generating high-quality genome assemblies for plants helped improve our understanding of maize, rice, sorghum, and other challenging genomes. A few years ago, we reported on his analysis of structural variation in maize, in which his team produced a highly accurate representation of copy number variation in the plant’s genome.
The PacBio team will remember Dr. Messing as a kind and generous scientist whose enthusiasm for our long-read sequencing technology was a true gift. He always remained humble no matter how great his accomplishments, and we count ourselves fortunate to have known him.
It was the first multicellular eukaryotic genome sequenced to apparent completion, but it turns out the Caenorhabditis elegans reference that’s been used as a resource for the past 20 years does not exactly correspond with any N2 strain that exists today.
Assembled using sequence data from N2 and CB1392 populations of uncertain lineage grown in at least two different laboratories during the 1980s and 1990s, the C. elegans reference genome is limited in accuracy both by genetic variants and by the technology of the time (clone-based Sanger sequencing). It is believed the strain may have accumulated up to 1,000 neutral mutations even before it was first frozen in 1969, and substantial genetic differences have arisen between strains in different laboratories since then.
So a team of researchers from Stanford, Cornell, and the University of Tokyo sought to recomplete the genome by performing long-read assembly of VC2010, a modern and easily available nonmutagenized derivative of N2. Not satisfied with the completeness of earlier assembly attempts, the team decided to use three sequencing technologies: Illumina short reads, as well as PacBio and Nanopore long reads.
As described in their cover-gracing Genome Research study, their VC2010 assembly has 99.98% identity to N2, but with an additional 1.8 Mb, including tandem repeat expansions, genome duplications, and more than 53 newfound genes. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2.
“Although we do not expect this or any assembly to be perfect, the VC2010 assembly provides substantial advantages over its predecessors in both precision and completeness,” the authors wrote.
The team assembled raw PacBio reads with the long-read genome assemblers Canu, FALCON, miniasm, and HINGE, yielding complementary assembly gaps from the same input sequencing data. Merging these PacBio assemblies resulted in an assembly containing only five gaps across the genome. Using Nanopore long reads, they were able to close three more, bringing the total gaps down to two.
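The idea of complementary gaps can be pictured with a toy model: if each assembler leaves gaps in different places, a position stays unresolved after merging only when every assembly has a gap there. The Python sketch below computes that intersection. The assembler labels, the coordinates, and the function names are hypothetical, and this is not the merging procedure the authors used; it only shows why assemblies with non-overlapping gaps can fill each other in.

```python
def intersect_gaps(gaps_a, gaps_b):
    """Intersect two sorted lists of half-open (start, end) gap intervals."""
    result = []
    i = j = 0
    while i < len(gaps_a) and j < len(gaps_b):
        start = max(gaps_a[i][0], gaps_b[j][0])
        end = min(gaps_a[i][1], gaps_b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if gaps_a[i][1] < gaps_b[j][1]:
            i += 1
        else:
            j += 1
    return result


def remaining_gaps(assemblies):
    """Gaps left after merging: positions that no assembly resolved."""
    names = list(assemblies)
    merged = assemblies[names[0]]
    for name in names[1:]:
        merged = intersect_gaps(merged, assemblies[name])
    return merged


# Made-up gap coordinates on a shared coordinate system.
assemblies = {
    "assembler_1": [(100, 200), (500, 650), (900, 950)],
    "assembler_2": [(150, 180), (600, 700)],
    "assembler_3": [(160, 400), (620, 640), (900, 950)],
}

print(remaining_gaps(assemblies))
```

In this example the gap at 900 to 950 is closed because one assembly resolved it, even though two others did not. Only regions missed by all three remain.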
The improved assembly “yielded features not visible in the N2 reference assembly, and has several technical and biological implications,” they added.
They also suggested that more of the nematode genetic record may need to be corrected. They note that almost 2% of the putatively gap-free C. elegans genome proved to be missing from the N2 assembly, including long stretches of repetitive DNA, and said “it seems likely that most of the nematode assemblies generated over the last decade are missing some repetitive regions of genomic DNA.”
Such highly tandemly repeated regions may be crucial for understanding fast-evolving gene families relevant to nematode ecology, and for identifying rapidly evolving virulence factors in parasites such as N. brasiliensis, they added.
“With the possible exception of highly reduced genomes such as Pratylenchus coffeae, long-read assembly will probably be needed to detect and resolve these systematically lost genome sequences,” they wrote.
In order to ensure reproducibility of their new assembly in vivo, the team derived a highly clonal strain from VC2010, called PD1074 (available here), and used it to generate most of the genomic sequence data.
“C. elegans researchers who wish to have significantly higher genomic and genetic reproducibility than is possible with N2 are encouraged to adopt PD1074 as a new reference strain for wild-type controls, classical mutagenesis, and genome engineering,” the authors wrote.
Reining in the Wild Strains
In a second Genome Research paper, researchers from Seoul National University generated a de novo assembly of CB4856, which is one of the most genetically divergent strains of C. elegans compared to the N2 reference strain.
Their study sought to determine how substantial genomic changes are generated and tolerated within a species, and to compare the wild strain with the N2 reference, as the two have numerous heritable phenotypic differences, including aggregation behavior, mating, nictation behavior, pathogen response and genetic incompatibility.
Not satisfied that the current, short-read-generated CB4856 reference genome accurately represents genomic rearrangements that are longer than the insert length, and concerned that it might be missing insertions and repetitive sequences, the team generated their own PacBio genome assembly to the level of pseudochromosomes containing 76 contigs.
They identified structural variations that affected as many as 2,694 genes, and found that subtelomeric regions contained the most extensive genomic rearrangements, even creating new subtelomeres in some cases.
The high variability of subtelomeres over generations facilitates the emergence of new genes and may help to increase the fitness of organisms, the authors note. However, subtelomeres — hypervariable regions adjacent to the telomere — are highly repetitive by nature, which makes genome assembly at their sites very difficult and has hampered the study of their involvement in chromosome evolution.
The subtelomere structure that the Korean team was able to unravel with PacBio sequencing implies that ancestral telomere damage was repaired by alternative lengthening of telomeres, even in the presence of a functional telomerase gene, and that a new subtelomere was formed by break-induced replication, the authors said.
“Our study demonstrates that substantial genomic changes including structural variations and new subtelomeres can be tolerated within a species, and that these changes may accumulate genetic diversity within a species,” they wrote.
The researchers said they hoped their CB4856 genome will serve as a better reference genome for wild C. elegans strains, and that the numerous SVs between N2 and CB4856 will help to better understand the effect of SVs on traits by association studies using these strains.
The National Human Genome Research Institute has awarded nearly $30 million for new sequencing and bioinformatics initiatives that aim to better represent the full range of human genetic diversity. An entirely new human reference genome — the “pangenome” — will be built from high-quality sequencing of 350 individuals from across the human population. Here at PacBio, we’re excited that highly accurate long-read WGS data from our Sequel II Systems will be an important component of this new program.
“It has grown more and more important to have a high-quality, highly usable human genome reference sequence that represents the diversity of human populations,” said Adam Felsenfeld, NHGRI program director in the Division of Genome Sciences, in a statement announcing the news.
The Human Pangenome Project will be carried out at a sequencing center and a reference center, each having extensive collaborations with scientists at many institutions. The University of California, Santa Cruz (UCSC) will form the sequencing center with teammates at the University of Washington, Washington University in St. Louis, and Rockefeller University. Their goal will be to use cutting-edge sequencing technologies, including SMRT Sequencing, to create higher-quality assemblies than have been produced in the past. A separate reference center will be created by Washington University in St. Louis, UCSC, and the European Bioinformatics Institute. Its aim will be to pull all 350 assemblies together into a useful representation for the genomics community.
“The human pangenome reference will be a key step forward for biomedical research and personalized medicine. Not only will we have 350 genomes representing human diversity, they will be vastly higher quality than previous genome sequences,” said UCSC’s David Haussler in a statement announcing these grants. “We are going to use all of the latest and best sequencing technologies and push their capabilities to get the most complete and accurate sequences possible.”
Evan Eichler, professor of genome science at the University of Washington and a Howard Hughes Medical Institute investigator, said in a statement: “We finally have the technology and methods to go after the parts of the human genome that were beyond our reach 20 years ago. It’s an exciting time for human genetics with implications for improved variant discovery associated with disease.”
For genome sequencing with PacBio technology, scientists will use the Sequel II System’s HiFi reads for highly accurate consensus sequencing of multi-kilobase inserts. In a recently posted preprint about a haploid human genome, scientists noted that HiFi mode successfully resolved segmental duplications (SDs), which are known for being difficult to assemble; in fact, authors stated that the HiFi assembly has “the highest fraction of resolved SDs for any of the published assemblies analyzed thus far.” They concluded, “Our results suggest that HiFi may currently be the most effective stand-alone technology for de novo assembly of human genomes.”
A hearty congratulations to all the scientists who will be involved in this new initiative! We look forward to working closely with these teams as they generate much-needed data for the biomedical research community.
It was the coolest critter Erin Bernberg (@ErinBernberg) had ever worked with – quite literally.
The senior scientist at the University of Delaware Sequencing and Genotyping Center, a PacBio certified service provider, received a shipment of tiny, live ice worms from Washington State University and immediately faced several challenges. How would she get them out of their ice cubes? How would she isolate DNA from the delicate, dark pigmented creatures? And would she be able to extract enough DNA to sequence?
Thanks to the new PacBio low DNA input protocol, the answer to the last question was yes. In fact, Bernberg was able to create a library with enough yield for up to 17 SMRT Cells 1M from only 500 ng of DNA.
She detailed her process for extracting ice worm DNA and shared some of the results during a recent webinar about the new low DNA input workflow.
“Low-input works,” Bernberg said. “You don’t need as much DNA as everyone is worrying about with PacBio. You can generate good data with it.”
Tiny body, giant genome
The ice worm may be tiny at ~1.5 cm, but its genome is giant compared to other annelids, around 1.5 Gb.
Previous attempts to sequence the genome failed because it was “inconveniently large” and the DNA came from pools of several specimens, which clouded the assembly with individual-specific noise, according to Scott Hotaling (@mtn_science), a postdoctoral researcher in the lab of Joanna Kelley at Washington State University who is studying the ice worm as part of his research into species that have adapted to the harsh conditions in extreme environments, like polar regions.
Hotaling said he is excited to delve into the 160 Gb of data generated by Bernberg’s 10 SMRT Cells. He said the genome they have assembled is around 1,000-fold more contiguous than that of its closest relative, the earthworm.
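For a sense of scale, the figures quoted in this post imply a sequencing depth of roughly 100-fold. The short Python calculation below is a back-of-the-envelope sketch. It assumes that "160 Gb of data" means gigabases of sequence and that the genome is about 1.5 Gb, as stated earlier in the post; the function name is hypothetical.

```python
def fold_coverage(total_bases_gb, genome_size_gb):
    """Average depth: total sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


total_gb = 160.0   # data from the 10 SMRT Cells, as quoted in the post
genome_gb = 1.5    # approximate ice worm genome size, as quoted in the post
cells = 10

print(round(fold_coverage(total_gb, genome_gb)))  # about 107-fold
print(total_gb / cells)                           # 16.0 Gb per SMRT Cell on average
```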
Why is contiguity important in the ice worm genome? There are so many unknowns when it comes to annelid genetics, Hotaling said. As an example, he cited AMP deaminase, a known regulator of energy metabolism that is suspected to play a part in the ice worm’s unique method of thermal regulation.
“Ice worms do this cool thing where they actually ramp up their energy levels as they get colder, which is essentially the inverse of what a lot of organisms do,” Hotaling said.
A partial sequence of AMP deaminase (540 amino acids long) had been previously generated for ice worms, so Hotaling was able to easily locate it in his new PacBio-generated genome. In doing so, he discovered just how much information had been missing: “tons.”
“Imagine this across 20,000 genes. There’s so much more power to look from gene to gene at variation that might be evolutionarily relevant,” he said.
Evolution at the extremes
Mountain glacier habitats may look desolate, but there’s a lot going on below the surface. Not only are there millions of microbes, but also snow algal blooms and other lifeforms. Among them, ice worms (Mesenchytraeus solifugus) are the largest to spend their life cycle in ice.
The ice worms that Hotaling studies on Mt. Rainier survive extremely harsh conditions. Tens of millions of ice worms can live on a single glacier. Not only do they face constant cold stress, but the highly reflective ice and snow surfaces make the glaciers some of the most UV intensive places in the world.
It’s unclear how they have adapted to such an extreme lifestyle, but such knowledge could provide insight into evolutionary processes, Hotaling said. He wants to learn whether their extreme lifestyle translates to extreme physiology, and how. He will be exploring which genes and pathways have been under selection in ice worm genomes, and whether there is congruence between genes under selection and those differentially expressed in response to stress.
“We want to know things like can they freeze? It seems like a dumb question because they live in ice, but scientific observations from 30 years ago suggest they can only handle temperatures within a few degrees of freezing,” Hotaling said.
Early experiments suggest the worms are living at the absolute lower limit of their thermal tolerance, and they also exhibit a high tolerance for UV. His team is now fleshing out the picture by layering RNA-based annotations on top of the main sequencing data.
“Lineages like the ice worm are pretty far from anything else that’s been sequenced. So just looking at what’s in there is, in its own right, a pretty interesting pursuit. But we’d also like to add some context to look at genome evolution and gene expression,” he said.
He is also keen to taxonomically annotate the genome to capture contamination from other organisms that the ice worms may have hosted.
“We sequenced an entire worm – all of its gut contents, all of its parasites, whatever was on its body. So contamination is almost a certainty,” he said. “I think this is going to be a big challenge for anyone working in this space. How do we develop tools to really deal with this in a high-powered, efficient way?”
For further information on this topic and specifics of the low DNA input workflow, watch the full webinar and read the application note referenced in the presentation.
Great news from the rare disease community: the European research program SOLVE-RD has chosen SMRT Sequencing technology to help reveal the genetic mechanisms responsible for these tough-to-diagnose genetic diseases. As part of this work, scientists will sequence more than 500 whole human genomes with the PacBio Sequel II System to pinpoint disease-causing variants.
The SOLVE-RD research program, a consortium of more than 20 institutions funded with a five-year, €15 million award from the European Union’s Horizon 2020 initiative, aims to improve the diagnosis and treatment of rare diseases by applying novel tools to cases that were not solved with short-read exome sequencing.
In a press release issued today, Alexander Hoischen, Associate Professor for Genomic Technologies and Immuno-Genomics and a member of the SOLVE-RD team at Radboud University Medical Center, stated: “Even with exome sequencing, as many as 50% of rare disease cases remain unsolved. The SOLVE-RD team believes that long-read SMRT Sequencing will be essential for discovering the causal elements that have proven elusive with previous approaches, and we anticipate that this research will ultimately make it easier for doctors to diagnose other patients with these rare diseases in the future.”
The sequencing will be performed at Radboud University Medical Center. Marcel Nelen, Laboratory Specialist in Genome Diagnostics, commented in the press release: “Our team is eager to deploy PacBio’s Sequel II System to generate hundreds of high-quality human genomes for phenotypes very likely to be associated with challenging genomic regions or structural variants including repeat expansions. In our experience, SMRT Sequencing reliably detects far more structural variants — including pathogenic variants — than any other sequencing technology.”
Attendees of this week’s AGBT Precision Health meeting in La Jolla, Calif., can learn more about this project from Hoischen’s presentation on Saturday, Sept. 7th, at 10 am PDT.
Until recently, enriching for certain regions of the genome has been virtually impossible. Repeat expansions, extreme GC regions, and other genomic elements are very difficult to target using traditional enrichment methods. That’s why our new “No-Amp” targeted sequencing application — a streamlined, amplification-free approach based on the CRISPR/Cas9 system — is a valuable addition to the SMRT Sequencing toolbox.
The method was demonstrated in a recent PLoS One publication, and a new webinar delves into technical details of the protocol. Hosted by our own Paul Kotturi and Jenny Ekholm, the presentation offers an overview of uses for which the No-Amp method is beneficial, real-world examples of its results, and advantages it holds compared to traditionally used PCR and Southern blot techniques.
Kotturi kicked off the presentation with a look at the general advantages of SMRT Sequencing, including long reads, high accuracy, single-molecule resolution, simultaneous epigenetic detection, and uniform coverage. He also noted some recent performance metrics from the new Sequel II System: more than half of the data is in reads >190 kb, and each SMRT Cell 8M generates up to 160 Gb of sequence data. With the HiFi sequencing mode that makes use of circular consensus sequencing, the system can achieve Q30 accuracy with just eight passes around a molecule.
Next, Ekholm stepped in to focus on the No-Amp application. Generating a sequencing library using the No-Amp method is relatively straightforward. The first step is to block the 5’ and 3’ ends of the genomic DNA, followed by the CRISPR/Cas9 digestion. To enrich for the region of interest, guide RNAs are designed flanking each end of the targeted region, making the cut ends available for sequencing adapter ligation after the Cas9 digestion. The sequencing library is then cleaned up before sequencing. The No-Amp method takes two days (with less than four hours of hands-on time) and is compatible with both the Sequel System and the Sequel II System.
Users of the No-Amp method can multiplex target regions, samples, or both to maximize sequencing efficiency and minimize cost. Typical target insert sizes range from 4 kb to 6 kb, though scientists have successfully extracted even longer fragments with this process, Ekholm noted. The expected yield is hundreds of Q20 reads per target and the on-target rate for the No-Amp method is 40-60%, which translates to enrichment factors of 10,000-100,000 fold.
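To see how an on-target rate in that range can correspond to such large enrichment factors, here is a minimal sketch of the arithmetic. It assumes enrichment is defined as the observed on-target fraction divided by the fraction of the genome the targets occupy. The genome size (about 3.1 Gb) and the four 5 kb targets are illustrative assumptions, not figures from the webinar, and `fold_enrichment` is a hypothetical helper name.

```python
def fold_enrichment(on_target_fraction, target_bases, genome_bases):
    """Observed on-target fraction divided by the fraction expected by chance."""
    expected_fraction = target_bases / genome_bases
    return on_target_fraction / expected_fraction


# Illustrative assumptions only: a ~3.1 Gb human genome and four 5 kb targets.
GENOME_BASES = 3_100_000_000
TARGET_BASES = 4 * 5_000

for rate in (0.40, 0.60):
    fold = fold_enrichment(rate, TARGET_BASES, GENOME_BASES)
    print(f"on-target rate {rate:.0%}: roughly {fold:,.0f}-fold enrichment")
```

Under these assumptions the result lands in the tens of thousands, inside the range Ekholm quoted. Smaller or fewer targets would push the factor higher, larger panels lower.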
Later in the webinar, Kotturi discussed elements needed for this protocol: high-purity, high molecular weight DNA; 5-10 µg of DNA per SMRT Cell, but only 1-2 µg per sample when multiplexing 5-10 samples per run; guide RNAs; barcoded adapters, if multiplexing samples; and a No-Amp accessory kit with primers and buffers. He also presented information about cost. In a five-sample multiplex workflow, the cost (U.S. list price) comes to $220 per sample. When multiplexing increases to 10 samples, the per-sample cost drops to $130. When multiplexing multiple targets per sample, these costs drop even further per locus. At PacBio, we routinely run 4 targets per sample.
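The per-locus saving can be worked out directly from the quoted prices. This small sketch assumes the per-sample price stays the same when four targets are captured per sample, which the webinar summary implies but does not state outright. `cost_per_locus` is a hypothetical helper.

```python
def cost_per_locus(per_sample_cost_usd, targets_per_sample):
    """Spread one sample's cost evenly across the loci captured from it."""
    return per_sample_cost_usd / targets_per_sample


# Plex level -> U.S. list price per sample, as quoted in the webinar.
PER_SAMPLE_COST_USD = {5: 220, 10: 130}
TARGETS_PER_SAMPLE = 4  # the number PacBio says it routinely runs

for plex, per_sample in PER_SAMPLE_COST_USD.items():
    per_locus = cost_per_locus(per_sample, TARGETS_PER_SAMPLE)
    print(f"{plex}-plex: ${per_sample} per sample, ${per_locus:.2f} per locus")
```

With four targets per sample, the quoted prices work out to $55.00 per locus at 5-plex and $32.50 per locus at 10-plex.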
If your research would benefit from capturing and sequencing regions that are otherwise intractable, this webinar is well worth your time. It also includes valuable information about data analysis and visualization, specific examples of targeting disease-associated repeat expansion regions, and much more.
Watch the complete webinar and visit www.pacb.com/noamp to learn more.
With her distinctive dark eyeshadow, grey lipstick-like markings and delicate disposition, she was a natural film star. And her life certainly provided enough drama for any Hollywood blockbuster, complete with high-speed boat chases in pursuit of black market “cocaine of the sea” cartels. Unfortunately, her ending was not a happy one. But efforts by an international consortium of conservation geneticists are making sure her legacy isn’t lost.
The DNA of one of the last remaining vaquita porpoises in the world has been preserved and decoded, as part of an ambitious project to create chromosomal-level genome assemblies of all extant vertebrate species on Earth, 70,000 in total.
Members of the Vertebrate Genomes Project (VGP), an international consortium of more than 150 scientists from 50 academic, industry and government institutions in 12 countries, and the Earth Biogenome Project (EBP) gathered in New York on August 27 to announce the completion of the first 100 genomes, including several species of critical conservation and scientific interest. The assemblies represent 77 orders sequenced to such completeness for the first time, which, along with 13 from the previous data set, add up to a total of 90 of the 260 orders the group is seeking to sequence as Phase 1 of the project.
The most endangered among them is the vaquita. It is estimated that fewer than 20 remain in the world. The female whose sample contributed to the VGP effort died shortly after a rescue attempt by the group Vaquita CPR. It is hoped that her legacy will live on in the information extracted from her DNA; it has already provided insight into breeding patterns of Phocoena sinus, whose habitat is limited to a small area in the northern Gulf of California.
“We hope and trust that useful information will result that may benefit other endangered species of threatened porpoises. And we are saddened to think that one day, these tissue samples may be all that is left of this animal,” said Oliver Ryder, Ph.D., director of Conservation Genetics for San Diego Zoo Global, where the vaquita’s tissue was taken to be stored in its Frozen Zoo repository.
Her plight was featured in a documentary film produced by Hollywood star Leonardo DiCaprio, Sea of Shadows, which follows the efforts of Mexican police forces to crack down on the vaquita’s biggest threat: illegal poaching of the totoaba fish (Totoaba macdonaldi), whose swim bladders, or maws, fetch high prices in Chinese markets for their use in traditional medicine. The tiny vaquita – the smallest member of the cetacean order that also includes whales, dolphins and other porpoises – gets trapped in the gillnets used by poachers.
Another species championed by DiCaprio, the Bolson tortoise (Gopherus flavomarginatus), also made the VGP sequencing list, as did two other critically endangered species (European eel and Smalltooth sawfish); seven endangered species (Blue whale, Grey crowned-crane, Green sea turtle, Atlantic halibut, Ring-tailed lemur, Chimpanzee and Golden arowana); and eight vulnerable species (Sterlet, Thorny skate, Siamese fighting fish, Abyssinian ground hornbill, Atlantic cod, European turtle dove, Marmoset monkey and Red-bellied piranha).
Origin of a Species
Beyond conservation, the genomes of some of the species may also shed light on fundamental processes of evolution. Often referred to as ‘living dinosaurs’, leatherback sea turtles (Dermochelys coriacea) are an ancient lineage that possess unique physiological adaptations, including those that allow them to survive in cold waters exploiting habitats far beyond many other ectotherms.
“Their populations having declined by greater than 90%, Pacific leatherbacks are one of eight species among the most at risk of extinction in the near future protected by the United States NOAA under the Endangered Species Act,” said Lisa M. Komoroske, Assistant Professor of Conservation Genomics & Ecophysiology at the University of Massachusetts, Amherst.
She will be using the VGP genome to study the remaining genetic diversity in the species and to inform new leatherback conservation initiatives, including translocation across oceans to enable genetic mixing with other populations to avoid excessive inbreeding.
Weird and Wondrous
Among the species studied are a few truly unique – and strange – creatures.
The Great Potoo (aka Nyctibius grandis) may have a silly name and an equally cartoonish look, with huge eyes and a ginormous mouth, but these birds are no joke. They are generally heard rather than seen. During the day, they remain motionless, mimicking broken tree branches. At night, the nocturnal creatures make unsettling sounds that haunt the Neotropics, with mouths open wide to catch passing insects, occasionally moving to pounce on other prey in quick sallies.
Ubiquitous but Useful
Other genomes, like the chicken, may seem mundane, but could prove vital to agriculture and biomedical research.
The most commonly studied avian genome in these areas, the chicken genome is getting an important upgrade as part of the project. It is one of 12 genomes that reflect the DNA of both parents. Using a process called “trio binning”, the DNA of the parents is used to separate the DNA sequences of the child’s chromosomes to assemble two genomes (one each from mother and father) from one individual. Based on an assembly approach developed by Sergey Koren and Arang Rhie of the Adam Phillippy Lab at the National Human Genome Research Institute, these trio-based assemblies are 40-60% better than the non-trio based assemblies at separating out parentally-inherited DNA.
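The core idea of trio binning can be sketched in a few lines. This is a toy illustration, not the tool developed by Koren, Rhie and colleagues. It assumes error-free sequences, ignores reverse complements, and uses made-up parental sequences and a very short k-mer length. All function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return k-mers seen only in the mother and k-mers seen only in the father."""
    maternal = set().union(*(kmers(s, k) for s in maternal_seqs))
    paternal = set().union(*(kmers(s, k) for s in paternal_seqs))
    return maternal - paternal, paternal - maternal


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a child read to the parent whose private k-mers it carries more of."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Made-up parental sequences that differ at a single base.
K = 5
mat_only, pat_only = parent_specific_kmers(["ACGTACGTTAGC"], ["ACGTACCTTAGC"], K)

for read in ("GTACGTTAG", "GTACCTTAG", "ACGTAC"):
    print(read, "->", bin_read(read, mat_only, pat_only, K))
```

Once the child's reads are sorted this way, each bin can be assembled on its own, which is how one individual yields two haplotype-resolved genomes.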
Approximately $600 million will be needed to complete the VGP project. Crowdsourcing among scientists has so far raised $4.8 million of the $6 million needed for Phase 1.
Get more information about the first 100 species.
An ambitious project to sequence 5,000 microbial genomes was jointly initiated by a consortium of 10 institutions across China, including Nankai University, China CDC, Academy of Military Medical Science, Third Institute of Oceanography-Ministry of Natural Resources, South China Sea Institute of Oceanology-CAS, China National Center for Food Safety Risk Assessment, Shandong University, Tianjin University of Science & Technology, East China University of Science and Technology, and Tianjin Biochip Corporation (TBC).
TBC, a PacBio service provider in China, has led the sequencing phase of the project, which is expected to be completed by the end of 2019. We recently sat down with Sun Yamin, general manager of TBC, to learn more about the project.
What’s the difference between the Prokaryotes 5,000 Complete Genomes Project (P5KCGP) and other microbial sequencing projects?
Previous microbial genome projects were scattered and typically based on one researcher’s own interests and directions. As a result, many common microbial species’ genomes have been sequenced repeatedly, while less commonly studied microbial species have still not been sequenced at all.
The current microbial genome database has an obvious species imbalance. Many microbial genomes have only low-quality genomic scaffolds. Our goal is to create a genomic database that covers a much broader array of microbial diversity, including pathogenic microorganisms, food safety microbes, marine microbes, and terrestrial resource microbes.
We expect to add at least 500 new microbial genomes that are currently not found in the NCBI database by the completion of the project. Our goal is to submit a high-quality, closed genome with no gaps for each of the 5,000 microbial genomes included in our project. In order to achieve this goal, we chose the PacBio Sequel System as our sequencing platform, as SMRT Sequencing technology combines long read lengths, high accuracy, and no GC content bias.
At present, only the Sequel System can meet our project requirements, given the challenges presented by many bacterial genomes. Using the latest version 3.0 reagents, the average read length of 22 kb on the Sequel System is sufficient to span repeats that can be more than a dozen kilobases in length in some bacterial genomes. In addition, we have seen GC content up to 70% in microbial samples we’ve sequenced. Even so, assembly can be accomplished easily with PacBio data.
What is the significance of the P5KCGP project?
While microorganisms were the first genomes to be sequenced by scientists, the sum of all microbial sequencing data worldwide is less than the amount of data produced by a laboratory that performs human genome sequencing. Although the genomes of microorganisms are relatively small, the enormous species and functional diversity of microorganisms in nature means that microbial genomics has not been given sufficient attention. For pathogenic and foodborne microorganisms in particular, it is important to have reference-quality genomes.
What challenges has the P5KCGP project encountered?
1) Sample collection. On average, each partner needs to provide 400-500 microorganisms. Since our goal was to include bacterial species that are rare in nature, it can take a long time to isolate and grow samples.
2) Controlling costs. Generating closed microbial genomes requires more resources than simply coming up with a bunch of draft genomes. To manage sequencing costs, we have succeeded in multiplexing 16 microbial samples on each SMRT Cell 1M by optimizing the library preparation process.
3) Dealing with difficult-to-sequence microbes. The habitat of microorganisms in nature is diverse, and some live in extreme environments requiring quite high GC content in their genomes. Sequencing of such microbial genomes is more difficult.
What groundwork does this project lay for future research efforts?
We want to better understand how microbes that are widely distributed in nature have evolved and adapted to diverse environments with the much more complete survey of microbial genomes made available through this sequencing project. In addition, some rare microorganisms living in extreme environments often have potential industrial value. Two examples sequenced through this project are the extremely acidophilic methanotroph isolate V4, Methylacidiphilum infernorum, and Geobacillus thermodenitrificans.
Patients with myotonic dystrophy type 1 (DM1) want to know their size — the size of the expansion of repeats of the unstable CTG sequences that cause the progressive deterioration of neuromuscular functions that they might face.
Size matters to them, because it has been found to correlate with the severity and onset of symptoms, which can range from severe cardiac and respiratory abnormalities and intellectual impairment in children, to muscle weakness, hypersomnolence or cataracts in adults. The earlier the onset, the more severe the symptoms tend to be. The autosomal disorder, which is the most common form of inherited muscular dystrophy in adults, also tends to get progressively worse with each generation. But the manifestations vary widely between patients, and even within families, making it extremely difficult to predict how it will affect any individual.
Stéphanie Tomé would like to arm genetic counselors with more information to help patients navigate through their difficult diagnoses and prognoses, and to inform their decisions about their own lives and those of their offspring. Ultimately, she would also like to be able to provide them with new options to manage or even alter their diseases.
To do so, she needs to be able to read the repeats, which can be encoded in sections as large as 3,000 triplets. So she has turned to PacBio SMRT Sequencing, which is capable of capturing sequences of long stretches of DNA, including complete regions of repeats found in patients with DM1 and other expansion disorders, such as Huntington’s Disease and Fragile X.
Tomé, an investigator at the Centre de Recherche en Myologie at Sorbonne Université/INSERM in Paris, is the winner of the 2019 Targeted Sequencing SMRT Grant. Along with the 10 other scientists in her research group, led by Geneviève Gourdon, and collaborators from around the world, Tomé will sequence sections of mutated genes in DM1 patients to determine the exact size and pattern of CTG repeats.
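As a rough illustration of what "size and pattern" means in practice, the sketch below summarizes a repeat tract read in triplets. It is a simplified assumption-laden example, not the group's analysis pipeline: the flanking sequence is assumed to be trimmed already, the tract is read in a fixed frame (so an insertion or deletion would shift everything after it), and `summarize_repeat_tract` is a hypothetical name.

```python
def summarize_repeat_tract(tract, unit="CTG"):
    """Count repeat units in a tract and list any triplets that interrupt it."""
    tract = tract.upper()
    step = len(unit)
    # Read the tract in frame; a trailing partial unit is ignored.
    triplets = [tract[i:i + step] for i in range(0, len(tract) - step + 1, step)]
    interruptions = [(index, t) for index, t in enumerate(triplets) if t != unit]

    longest = current = 0
    for t in triplets:
        current = current + 1 if t == unit else 0
        longest = max(longest, current)

    return {
        "units": len(triplets),
        "pure_units": len(triplets) - len(interruptions),
        "longest_pure_run": longest,
        "interruptions": interruptions,
    }


# Made-up tract: five CTGs, one CCG interruption, three more CTGs.
example = "CTG" * 5 + "CCG" + "CTG" * 3
print(summarize_repeat_tract(example))
```

A read spanning the whole expansion is what makes this kind of summary possible: the size comes from the unit count, and the pattern comes from where the interruptions fall.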
“We have some idea of what may be going on at either end of these regions, but we don’t have any information about what is happening in the middle,” Tomé said. “Improving our knowledge of the entire repeat sequence will help us make clearer correlations between the genetic instability and the clinical manifestations of DM1.”
Information generated in the project could also help researchers advance their understanding of some of the mechanisms behind the degenerative disorder.
If the disorder is characterized by long lengths of trinucleotides gone haywire, then it would be advantageous to be able to shrink the repeat regions back down to an asymptomatic size. Researchers have found cases where the regions have naturally contracted, and others where there are interruptions in the repeat codes.
Tomé and colleagues are pursuing this avenue of research, hoping to be able to harness knowledge about contractions and/or interruptions to induce them as a way to prevent and/or treat DM1 and other disorders. Tomé said drug screens on mouse models have already identified some potential compounds that could induce contraction, but they need to be tested and modified for use in humans.
Putting it in Perspective
Tomé admits that the data gathered from this project will likely not lead to immediate solutions, but it could provide some immediate relief to patients hungry for more insight into their disorder. And she hopes that SMRT Sequencing could become an alternative method of molecular diagnostics to ameliorate the prognosis and counseling offered to patients.
“Currently, the clinical labs tend to use Triplet Prime PCR. With this technique, we can say whether a patient is going to become sick or not sick, but it’s difficult to provide any sort of prognosis,” Tomé said. “To be able to give the patient more precise information, quickly, is very important, I think. Many patients are anxious, and don’t understand why there is so much variability between their son and daughter. They want to know.”
By collaborating with clinicians and a multidisciplinary group of 10 teams at Centre de Recherche en Myologie, Tomé embraces any opportunity to get different perspectives on the disorder, including the patient perspective.
“It’s very interesting to talk to the patients. By staying in the lab, you can lose sight of the bigger picture. By leaving the lab, you get new ideas, you learn more about what the problems are and what you might be able to do to improve the lives of patients,” Tomé said.
As the behavior of repeat regions appears similar between triplet diseases, Tomé said the project’s findings might also be applicable to 13 other expanded repeat disorders.
“This widens the potential impact of our study considerably,” she said.
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor and Certified Service Provider, the McDonnell Genome Institute at Washington University in St. Louis, for supporting the 2019 Targeted Sequencing SMRT Grant Program.
The annual meeting of the European Society of Human Genetics — held last month in the sleek Swedish Exhibition & Congress Center in Gothenburg, Sweden — was a terrific assembly of thousands of scientists who are together pushing the boundaries of what’s possible in genome research. The PacBio team particularly enjoyed seeing so many impressive ESHG presentations with scientific results from SMRT Sequencing pipelines featuring applications such as de novo whole genome sequencing, structural variant detection, the Iso-Seq method, and targeted sequencing.
For example, in a plenary talk, the University of Southern California’s Mark Chaisson (@mjpchaisson) spoke about using long-read PacBio sequencing to analyze structural variation across human genomes. Representing the Human Genome Structural Variation Consortium, he talked about the growing number of available de novo sequenced human genomes, along with the need to characterize their complete universe of structural variants, many of which are missed in short-read assemblies.
Chaisson presented results from trio sequencing projects run by the consortium, showing that this approach allows for reliable and accurate phasing even of large structural variants, thanks to the use of long-read data. He noted that with the PacBio Sequel II System, it is now feasible to fully sequence a human genome in a single run. Chaisson concluded, “We are now in a realm where large scale human genome sequencing studies can be done using a long-read approach.”
Other great presentations came from Jozef Gecz at the University of Adelaide, who spoke about a repeat expansion associated with a heritable form of epilepsy, and Michael Talkowski of the Broad Institute, who presented on structural variant discovery and the use of sequencing systems for genomic medicine. There were also several posters from PacBio users with exciting results, such as a Swedish reference genome, clinical sequencing of the SMA gene, and amplification-free sequencing of a repeat expansion that causes corneal dystrophy.
Our own team presented posters at ESHG as well. Billy Rowell shared “Comprehensive Variant Detection in a Human Genome with Highly Accurate Long Reads” while Jenny Ekholm presented “Sequencing the Previously Unsequenceable Using Amplification-free Targeted Enrichment Powered by CRISPR/Cas9.”
We’d like to thank all of the scientists who checked out our posters or stopped by the PacBio booth to learn more about SMRT Sequencing applications and the new Sequel II System. We appreciate your time and interest!
A new preprint evaluates the utility of PacBio HiFi reads for assembly of a human genome. The study is a follow-up to a recent publication in Nature Biotechnology that introduced a technique to generate sequencing reads with both long read length and high accuracy.
“Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads” comes from lead authors Mitchell Vollger and Glennis Logsdon, senior author Evan Eichler, and collaborators at the University of Washington, PacBio, and other research institutes. For this project, they focused on sequencing a hydatidiform mole human cell line (CHM13), a useful model system because it is haploid unlike typical diploid human cells. “We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets,” the scientists write.
The team generated 24-fold coverage of CHM13, the same sample used to produce a previous assembly with CLR data. They employed the Sequel II System, producing an average of 19.1 Gb of HiFi reads with each SMRT Cell 8M. The HiFi and CLR assemblies had similar contiguity: contig N50 of 29.5 Mb for HiFi and 29.3 Mb for CLR. The HiFi assembly was much more accurate, with an estimated Phred quality value of Q45, compared to Q40 for the CLR assembly. Further, the authors note that, due to divergence in BAC clones used to measure accuracy, the quality value for the HiFi assembly is “a lower bound of the true QV.”
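For readers less familiar with the two metrics quoted here, this short sketch shows the standard definitions: a Phred quality value Q corresponds to an error probability of 10^(-Q/10), and contig N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly. The contig lengths in the example are made up for illustration.

```python
def phred_to_error_rate(q):
    """Convert a Phred quality value to a per-base error probability."""
    return 10 ** (-q / 10)


def n50(lengths):
    """Return the contig length at which half the total assembly is covered."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig length")


for q in (30, 40, 45):
    rate = phred_to_error_rate(q)
    print(f"Q{q}: about 1 error per {1 / rate:,.0f} bases")

# Made-up contig lengths, for illustration only.
print("N50:", n50([10, 20, 30, 40]))
```

On this scale, the move from Q40 to Q45 means roughly three times fewer errors per base, even though the two numbers look close.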
Next, the scientists performed an analysis of segmental duplications (SDs), which are notoriously challenging elements to assemble correctly. The HiFi assembly resolved more of these duplications than the CLR assembly. “HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of large tandem repeats, as validated with orthogonal analyses… This is the highest fraction of resolved SDs for any of the published assemblies analyzed thus far,” they report.
“We conclude that there are three essential strengths of the HiFi technology over CLR technology,” the authors write, citing reduced compute time to generate a de novo assembly, superior assembly accuracy, and improved ability to assemble the most difficult regions of the genome. “Our results suggest that HiFi may currently be the most effective stand-alone technology for de novo assembly of human genomes.”
Crucial assembly sites and mitosis mediators, centromeres are central to every cell, but missing from even the most complete genome assemblies.
In a PLOS Biology paper, Amanda Larracuente and colleagues at the University of Rochester and Barbara G. Mellone of the University of Connecticut described how they sequenced the repetitive regions of the fruit fly genome, including its centromeres, using SMRT Sequencing.
Embedded in blocks of highly repetitive satellite DNA, centromeres have eluded efforts at assembly.
Only recently, long-read single molecule sequencing technologies have made it possible to obtain assemblies of highly repetitive parts of multicellular genomes such as the human Y chromosome centromere and maize centromere 10. This is the first time researchers have sequenced all the centromeres in any multicellular organism.
“Our study shows that combining long-read sequencing with ChIP-seq and chromatin fiber FISH is a powerful approach to discover centromeric DNA sequences and their organization,” the authors wrote. “Our overall strategy therefore provides a blueprint for determining the composition and organization of centromeric DNA in other species.”
Drosophila melanogaster proved the ideal model to investigate centromere genomic organization, as it has a relatively small genome (roughly 180 Mb), organized in just three autosomes (chromosome 2, 3, and 4) and two sex chromosomes (X and Y). The estimated centromere sizes in Drosophila cultured cells range between 200 and 500 kb and map to regions within large blocks of tandem repeats.
It has been believed that satellites are likely the major structural elements of Drosophila, human and mouse centromeres. By tracking the histone H3 variant centromere protein A (CENP-A), the team was able to identify the fruit fly centromeres and found that they primarily occupy islands of complex DNA enriched in retroelements flanked by large blocks of simple satellites. They estimate that approximately 70% of the functional centromeric DNA of D. melanogaster is composed of complex DNA islands, which are rich in non-LTR retroelements and buried within large blocks of tandem repeats.
“They likely went undetected in previous studies of centromere organization because three of the five islands are either missing or incomplete in the published reference D. melanogaster genome … having an improved reference genome assembly is crucial for identifying centromeric DNA sequences,” the authors state.
The retroelements they found were not merely present near centromeres, but were components of the active centromere cores.
“Why retroelements are such ubiquitous components of centromeres and whether they play an active role in centromere function remain open questions,” the authors wrote.
Additional avenues worth exploring include identifying associated tandem repeats, as well as mapping the span of the CENP-A domain and its binding sites.
“Knowing the identity of D. melanogaster centromeric DNA will enable the functional interrogation of these elements in this powerhouse model organism,” the authors wrote.
To enable better understanding of biology, sequencing data must be accurate and complete. This is especially true when seeking out variants and determining their implications.
Luckily, technical and software improvements for SMRT Sequencing are making it easier to efficiently generate genome assemblies with unparalleled accuracy.
As presented in a webinar by PacBio Staff Scientist Sarah Kingan (@drsarahdoom) and GoogleAI Genomics Project Lead Andrew Carroll (@acarroll_ATG), HiFi reads enabled by circular consensus sequencing (CCS) on the new Sequel II System challenge the notion that sequencing technologies require a tradeoff between length and accuracy.
Kingan highlighted several benefits to using HiFi data for genome assembly:
- Higher accuracy of assemblies due to the high inherent base quality of HiFi reads
- Dramatic time-savings in generating a genome assembly
- Algorithmic improvements in the FALCON assembler that enhance the performance of HiFi assemblies
HiFi reads are extremely accurate because they utilize single-molecule consensus, rather than multiple-molecule consensus, which is required for traditional long-read assembly methods. The resulting HiFi assemblies have higher base accuracy than assemblies produced by continuous long reads.
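The intuition behind single-molecule consensus can be shown with a toy majority vote. This is only a sketch: the actual CCS algorithm aligns the passes and uses a statistical model, whereas this example assumes the passes are already aligned, equal in length, and differ only by substitutions. The pass sequences are invented, and `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(passes):
    """Call each position by majority vote across pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five invented passes over the same molecule; two carry a random error each.
passes = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",  # substitution at position 2
    "ACGTTAGG",  # substitution at position 7
    "ACGTTAGC",
]
print(majority_consensus(passes))
```

Because errors in individual passes fall at different positions, they are outvoted by the correct bases in the other passes, which is why accuracy rises with the number of passes around the molecule.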
HiFi reads are also more efficiently produced by CCS due to algorithmic enhancements that reduce compute time. With the upcoming software release, CCS for a single SMRT Cell 8M run on the Sequel II System is expected to finish in 3.5 hours.
Because the HiFi reads are already error corrected, the genome assembly process is simplified and streamlined, requiring only 20% of the compute time for a human genome compared to a continuous long read assembly.
HiFi data needed HiFi-ready assembly tools
In order to make the most of these improvements, some assembly and analytical programs have also been modified.
While testing the system on several human and animal genomes, Kingan said the PacBio team achieved equivalent or higher contiguity in multiple species, such as the fruit fly and bluefin tuna. But in a complex plant genome such as rice, with its multitude of repeat-induced overlaps, the results weren’t as robust.
So Kingan and colleagues modified the FALCON-Unzip assembler to make the most of the higher accuracy HiFi reads. By ignoring indel differences, they were able to better assemble the plant genomes. These latest features will be added soon to the already-incorporated improvements of faster read tracking and polishing.
Deep learning digs deeper
When it comes to assessing the “unknown unknowns,” artificial intelligence and machine learning are better than even the most robust human-designed algorithms, said GoogleAI’s Carroll.
His team has developed DeepVariant, a germline variant caller distinguished by its best-in-class accuracy. The open source program is also extensible (it can be re-trained for new technologies without writing new software), and this is exactly what his team did in order to better handle HiFi data.
HiFi read errors are different from short-read errors, Carroll explained. Short-read data can lead to mapping complexity and coverage variability. HiFi reads are much more mappable and uniform, but can have noisier indel lengths in homopolymers, he said.
Carroll’s team fed the DeepVariant program millions of examples and labels from Genome in a Bottle to update weights in the model. Considering the range of uses and needs of PacBio users, they included data collected on both SMRT Cells 1M and 8M, featuring a variety of insert sizes and coverage levels.
Better training yields better results
The team saw somewhat improved SNP accuracy, which was already very high; substantially improved indel accuracy; and robust, more uniform coverage titrations. The developers were surprised to see that DeepVariant was also able to call some structural variants without specifically being trained to do so. And the improved DeepVariant 0.8 was able to confidently call regions that were previously deemed “difficult,” “non-confident,” or “non-callable.”
“AI-based programs actually benefit from more data and more difficult – and different – data,” Carroll said. “There are thousands more variants we can now call confidently.” Improved haplotype phasing and the ability to call variants in other HiFi data types are also on the horizon, Carroll said.
Watch the complete webinar and visit www.pacb.com/HiFi to learn more.
Variety is the spice of life, and one of the drivers of genetic variation is gene splicing.
After a gene is transcribed, there are alternatively spliced transcripts that add even more variety to that gene’s expression and its menu of phenotypes.
Some disorders appear to take advantage of this variety. Chief among them are myeloid disorders, where somatic mutations in splicing factors lead to cell proliferation in myelodysplastic syndromes (MDS) and blood cancers.
Christopher R. Cogle, a physician-scientist at the University of Florida, would like to understand why, in hopes that such knowledge could be used to develop new therapeutic strategies to target acute myeloid leukemia (AML).
With the help of the Icahn Institute for Data Science and Genomic Technology at Mount Sinai, the 2019 RNA Sequencing SMRT Grant recipient will be able to interrogate the differential isoforms within AML cell lines and test the effects of a novel splicing factor depletion agent his lab has created.
AML is the result of a multistep transforming process of hematopoietic stem and progenitor cells (HSPCs) which enables them to proceed through limitless numbers of cell cycles and to become resistant to cell death. Interference with DNA replication using a combination of chemotherapy drugs has been the mainstay in AML therapy for more than fifty years, but the relapse rate is still very high.
Cogle’s lab has found that some of these leukemia cells embed within blood vessels to protect themselves from these drugs, so he developed a vascular disrupting agent to disrupt this sanctuary. He treated around 40 patients with the therapy, with success, but also some side effects.
Seeking a similar but better-tolerated alternative, he returned to the lab, growing AML cells on endothelial cells and then testing the leukemia-killing activity of 31 million compounds. He identified several promising compounds that selectively killed AML cells within the vascular niche, while sparing endothelial cells and normal lymphocytes.
Extensive proteomic studies on one of the hit compounds showed that it binds and inhibits a splicing repressor. To understand the role of this splicing repressor in AML, Cogle’s team generated cell lines of human AML with and without knock-down of the splicing repressor and found that depletion of the splicing repressor leads to AML cell death and failure to engraft in mice. But downregulation of the splicing repressor in normal hematopoietic cells doesn’t affect cell viability or proliferation.
Indispensable – or not
Why is the splicing repressor seemingly indispensable in AML, yet dispensable in normal hematopoietic cells? This is the question Cogle is hoping the Iso-Seq method will answer.
He plans to use the PacBio RNA sequencing technique to compare and contrast gene expression and transcript isoform expression in AML cells versus normal HSPCs, with and without knock-down of the splicing repressor.
“We’ve used conventional RNA sequencing to examine the splicing repressor depleted AML cell lines, and wished we had longer and more reads to detect the full variety of isoforms under the control of the splicing repressor,” Cogle said.
The initial RNA sequencing data was able to illuminate some biology — and several gaps that Cogle hopes to fill with more robust Iso-Seq data. He will be working with his UF colleague Ana Conesa, who has experience in functional RNA splicing as well as Big Data, including the development of a newly released computational tool, tappAS.
“In order to get to the resolution needed, you need to move beyond conventional RNA sequencing. Iso-Seq will be an important tool in dissecting these splicing mechanisms and matching them to cancer phenotypes.”
Cogle’s ultimate goal is to get these new therapies into the clinic, and full-length transcript sequencing and isoform analysis will help this endeavor in many ways. It will help explain the oncogenic mechanisms of blood cancer and how his pharmacological agents work — information that can be used in validation studies, toxicology studies, designing bioactivity assays for early phase clinical trials, and expansion campaigns to identify additional compounds.
It will also help answer some fundamental biological questions.
“This is where PacBio will have one of its greatest impacts in science,” Cogle said. “It allows people to look at alternative splicing to a depth and breadth that conventional RNA sequencing cannot.”
We’re excited to support this research and look forward to seeing the results. Check out our website for more information on upcoming SMRT Grant Programs for a chance to win free sequencing. Thank you to our co-sponsor, the Icahn Institute for Data Science and Genomic Technology at Mount Sinai, for supporting the 2019 RNA Sequencing SMRT Grant Program.
Today we offer the final post in our blog miniseries about early access users’ experiences with the new Sequel II System. Shane McCarthy, a scientist at the University of Cambridge who was able to use the new sequencing system at the Wellcome Sanger Institute, gave a presentation on his experience generating data for tree-of-life sequencing projects.
McCarthy participates in several of these large-scale projects, such as the Vertebrate Genomes Project, the Sanger 25 Genomes Project, and the Darwin Tree of Life Project. For all of them, the goal is to produce high-quality, phased, chromosome-level assemblies with minimal gaps.
Through Sanger’s early access to the Sequel II System, McCarthy and his team were able to evaluate the new sequencing system’s performance on several animal genomes. These included fish (brown trout, sterlet, ploughfish, and milkfish), amphibians (Gaboon caecilian and common frog), and others; most had been sequenced previously so there were existing genomic resources to use for comparison.
The genomes were assembled with a mix of continuous long read (CLR) data and HiFi data, the latter of which is produced via circular consensus sequencing (CCS). For the CLR sequencing mode, the new SMRT Cell 8M yield was 80 Gb to 90 Gb. In CCS mode, the cells often produced more than 250 Gb of raw data. “We were quite happy” with the yields, McCarthy said, noting that the system performed consistently.
After giving an overview of his work, McCarthy dove into detailed looks at two of the fish samples to help webinar attendees understand the Sequel II System’s performance. For the sterlet, which has a genome made more challenging due to an unresolved whole genome duplication that left some residual tetraploidy, his team used two SMRT Cells of CLR data for the assembly. They compared the results for this fish to previous assemblies of its parents, using trio binning to assign haplotypes to their maternal or paternal origin. A BUSCO analysis found that more than 92% of genes were complete in each haplotype, a level that McCarthy considers very good at this stage of the assembly. He also presented data on milkfish, which similarly led to strong results (at least 95% of genes were complete) from BUSCO analysis.
McCarthy noted that data from these projects are being made available through the VGP. As for the Sequel II System, he concluded, “it’s a huge leap in scaling and affordability for these tree-of-life genome assembly projects.”
For more details, watch McCarthy’s full presentation.
A new publication released in PLOS One from scientists at the Mayo Clinic offers a great look at our CRISPR/Cas9-based, amplification-free targeted sequencing method and its utility for accurately sizing a clinically important repeat expansion.
“Amplification-free long-read sequencing of TCF4 expanded trinucleotide repeats in Fuchs Endothelial Corneal Dystrophy” comes from lead author Eric Wieben, senior author Michael Fautsch, and collaborators. This is the second group to use the amplification-free technique for this disease; the first performed their work on a PacBio RS II System, while this team used the newer Sequel System.
What makes the disease such an interesting target for this approach? While Fuchs endothelial corneal dystrophy (FECD), a late-onset degenerative eye disease, affects just 4% of Caucasians in the U.S., more than 75% of those cases can be traced to an expansion of a CAG repeat found in the TCF4 gene. That makes FECD “the most common disease that is attributable to the expansion of a trinucleotide repeat,” according to the paper. Intriguingly, Mayo Clinic investigators have found that a fraction of patients with the repeat expansion don’t develop the disease; in this project, they aimed to test their hypothesis that interruptions in the repeat sequence may explain the phenomenon.
But identifying interruptions required sequencing the entire length of the repeat expansion, something that could not be done with PCR amplification due to its likelihood of introducing confounding artifacts in sequencing data. The team turned to PacBio’s recently launched amplification-free protocol (we call it No-Amp), which uses the CRISPR/Cas9 system to target specific sequences of interest. “This method permits the enrichment and direct sequencing of targeted sequences without PCR amplification,” the scientists write, adding that SMRT Sequencing technology “also permits the generation and analysis of full-length sequences from even expanded repeats.”
To evaluate this method, scientists compared results to those from an STR assay and from Southern blots. The data were highly concordant: all of the amplification-free “size estimates for sub-pathological length repeats match the STR results within 1 repeat triplet,” they report. Also, “the sequencing was successful in identifying a previously described interruption within an unexpanded allele and provided sequence data on expanded alleles greater than 2000 bases in length,” the team notes.
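As a rough illustration of what sizing a repeat and checking it for interruptions involves once a read spans the whole tract, the sketch below walks a sequence triplet by triplet. It is not the analysis pipeline used in the paper: the function name `scan_repeat`, the rule that a tract starts at two consecutive motif copies, and the choice to tolerate only single-triplet interruptions are all assumptions made for the example, and the input sequence is invented.

```python
def scan_repeat(seq, motif="CAG"):
    """Count motif units in a repeat tract and list single-triplet interruptions.

    Returns (units, interruptions), where each interruption is a tuple of
    (number of motif units seen before it, interrupting triplet).
    Interrupting triplets are not counted as repeat units.
    """
    seq = seq.upper()
    k = len(motif)
    start = seq.find(motif * 2)  # assumption: two adjacent copies anchor the tract
    if start == -1:
        return 0, []
    units = 0
    interruptions = []
    i = start
    while i + k <= len(seq):
        unit = seq[i:i + k]
        if unit == motif:
            units += 1
        elif seq[i + k:i + 2 * k] == motif:
            # A lone non-motif triplet followed by more repeat: an interruption.
            interruptions.append((units, unit))
        else:
            break  # the tract has ended
        i += k
    return units, interruptions


# Invented example: five CAG units, one CAA interruption, three more CAG units.
read = "TTT" + "CAG" * 5 + "CAA" + "CAG" * 3 + "GGT"
units, interruptions = scan_repeat(read)
```

A real analysis would also have to handle sequencing errors, both strands, and the flanking sequence around the tract; the point here is only that full-length reads make both the unit count and any interruptions directly readable from a single molecule.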
While this study found no novel repeat interruptions that might explain why certain individuals do not develop FECD, it did generate some interesting results. First, two samples were found to have “novel variation in the AGG repeats that immediately precede the CAG repeats,” which could help scientists hone their hypothesis about these patients. Another intriguing discovery was the heterogeneity in repeat lengths of the expanded allele. “Given that these samples were not PCR amplified, this suggests somatic instability of the expanded repeat sequence and consequent mosaicism within the population of leukocytes used for the analysis of each specimen,” the scientists report. “In contrast, there is very little heterogeneity of subpathogenic alleles (<40 repeats).”