This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
We’re pleased to release a new data set along with an allele phasing GitHub software workflow for those interested in exploring SMRT Sequencing data from an Alzheimer’s disease candidate gene study. Our team collaborated with Integrated DNA Technologies (IDT) to design a 35-gene panel targeting candidate Alzheimer’s disease genes identified as potential genetic risk loci across many GWAS and linkage studies. Long-read PacBio sequencing was applied to brain and skeletal tissue from two individuals diagnosed with Alzheimer’s disease, and a wide range of variants was detected, from SNPs and indels to larger structural variations up to several kilobases in size. Additionally, alleles were successfully phased, which provides a more comprehensive understanding of the biological significance of the variants present in the samples. Here’s an example screenshot of a BIN1 gene phased into two phase blocks across a 62,641 bp region:
The samples were sequenced using the Sequel System (Sequel Chemistry 1.2) and analyzed with our newly updated Phasing Consensus Analysis for Targeted Sequencing Data GitHub repository. Data sets and related files are available on our PacBio DevNet. Captures of 7 kb genomic fragments for brain and skeletal muscle tissues were each sequenced on a single SMRT Cell, yielding roughly 8 GB of data mappable to the human reference genome.
For more about this data collection, don’t miss the upcoming webinar, “Characterizing Alzheimer’s disease candidate genes and transcripts with targeted, long-read, single-molecule sequencing,” hosted by IDT on Wednesday, September 27th. We will dive deep into the project and illustrate how coupling genomic and transcriptomic captures with xGen® Lockdown® probes enables informative results and insights beyond SNPs.
Register now to attend at 7:00 am PDT/10:00 am EDT or at 11:00 am PDT/2:00 pm EDT.
A new publication from scientists at The Rockefeller University and PacBio presents reference-grade, phased diploid genome assemblies for two important avian models for vocal learning, Anna’s hummingbird and zebra finch. Results are expected to help establish genome quality standards for the G10K and B10K sequencing projects, in addition to providing a better foundation for neuroscience studies.
Published in GigaScience, “De Novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads” comes from lead author Jonas Korlach, senior author Erich Jarvis, and collaborators. The team undertook this project to improve the quality of genome assemblies available for these birds, demonstrating that key genes of interest were completely represented in single contigs. Existing assemblies produced with Sanger or short-read sequencing were incomplete and highly fragmented, precluding the comprehensive scientific view required for a deeper understanding of vocal learning.
By incorporating SMRT Sequencing, the team not only raised the bar for assembly quality but also phased the genomes using FALCON-Unzip, a diploid-aware assembly tool. The new zebra finch assembly represented “a 108-fold reduction in the number of contigs and a 150-fold improvement in contiguity compared to the current Sanger-based reference,” the authors write. For hummingbird, the PacBio assembly led to “a 116-fold reduction in the number of contigs and a 201-fold improvement in contiguity over the reference.” Both assemblies had contig N50s greater than 5 Mb. “These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references,” the scientists report, “including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult to sequence regions, complex repeat structure errors, and allelic differences between the two haplotypes.”
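For readers less familiar with the contiguity figures quoted above, here is a minimal sketch of how a contig N50 and a fold reduction in contig count can be computed. The function names and the contig lengths are illustrative assumptions, not values or code from the paper.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half the assembly size."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fold_reduction(old_count, new_count):
    """Fold reduction in contig count between two assemblies (old / new)."""
    return old_count / new_count


# Hypothetical contig lengths, for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
print(fold_reduction(1000, 10))
```

A higher N50 with fewer contigs means more of the genome sits in long, uninterrupted sequences, which is the sense in which the authors report improved contiguity.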
The team assessed gene content of the assemblies with CEGMA and BUSCO comparisons. In both cases, the number of complete or nearly complete genes increased. They also used RNA-seq to evaluate the reference genomes, finding that the PacBio long-read assemblies increased “total transcript read mappings compared to the Sanger-based reference … suggesting more genic regions available for read alignments,” they write.
Finally, the scientists conducted in-depth interrogations of four genes particularly important for vocal learning. EGR1, for instance, has gaps in previous zebra finch and hummingbird reference genomes. In both SMRT Sequencing assemblies, though, the gene was fully resolved and spanned in a complete contig. There were similar improvements for DUSP1, FOXP2, and SLIT1.
“We found that the long-read diploid assemblies resulted in major improvements in genome completeness and contiguity, and completely resolved the problems in all of our genes of interest,” the scientists report. “We now, for the first time, have complete and accurate assembled genes of interest that can be pursued further without the need to individually and arduously clone, sequence, and correct the assemblies one gene at a time.”
For more, check out our recent release of Iso-Seq data for hummingbird and zebra finch.
In a recent BMC Genomics paper, scientists in the Netherlands report a high-quality genome assembly for Folsomia candida, a soil-dwelling arthropod. The organism, which is known for reproducing parthenogenetically (and only when infected with Wolbachia), is frequently used in the lab for toxicity testing.
Lead author Anna Faddeeva-Vakhrusheva, senior author Dick Roelofs, and collaborators at Vrije Universiteit Amsterdam and other institutions describe their findings in “Coping with living in the soil: the genome of the parthenogenetic springtail Folsomia candida.” The team chose SMRT Sequencing to characterize the genome so they could learn more about the organism’s reproductive process and stress response.
F. candida has a diploid genome with seven pairs of chromosomes. The scientists generated a 221.7 Mb assembly with a contig N50 of 6.5 Mb. It is remarkably complete, with just 0.1% of all bases marked by gaps. Analysis revealed that repeat segments comprise more than 23% of the genome, and GC content was more than 37%. The team performed a number of quality-control and validation steps, concluding that assembly quality was excellent. The assembly also included the complete 15 kb F. candida mitochondrial genome.
The team was particularly interested in genome content acquired through horizontal gene transfer. A systematic analysis of all genes predicted by the assembly identified more than 800 acquired genes, most of which came from bacteria, fungi, and protists. The complement of horizontally transferred genes was impressive: “This number is among the highest found in metazoan genomes, being only exceeded in rotifers and some nematode species,” the scientists report.
Another highlight of the study came from a focus on F. candida’s endosymbiont Wolbachia. “Parthenogenesis is most likely imposed by Wolbachia,” the team writes. “The presence of Wolbachia is essential for reproduction: animals cured of Wolbachia by antibiotic treatment lay eggs that fail to hatch and develop.” The arthropod sequencing effort also yielded a complete assembly of the endosymbiont, which, at 1.8 Mb, has the largest genome of any Wolbachia strain discovered so far. Forty-eight genes were found to harbor ankyrin repeats, which are known for “mediating protein-protein and protein-DNA interactions with the host cells,” the scientists note.
Intriguingly, the team identified a functional antibiotic biosynthesis cluster, “suggesting the production of yet undiscovered antimicrobial compounds in an animal genome,” they conclude. “This high quality genome will be instrumental for evolutionary biologists investigating deep phylogenetic lineages among arthropods and will provide the basis for a more mechanistic understanding in soil ecology and ecotoxicology.”
A panel session at the recent Precision Medicine Leaders Summit, held in San Diego last month, offered great perspectives on the need to better represent global ethnic diversity in order to make the most of genomic advances for all patients.
Panelists included Robert Sebra from the Icahn School of Medicine at Mount Sinai; NCBI’s Valerie Schneider; Benedict Paten from the University of California, Santa Cruz – representing the Global Alliance for Genomics and Health; and Justin Zook, co-leader of the NIST Genome in a Bottle (GIAB) Consortium. The discussion was moderated by our own Luke Hickey.
The session kicked off with a look at a study published in the New England Journal of Medicine that found a greater number of incorrect genetic test results in black Americans than in white Americans for an inherited heart disorder. Along with other examples, that provided a good foundation for a conversation about the risk of health disparities based on genomic data. The speakers also discussed how the human reference genome and other sources contribute to genetic bias.
Clearly, benefits from precision medicine should be equally available to people from all ethnic groups. The panel talked about ongoing efforts to improve the human reference genome and other resources by including more ethnic diversity, as well as recent efforts to establish new population-specific reference genomes. Examples included GIAB projects to sequence trios, resulting in high-quality Ashkenazim Jewish ancestry genome assemblies, and international programs that have recently presented excellent assemblies for Korean, Chinese, Japanese, African, and Danish individuals.
Looking to the future, panelists spoke about improving study and test design to represent diversity. They also discussed how the community can work to make precision medicine more accurate for all ethnic groups, including data-sharing programs and more.
A compelling new paper from scientists at the Parkinson’s Institute and Clinical Center, Houston Methodist Research Institute, and several other organizations demonstrates the importance of fully sequencing repeat expansion regions for a clearer understanding of the underlying biology of the diseases they cause. This publication also offers a look at how CRISPR/Cas9 capture can be used in combination with SMRT Sequencing to access the expanded repetitive region at a base level resolution without any PCR bias.
“Parkinson’s disease associated with pure ATXN10 repeat expansion” comes from lead authors Birgitt Schüle and Karen McFarland, senior author Tetsuo Ashizawa, and collaborators. The study involved a Mexican family with one individual previously diagnosed with Parkinson’s disease and several members with spinocerebellar ataxia.
Clinical genetic testing had found an ataxia-associated pentanucleotide repeat expansion in the patient with Parkinson’s, and this team hoped to learn more. “To further genetically characterize the ATXN10 repeat expansion and to better understand the phenotypic differences of progressive cerebellar ataxia with seizures and parkinsonism,” they write, “we employed several advanced and novel molecular genetic techniques to dissect the genetic structure of the repeat expansion in this family.”
Among those techniques was a new method that combined the sequence-specific endonuclease activity of the CRISPR/Cas9 system with long-read SMRT Sequencing. The team reports that they were able to use this method to snip out genomic ATXN10 repeat expansion regions, some spanning up to 7 kb in length, and sequence them “as one continuous fragment without prior amplification of the genomic DNA.” This was done for six family members, with results indicating that most affected family members had a string of 480 ATTCT repeats followed by about 920 ATTCC repeat interruptions. Strikingly, the family member with ataxia and parkinsonism had a different expansion: more than 1,300 ATTCT repeats but no ATTCC repeats. “We propose that the absence of repeat interruptions play a role in the underlying disease process acting as a genetic modifier and leading to the clinical presentation of L-Dopa responsive parkinsonism,” the scientists write, adding that the repeat interruptions may contribute to the development of epilepsy.
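The repeat structure described above (a run of ATTCT units, with or without ATTCC interruptions) can be illustrated with a small sketch that collapses a sequence into runs of repeat units. This is a toy example on an error-free string, not the authors' analysis method; the function name `tandem_runs` and the test sequences are assumptions made for illustration, and real long reads would need error-tolerant matching.

```python
def tandem_runs(seq, units=("ATTCT", "ATTCC")):
    """Collapse a sequence into (unit, count) runs of back-to-back repeat units.

    Assumes all units have the same length and the sequence is error-free.
    """
    k = len(units[0])
    runs = []
    prev_end = -1  # position just past the last matched unit
    i = 0
    while i + k <= len(seq):
        chunk = seq[i:i + k]
        if chunk in units:
            # Extend the current run only if this unit directly follows it.
            if runs and runs[-1][0] == chunk and prev_end == i:
                runs[-1][1] += 1
            else:
                runs.append([chunk, 1])
            i += k
            prev_end = i
        else:
            i += 1  # skip flanking or non-repeat bases
    return [(unit, count) for unit, count in runs]


def has_interruption(runs, interruption="ATTCC"):
    """True if any run consists of the interrupting unit."""
    return any(unit == interruption for unit, _ in runs)


# Hypothetical, heavily shortened sequences for illustration only.
interrupted = "GG" + "ATTCT" * 3 + "ATTCC" * 2 + "AA"
pure = "ATTCT" * 4
print(tandem_runs(interrupted), has_interruption(tandem_runs(interrupted)))
print(tandem_runs(pure), has_interruption(tandem_runs(pure)))
```

Because each molecule is sequenced as one continuous fragment, a summary like this can in principle be read off a single read rather than inferred from many short ones.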
“Single molecule sequencing paired with SMRT/Cas9 capture approach allowed us to characterize the genetic composition of the complete repeat expansion which revealed a novel phenotype-genotype correlation for Parkinson’s disease and ATXN10,” the team adds, highlighting the importance of adding to existing knowledge of repeat expansion types and possible phenotypes. “We conclude that the underlying genetic architecture of ATXN10 repeat expansions is critical for presentation of clinical phenotypes and presumably also the underlying pathology.”
A recent paper in the journal Angewandte Chemie describes using SMRT Sequencing to characterize biosynthesis of a psychotropic product in Psilocybe carpophores, better known as magic mushrooms. Scientists from the Hans Knöll Institute in Germany report that the work could pave the way to synthetic production for pharmaceutical use.
“Enzymatic Synthesis of Psilocybin” comes from Janis Fricke, Felix Blei, and Dirk Hoffmeister. The team aimed to uncover the enzymatic mechanisms of biosynthesis for psilocybin, culminating in the characterization of four related enzymes: PsiD, PsiK, PsiM, and PsiH. “In a combined PsiD/PsiK/PsiM reaction, psilocybin was synthesized enzymatically in a step-economic route from 4-hydroxy-l-tryptophan,” the authors write.
Scientists used PacBio sequencing to analyze Psilocybe cyanescens, resulting in a 61.3 Mb assembly with just 217 contigs (meanwhile, a short-read assembly of a closely related mushroom for the same project required more than 2,900 contigs to represent just 41.3 Mb). After identifying the genes involved in producing psilocybin, the team validated the work by splicing them into E. coli and confirming the biosynthesis event.
Since its structure was first characterized in 1959, scientists have been seeking ways to synthesize psilocybin — but without success. As the study authors note, their new results finally “may lay the foundation for its biotechnological production.”
In an article from Chemical & Engineering News, the University of Minnesota’s Courtney Aldrich said the discovery will be important “for developing a fermentation process for production of this powerful psychedelic fungal drug.”
If you’re interested in avian vocal learning or want to explore a PacBio Iso-Seq data set generated with the Sequel System, we have good news. We’ve just released data from Iso-Seq interrogations of brain tissue from two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), sequenced in collaboration with the Erich Jarvis and Olivier Fedrigo labs at the Rockefeller University.
If you’re not familiar with the Iso-Seq method, it’s the long-read sequencing answer to short-read RNA-seq studies. By using SMRT Sequencing for a transcriptome project, scientists can generate full-length isoform data, clearly capturing alternative splicing events to see the real diversity of transcripts. Unlike RNA-seq approaches, the Iso-Seq method takes advantage of long-read data to fully span transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for error-prone transcript reconstruction and inference processes. With the Sequel System, Iso-Seq projects are low cost and time efficient. Currently we recommend only 1-2 SMRT Cells per tissue type for genome annotation.
For this data set, we used the Iso-Seq method to characterize the transcriptomes of two birds, with brain total RNA. The two species’ brain samples were barcoded, pooled, and sequenced using 4 SMRT Cells on the Sequel System. An average of ~460,000 reads was generated per SMRT Cell; total sequencing data yields ranged from 6.1 to 7.7 Gb per SMRT Cell. More than 15,000 isoforms were identified in each species, including thousands that had not been previously annotated in each bird and 400 to 500 new genes.
The data set contains both the raw pooled sequences and the processed, demultiplexed sequence files, separated by species and excluding any raw sequences not containing barcodes. Our initial analysis of these data is described in this poster (Vierra et al.), which is being presented this week at the Genome 10K and Genome Science Conference at the Earlham Institute. It demonstrates how improved loading on the Sequel System simplifies library prep and how both command-line and new SMRT Link tools can be used for analysis. It also illustrates how full-length transcript data can help identify additional exons and UTRs.
Enjoy the data!
If you use the data and our analyses in your publication before we complete our study, please cite:
Michelle N. Vierra, Sarah B. Kingan, Elizabeth Tseng, Tyson Clark, Ting Hon, William J. Rowell, Jacquelyn Mountcastle, Olivier Fedrigo, Erich D. Jarvis, Jonas Korlach. From RNA to Full-Length Transcripts: The PacBio Iso-Seq Method for Transcriptome Analysis and Genome Annotation. Genome10K and Genome Science Conference Abstracts 2017.
A new paper in Scientific Reports presents results from a transcriptome analysis for Oryctolagus cuniculus. The work was done with SMRT Sequencing, which allowed scientists to discover novel transcripts and increase the diversity of known transcripts for the rabbit.
“A transcriptome atlas of rabbit revealed by PacBio single-molecule long-read sequencing” comes from lead authors Shi-Yi Chen and Feilong Deng, senior author Song-Jia Lai, and collaborators at Sichuan Agricultural University. In the paper, the scientists note that an ongoing challenge in rabbit studies has been the dearth of gene-level data. “Most of the existing gene models are just derived from in silico prediction with lack of the reliable annotation on alternative isoforms and untranslated regions,” they write.
The team turned to SMRT Sequencing to generate full-length transcripts and avoid the well-known assembly pitfalls of short-read transcript data. “We employ this technology to sequence polyadenylated RNAs of rabbit and provide a transcriptome-wide landscape in relation to gene models and alternative isoforms,” they report.
The scientists pooled and sequenced RNA samples from several organs and tissues collected from three New Zealand white rabbits. After filtering, they were left with more than 36,000 high-confidence transcripts from nearly 15,000 genes. That included quite a bit of novel information: “more than 23% of genic loci and 66% of isoforms have not been annotated yet within the current reference genome,” the scientists write. Their interest in alternative splicing was rewarded as well, with the final transcriptome containing nearly 25,000 alternative splicing events and more than 11,000 alternative polyadenylation events. Those numbers represent an order of magnitude more alternative splicing than was characterized in the reference gene models. The project also turned up a significant amount of non-coding RNAs, represented by 17% of transcripts.
The scientists followed up on these findings with several validation studies, including an analysis of genes in the major histocompatibility complex. Their analysis demonstrates “the obviously improved power of PacBio transcripts for recovering the highly homologous sequences among ten MHC genes than the assembled transcripts from short reads,” they report.
According to the paper, scientists achieved their mission of more thoroughly characterizing the rabbit transcriptome. “The length distribution of the most 5′ exons of our PacBio transcripts is consistent with former report in human,” they write, “which would indicate the comparable sequencing completeness in rabbit.”
We’re looking forward to the discussion of many more vertebrate species’ genomes at the upcoming Genome 10K and Genome Science Conference 2017 hosted this week by The Earlham Institute.
Sniffles and NGMLR, structural variant detection and alignment algorithms developed in the Schatz lab for long-read sequence data, are already familiar to many in the PacBio community. Now, a preprint is available so users can see how these open-source tools perform in a variety of conditions.
“Accurate detection of complex structural variations using single molecule sequencing” comes from lead author Fritz Sedlazeck at Baylor College of Medicine, senior author Michael Schatz at Johns Hopkins University, and collaborators. The team notes that long-read sequencing has introduced a much more comprehensive means of discovering structural variants, many of which are missed by short-read sequence data. To take advantage of that capability, the scientists developed NGMLR, “a fast and accurate aligner for long reads,” and Sniffles, which “successively scans within and between the alignments to identify all types of [structural variants],” according to the paper. Sniffles is unique in its ability to routinely detect nested variants.
The scientists describe evaluating the performance of these tools for structural variant discovery using data from several different sequencing platforms. The tools were tested on data from a breast cancer genome, healthy human genomes, and Arabidopsis. Using a simulated human data set, the team found that NGMLR and Sniffles outperformed other algorithms such as BWA-MEM and PBHoney, detecting nearly 95% of structural variants with no false discoveries. While more than 94% of variants called by PacBio were confirmed by other platforms, the scientists report that “Oxford Nanopore had substantially worse concordance. … This systematic bias for deletions in the Oxford Nanopore data is most likely an error in the base calling.”
Sedlazeck et al. also found a concerning trend in structural variant calls when using short-read data. “Using the short-read approach we detect, on average, 27 times more translocation events compared to using Sniffles within presumably healthy human data sets,” they note. An investigation into this phenomenon determined that mis-mapping of short reads in low-complexity regions leads to insertions being misidentified as translocations. “Overall, we could rule out 1,869 (83.18%) of the Illumina-based translocation calls as false,” they report.
Finally, the scientists assessed how much coverage is necessary to see the full picture of structural variation. For a healthy human genome, 15-fold coverage of SMRT Sequencing “has a precision of ~80% and recall of 69.64%,” they write. Boosting that to 30-fold coverage achieved similar completeness for the much more complex cancer genome. “This translates to a potential price reduction of several tens of thousands of dollars per sample,” they add. “These requirements will be reduced even more in the years to come as the throughput and read length increase and sequencing error rates decrease.”
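As a reminder of what the precision and recall figures above measure, here is a minimal sketch using the standard definitions. The counts are hypothetical and chosen only for illustration; they are not taken from the preprint.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a set of variant calls.

    precision = fraction of reported calls that are real variants
    recall    = fraction of real variants that were reported
    """
    called = true_pos + false_pos
    real = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / real if real else 0.0
    return precision, recall


# Hypothetical counts: 80 correct calls, 20 false calls, 35 missed variants.
precision, recall = precision_recall(80, 20, 35)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Lower coverage typically costs recall first, since variants supported by too few reads go uncalled, which is why the coverage needed for a given recall drives the per-sample cost.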
“The versatility of these methods enables an unprecedented view into structural variations in the human genome and other genomes from long read single molecule sequencing data,” the scientists write. They predict that these and related improvements “will usher in a new era of high quality genome sequences for a broad range of research and clinical applications, and lead to new insights into polymorphic variation, pathogenic conditions, and the forces of evolution.”
Scientists from the University of Hong Kong recently reported results of a head-to-head comparison of long-read and short-read platforms for sequencing and assembly of a bacterial genome. They determined that only SMRT Sequencing was capable of generating highly accurate, complete assemblies. “Completing bacterial genomes should no longer be regarded as a luxury, but rather as a cost-effective necessity,” the team reports.
“PacBio But Not Illumina Technology Can Achieve Fast, Accurate and Complete Closure of the High GC, Complex Burkholderia pseudomallei Two-Chromosome Genome” was published in Frontiers in Microbiology by lead author Jade Teng, senior author Patrick Woo, and collaborators. For this project, scientists compared performance of the PacBio RS II Sequencing System with the Illumina HiSeq 1500. Their target was Burkholderia pseudomallei, which has at least 68% GC content as well as “highly repetitive regions and substantial genomic diversity,” the authors report.
After sequencing, the team attempted both hybrid and single-source assemblies. Working with Illumina data alone “resulted in a draft genome with more than 200 contigs,” they note, pointing out that the platform’s reliance on PCR amplification is inherently problematic for GC-rich genomes. Three different short-read assemblers were not able to improve results. The hybrid assembly of both sequencers’ data was also “not successful,” producing 74 contigs, the team reports.
Assembling only PacBio data, which was generated from a single SMRT Cell, led to a very different result. The approach “achieved complete closure of this two-chromosome B. pseudomallei genome without additional costly bench work and further sequencing, demonstrating its utility in the complete sequencing of bacterial genomes, particularly those that are well-known to be difficult-to-sequence,” the scientists write. The chromosome contigs of the assembly aligned to the organism’s reference genome with better than 99.9% accuracy. Importantly, the assembly accurately characterized “the number of CDSs and their distributions in each subsystem, four ribosomal operons, the highest number of core and virulence proteins (coverage of query protein sequence and amino acid identity ≥80%), and MLST gene loci,” the team adds.
The Illumina assembly, on the other hand, was unable to resolve these elements. “Extraordinarily high coverage of Illumina reads were observed in several collapsed repeat regions, including regions containing varying copies of mobile element proteins and ribosomal operon,” Teng et al. report. “We reasoned that Illumina sequencing was not able to resolve these repeat regions as their sequence reads were not long enough to span different kinds of repeats with unique flanking sequences.”
The scientists also included an assessment of project cost. “To completely sequence a bacterial genome using Sanger sequencing or the second generation sequencing platforms, the main bulk of the cost, labor and time is spent in the gap-filling phase,” they write. “It has been estimated that when using these second generation sequencing platforms, around 95% of the money and time are spent in completing the last 1% of the bacterial genome.” But the calculation is very different for SMRT Sequencing. “Although the cost per base is more expensive for the PacBio RS II platform compared to short-read sequencing technology, no additional manual work after de novo assembly is required,” the team concludes, “and the benefit of obtaining an accurate number of individual replicons and an intact assembly of repetitive regions and mobile genetic elements justify the initial cost.”
Earlier this year, scientists from Korea reported results from a transcriptome study of Pacific abalone. In this paper, the team used SMRT Sequencing to demonstrate that alternative splicing and gene expression have sex-specific signatures in these organisms.
“Alternative Splicing Profile and Sex-Preferential Gene Expression in the Female and Male Pacific Abalone Haliotis discus hannai” comes from lead authors Mi Ae Kim and Jae-Sung Rhee, senior author Young Chang Sohn, and collaborators. They focused on abalone, a marine gastropod, because of its importance to Korean aquaculture: the species they studied is estimated to represent about 10,000 metric tons of production each year.
As H. discus hannai has grown in economic importance, new genomics resources have become available, including a genetic linkage map and some RNA-seq data. This latest project was designed to glean more information about abalone to provide a clearer view of its biological function. Scientists chose to focus on sex-specific transcriptomes, using the Iso-Seq method to study gene content in male and female members of the species. They analyzed several tissue types (including gonads, muscle, gills, and more) and defined 15,110 protein-coding genes in females and 12,145 in males. Of those, 519 genes in females and 391 genes in males produced alternatively spliced transcripts.
To validate these findings, the team investigated expression profiles for six genes known to be sex-preferential in related organisms. Two of the three female-specific genes and all three male-specific genes were highly expressed in their respective samples. “Taken together, these studies strongly suggest the intactness of the sex-specific isoform DB of the Pacific abalone,” the authors write.
“The information obtained in this study represents the first significant contribution to sex-specific genomic resources, as well as isoform information,” the scientists conclude. “These data will provide an essential genomic reference that could be used for further diverse genetics- and physiology-based research using abalones.”
If you weren’t at the 36th International Society for Animal Genetics Conference in Dublin, you missed more than a chance to drink Guinness and practice an Irish brogue. The PacBio team had a great time at ISAG, learning about the latest in animal science and updating attendees on the advantages of SMRT Sequencing for generating high-quality genome assemblies and annotations.
The conference drew more than 750 scientists from around the world, and we were truly impressed by the quality of research they presented in talks and posters. Long-read PacBio sequencing is already making a difference for scientists in this community, many of whom are focused on improved breeding programs. Genome assemblies powered by SMRT Sequencing were presented for many economically important species, including chicken, sheep, goat, pig, cattle, horse, camel, and Atlantic herring. There were also several presentations featuring PacBio long-read sequencing data for immune region haplotypes such as the leukocyte receptor complex and the major histocompatibility complex.
We hosted a morning seminar that demonstrated how SMRT Sequencing provides comprehensive views of animal genomes and transcriptomes. John Williams from the University of Adelaide presented a preliminary assembly of the water buffalo genome generated with Sequel System data. A member of the International Buffalo Genome Consortium, Williams described limitations in contiguity and completion for previous sequencing efforts that used short-read data for this important livestock animal. Seeking a reference-grade assembly, the scientists turned to PacBio long reads, using FALCON-Unzip to phase more than half of the diploid genome. Though the assembly is not yet polished, Williams reported a stellar contig N50 of 18.7 Mb. The other seminar speaker was our own Emily Hatas, who discussed the chicken genome annotation generated by Richard Kuo at the Roslin Institute. In that project, scientists used the Iso-Seq method with SMRT Sequencing to identify 64,000 transcripts, including more than 17,000 long non-coding RNAs that had not been previously annotated.
As sponsors of the event, we also had fun encouraging attendees to snap creative photos with the toy animals we gave away at our booth. Check out the variety of clever snapshots!
We were delighted to be back at the University of Maryland this summer for our annual East Coast User Group Meeting. The day-long event, preceded by half-day workshops on sample prep and bioinformatics, exceeded our expectations. From the packed session hall to the terrific science and great discussions, the UGM facilitated the exchange of best practices and new suggestions for optimizing SMRT Sequencing performance for a variety of applications. Below is a recap of the day’s highlights, with several of the presentations available to download.
PacBio scientist Aaron Wenger presented the Structural Variant Calling application that is included in the SMRT Link v5.0 software release. The application utilizes the read aligner NGM-LR, and features both a command-line tool called pbsv and a web interface. Noting that most of the genetic difference between any two people lies in structural variation, he showed that short-read sequencers cannot detect the vast majority of these important variants. Wenger demonstrated that even low-coverage SMRT Sequencing can be used to discover structural variants; in an experiment, 10-fold coverage revealed almost 100% of homozygous variants and nearly 90% of heterozygous variants in a human individual.
Michael Schatz from Johns Hopkins University gave a talk entitled “In Pursuit of Perfect Genome Sequencing” in which he walked through three key metrics for evaluating genome quality: correctness (basepair accuracy), completeness (no gaps in the sequence), and contiguity (sequence ordered as on the physical chromosomes). Schatz compared the leading sequencing technologies available today, and explained that PacBio SMRT Sequencing is the most capable technology for all three metrics.
Continuing the human genome theme, Ricardo Mouro Pinto from Massachusetts General Hospital spoke about using SMRT Sequencing to quantify CAG repeat instability in Huntington’s disease. Caused by a CAG repeat expansion, Huntington’s occurs when a person’s genome harbors 40 or more copies. Pinto noted that typically, the longer the repeat, the younger the person is at disease onset. The Huntington’s locus is difficult to enrich because it is resistant to PCR amplification. By using Cas9 digestion to perform non-amplification-based target enrichment followed by PacBio sequencing, Pinto’s team was able to capture wild type and disease alleles with no amplification bias. He noted that results are preliminary, and he hopes to expand the number of samples studied to get a better handle on CAG instability.
Representing the plant community, Hamid Ashrafi and Hamed Bostan from North Carolina State University tag-teamed a presentation on the blueberry genome and transcriptome. The fruit plant naturally occurs in diploid, tetraploid, and hexaploid genomes. The scientists generated a high-quality diploid assembly using SMRT Sequencing and noted that long reads were essential to get through the highly repetitive genome. Next, they used Iso-Seq to study several types of tissue from diploid, tetraploid, and hexaploid blueberry plants, finding many transcripts missed by short-read sequence data. Using both genome and transcriptome approaches was particularly important, Ashrafi noted, because SNPs explain only a small portion of natural variation for this plant, and he believes that alternative splicing and structural variants likely contribute a much larger proportion of variation. The team is still analyzing results but said that switching to long reads was “a dream come true.”
On the microbial front, Jethro Johnson from The Jackson Laboratory for Genomic Medicine gave a talk on full-length 16S rRNA sequencing, which is useful for taxa identification. With genotyping or short-read data, Johnson said, so much of the information in the variable regions of 16S is missed that it is often impossible to accurately classify organisms. So, Johnson turned to SMRT Sequencing and circular consensus sequencing (CCS), which generates highly accurate long reads. Johnson applied CCS to a mock bacterial community of 36 species and found that SMRT Sequencing offered accurate results for identification. In studies of fecal samples, PacBio sequencing was able to provide a unique identification in cases where short-read sequencing generated ambiguous results. The team is now expanding SMRT Sequencing results to include internal transcribed spacer regions.
In a separate presentation, Phillip Tai from the University of Massachusetts Medical School highlighted the use of long-read sequencing for genome population sequencing of adeno-associated viruses. These harmless viruses have gained new interest recently as a vector for gene therapies, so Tai’s lab is interested in analyzing large groups of them to filter out any that would not be ideal vectors. By applying SMRT Sequencing to recombinant AAVs, they generate complete resolution of the vector genome, including the difficult-to-sequence inverted terminal repeats. This accomplishment could have tremendous value in the gene therapy field, he said.
The meeting also included some new tools and protocols from the community. New England Biolabs’ Bo Yan presented SMRT-cappable-seq, a method for characterizing operons across an entire bacterial genome. It involves capping the 5’ end of bacterial primary transcripts and using SMRT Sequencing to produce full-length transcripts. Yan said the protocol increases library prep efficiency and accurately defines and links the transcription start site and transcription termination site (something short reads cannot do). A validation project in E. coli revealed 840 novel operons, extending 40% of annotated operons in RegulonDB. In another talk, Manuel Tardaguila from the University of Florida discussed SQANTI, a new tool to perform quality control for long-read transcripts. The pipeline performs classification, curation, and quantification of transcripts to filter out any artifacts and ensure that scientists analyze only the highest-quality results. SQANTI incorporates PacBio data, a reference genome, and other resources to conduct its rigorous evaluation.
We’d like to thank our hosts for the meeting, the Genomics Resource Center, Institute for Genome Sciences at the University of Maryland, as well as our partners: Advanced Analytical Technologies, Diagenode, and Sage Science. And, of course, thanks to all the scientists who took time out of their busy schedules to make this event a success!
A paper from scientists at the National Marrow Donor Program, Center for International Blood and Marrow Transplant Research, Fred Hutchinson Cancer Research Center, and other institutions reports the use of SMRT Sequencing to characterize the challenging killer cell immunoglobulin-like receptor (KIR) region in eight human genomes. By sequencing full-length fosmids, they found previously unreported haplotype structures.
“Revealing Complete Complex KIR Haplotypes Phased By Long-Read Sequencing Technology” comes from lead author David Roe, senior author Martin Maiers, and collaborators. They targeted the KIR region — which has implications in autoimmune disease, transplantation, infections, and more — because it has historically been very challenging to sequence. Containing as many as 16 genes and pseudogenes, the highly homologous KIR haplotypes are shaped by tandem duplications, deletions, and frequent recombination. “These characteristics of homology, repetitiveness, and structural diversity have made the region difficult to haplotype,” the scientists note. “A sequencing approach that precisely captures the complexity of KIR haplotypes for functional annotation is desirable.”
To that end, they incorporated SMRT Sequencing, which produces reads long enough to span fosmids. “Using this method, we have for the first time comprehensively sequenced and phased sixteen KIR haplotypes from eight individuals without imputation,” the authors report. “Sixteen haplotypes from eight individuals were completely and [unambiguously] sequenced except for two haplotypes whose KIR3DL3 genes were not captured in the fosmid and a small gap in one of the haplotypes, located in a repetitive insertion spanning over 100,000 bp.” Haplotypes were as short as 69 kb and as long as 269 kb, and included four novel structures. The team also uncovered a new gene fusion as well as previously unreported structural variants.
One of the most important elements for resolving this difficult region was eliminating the need to shotgun shear the fosmids prior to sequencing. Because the longest SMRT Sequencing reads could span full-length fosmids, “it is therefore possible to span an entire fosmid insert with a single continuous read,” the scientists write. “Bypassing the shearing … helped improve the phasing accuracy of the individual fosmid sequences and the high-quality sequences of complete fosmids easily tiled into full haplotypes.”
This workflow made it possible to phase centromeric and telomeric regions, among other accomplishments. “Such completely de novo assembled sequences not only provide the ability to discover and annotate KIR gene alleles at the highest resolution, but also provide value as references, evolutionary informers, and source material for imputation,” the team writes.
We enjoyed attending the annual meeting of the Society for Molecular Biology and Evolution in Austin earlier this month. Some 1,500 people attend SMBE, which this year offered cutting-edge sessions on evolutionary genomics, microbiome dynamics, epigenetics, and much more.
There were several posters and presentations featuring SMRT Sequencing data, most focused on using highly accurate, long-read data to generate or improve reference genomes. The high-quality assemblies we saw are enabling evolutionary biologists to investigate genomic structure and function in a variety of organisms as well as characterize structural variants and other complex polymorphisms that are difficult to detect with conventional technologies.
If you couldn’t attend SMBE, we encourage you to check out these resources for a glimpse of how this community is making use of SMRT Sequencing:
This preprint from Mahul Chakraborty et al. reports a new, reference-grade assembly of an African D. melanogaster strain.
This poster from Fabrizio Ghiselli et al. describes a study of a unique bivalve with two mitochondrial genomes.
This poster from Patrick Reilly et al. shows results from several high-quality de novo assemblies produced with SMRT Sequencing.
This paper from Tracey Ruhlman et al. was the basis for an SMBE poster about characterizing repeat structures in the plastid genome of the flowering plant, Monsonia emarginata.
This poster from Sarah Kingan and Aaron Wenger from PacBio presents structural variation data generated with SMRT Sequencing.
A news article in Science magazine nicely captures the improvements in the quality of genome assemblies made possible by long-read sequencing and other methods. “New technologies boost genome quality” was written by Elizabeth Pennisi and includes interviews with a number of leading scientists.
One of those is Erich Jarvis, the neuroscientist at Rockefeller University who is best known in the genomics community for his work on vocal learning with songbirds and has been instrumental in both the G10K and B10K programs. As he told Pennisi, “The genome quality makes a huge difference in the type of science we can do.” For sequencing through repetitive or other challenging regions, he added, “the long read is always more accurate.”
“He and many other genomics experts are launching a quiet revolution aimed at building better genomes, one made possible by newer sequencing technologies, novel methods for locating sequences on chromosomes, and improved software for piecing DNA together,” Pennisi reports. “In the past 6 months, these approaches have led to a flood of high-quality animal and plant genomes in preprints and published papers.”
The USDA’s Tim Smith is also included, with a look at the impressive goat genome assembly he and his team recently produced. For that effort, he used SMRT Sequencing and complementary approaches, such as Hi-C and optical mapping, for optimal quality and contiguity. The finished assembly, Pennisi notes, “consists of chromosome-length pieces of DNA and only has 492 gaps, a 500-fold improvement over the first goat genome, done in late 2012.”
The article features a chart showing improvements in assembly metrics for three recently published genomes: hummingbird, goat, and maize. We have followed each of those assemblies closely, but there’s something about seeing all the data in one quick chart to really underscore how significant these advances have been — and in such a short period of time. If you have a moment, it’s definitely worth a look.
In a new bioRxiv preprint, scientists from Johns Hopkins present a major step forward in accuracy and completeness for the wheat genome. Their new assembly, generated largely from PacBio data, demonstrates the importance of using long, highly accurate reads for resolving extremely complex, repetitive genomes.
“The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum,” comes from lead author Aleksey Zimin, senior author Steven Salzberg, and collaborators. In launching this project, the team aimed to overcome a longstanding challenge for the wheat research community. “Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall size of more than 15 billion bases,” they write. “Multiple past attempts to assemble the genome have failed.”
The first publication for this species, which came out in 2012, only assembled one-third of the genome. In 2014, a short-read assembly managed to capture two-thirds of the genome in a highly fragmented assembly, while a subsequent short-read-based effort delivered more sequence but in millions of contigs.
For this project, scientists adopted SMRT Sequencing to produce reads long enough to span repetitive elements that are a hallmark of the plant’s genome. The result was phenomenal: “Ours is the first assembly that contains essentially the entire length of the genome, with more than 15.3 billion bases, and its contiguity is more than ten times better than the partial assemblies published in the past,” the authors report.
The team took two approaches to creating the assembly: a hybrid Illumina-PacBio version, and an all-PacBio version. Ultimately, they merged both to create the final assembly, which has a contig N50 of 232.6 kb; the longest contig is 4.5 Mb. The PacBio-only version, produced with the FALCON assembler, relied on 36-fold genome coverage to generate an assembly of 12.94 Gb with a contig N50 of 215.3 kb. “The key factor in producing a true draft assembly for this exceptionally repetitive genome was the use of very long reads, averaging just under 10,000 bp each, which were required to span the long, ubiquitous repeats in the wheat genome,” the scientists note.
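For readers less familiar with the metric, contig N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly size. The short Python sketch below illustrates that calculation. The function name `contig_n50` and the toy contig lengths are hypothetical, chosen only for illustration; they are not taken from the wheat assembly or from any assembler's code.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the contig at which the cumulative sum,
    taken from the longest contig downward, first reaches at least
    half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not real data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))  # total is 300, so half is 150; prints 70
```

Because the metric depends only on the sorted contig lengths, a few very long contigs can raise N50 substantially, which is why long reads that span repeats tend to improve it.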
Evaluating the assembly’s quality was a tall order given the state of previous assemblies. The scientists compared it to an assembly for a diploid ancestor of bread wheat and found that 99.8% of the smaller genome aligned to the new assembly, offering “strong support for its accuracy” as well as its completeness, they write.
One of the most interesting findings of this effort was the delineation of that ancestral plant’s contributions to the bread wheat genome (known as the wheat D genome). “By aligning this assembly to the draft genome of Aegilops tauschii, the progenitor of the wheat D genome, we were able to cleanly separate the D genome component from the A and B genomes of hexaploid wheat, which is reported here for the first time,” the team explains.
Ultimately, the scientists believe the new wheat assembly offers a significant boost to the wheat community, which has never had the benefit of a well-annotated, high-quality genome for crop improvement efforts. “This represents by far the most complete and contiguous assembly of the wheat genome to date,” the scientists write, “providing a strong foundation for future genetic studies of this important food crop.”
In an effort to improve precision medicine in Chinese populations, Novogene announced plans to build a database of structural variants in 1,000 Chinese individuals using PacBio SMRT Sequencing. Databases that catalog SNVs and small indels have proven invaluable for precision medicine, serving as population controls for rare disease research and providing a list of variants for genetic association studies. Yet most of the base pairs that differ between two human genomes are in structural variants, which are not adequately represented in current databases. Furthermore, current databases do not represent the genetic background of all ethnic populations, particularly the Chinese, who comprise one-fifth of the world’s population.
Novogene will perform the sequencing using a fleet of up to 10 PacBio Sequel Systems, which can produce reads with an average length of 10,000 – 18,000 bp. Long reads are better able to map to repetitive regions of the genome and fully span large variants. Previous studies have shown that long-read sequencing has five times higher sensitivity for structural variant discovery than short-read sequencing approaches [1]. Structural variants are already known to cause many human diseases, including Carney complex, Potocki-Lupski syndrome, ALS, and Smith-Magenis syndrome. Thus, complete measurement of structural variation is required for precision medicine.
Novogene’s structural variant database will include a variety of disease types across the population cohort. In a statement, Novogene Founder and CEO Ruiqiang Li said, “This more revealing and informative database should greatly improve our understanding of disease mechanisms and contribute to the development of novel diagnostic and therapeutic approaches.”
[1] Huddleston, J. et al. (2016) Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Research (5), 677-685
In a Nature Genetics paper, scientists used SMRT Sequencing to detect and compare structural variations in several yeast strains in order to understand evolutionary genome dynamics. They found different rates of evolution among domesticated and wild strains, and suggest that “the influence of human activities” could explain this.
“Contrasting evolutionary genome dynamics between domesticated and wild yeasts” comes from lead author Jia-Xing Yue, senior author Gianni Liti, and collaborators at the Université Côte d’Azur, the Wellcome Trust Sanger Institute, and other institutes. Choosing long reads to facilitate accurate detection of structural variants, they used PacBio sequencing and generated “end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus,” the scientists report. “The raw PacBio de novo assemblies of both nuclear and mitochondrial genomes showed compelling completeness and accuracy, with most chromosomes assembled into single contigs, and highly complex regions accurately assembled.”
According to the team, the final 12 assemblies provided “unprecedented resolution” for analyzing subtelomeric regions, which yielded a detailed look at evolutionary genome dynamics. “In chromosomal cores, S. paradoxus shows faster accumulation of balanced rearrangements (inversions, reciprocal translocations and transpositions), whereas S. cerevisiae accumulates unbalanced rearrangements (novel insertions, deletions and duplications) more rapidly,” the scientists write. “In subtelomeres, both species show extensive interchromosomal reshuffling, with a higher tempo in S. cerevisiae.” The accelerated evolution in baker’s yeast is likely to be at least partly a function of human activity and the human-associated environments to which the organisms have been exposed, they add.
The authors note that this study is an indicator of the utility of SMRT Sequencing for population genomics. Ultimately, they say, their results offer an intriguing new explanation for “why S. cerevisiae, but not its wild relative, is one of our most biotechnologically important organisms.”
The Global Ant Genomics Alliance (GAGA) recently announced that it has adopted SMRT Sequencing as its technology of choice for generating high-quality genome assemblies. The alliance, made up of more than 50 scientists at dozens of institutions around the world, aims to sequence 200 ant species to provide a comprehensive look at genomic diversity across ant genera and to provide the scientific community with a foundation of data to enable decades’ worth of research.
GAGA teamed up with genomic service provider Novogene, which agreed to purchase 10 Sequel Systems earlier this year. In the announcement, GAGA noted that SMRT Sequencing enables “high quality genome assemblies with very few gaps. It is likely to become the dominant technology for de novo reference genome sequencing.”
GAGA also stated in the announcement that choosing SMRT Sequencing “should make sure that ant genomes generated under GAGA will meet future quality standards of journals and repositories.” This is particularly useful in working toward GAGA’s goals of generating a genome-based phylogeny for ants, studying the adaptive significance of physical caste phenotypes, and understanding how ants have adapted to different methods of resource acquisition.
For more information on the alliance, check out GAGA’s website. We look forward to seeing lots of reference-grade de novo genome assemblies coming from this group soon!