This blog features voices from PacBio — and our partners and colleagues — discussing the latest research, publications, and updates about SMRT Sequencing. Check back regularly or sign up to have our blog posts delivered directly to your inbox.
We’re excited to announce that we’ll be working closely with two programs that are committing significant resources toward generating reference-quality genomes of thousands of vertebrate species. Both the Genome 10K (G10K) and Bird 10,000 Genomes (B10K) initiatives have invested in SMRT Sequencing to build high-quality de novo genome assemblies for the next phase of their programs. By sequencing large numbers of vertebrates, the groups hope to develop resources that will be useful for species conservation efforts in the future.
The G10K project was established in 2009 by a consortium of biologists and genome scientists, including Duke neurobiologist Erich Jarvis, Steve O’Brien of the Dobzhansky Center for Genome Bioinformatics, David Haussler and Beth Shapiro of the UC Santa Cruz Genome Institute, and Oliver Ryder of the San Diego Zoo Institute for Conservation Research. Together they determined to sequence the genomes of 10,000 vertebrate species by 2020. The B10K project, launched in 2015 and co-led by Jarvis along with Guojie Zhang of BGI and Thomas Gilbert of the University of Copenhagen, is an initiative to generate representative draft genome sequences for all 10,500 bird species, also within the next five years.
These groups have already contributed genomic resources to the conservation biology community. They collaborated for the first phase of the projects, yielding outcomes such as the Avian Phylogenomics Project, which involved more than 200 scientists and sequenced the genomes of more than 45 new bird species. At the start they used short-read technologies, but have since discovered that with long-read SMRT Sequencing they can produce de novo assemblies of complex genomes with much higher quality.
Jarvis recently sequenced two bird species with SMRT Sequencing, generating high-quality assemblies with long, gapless contigs, half of which were several megabases in length or larger. For example, for Anna’s hummingbird (Calypte anna), the project significantly increased the number of complete genes and reduced the number of contigs compared to a previous short-read assembly, from 124,000 contigs using short-read sequencing to 1,000 using SMRT Sequencing. In a separate sequencing project for zebra finch, PacBio Sequencing fully resolved gaps in the Sanger reference and detected errors in the previous reference genome. For additional details, check out our recap of Jarvis’s talk at this year’s East Coast user group meeting.
Now, the G10K and the B10K initiatives will use Sequel Systems in the next phases of their work. They intend to sequence the genomes of several thousand vertebrate species with PacBio technology for diploid-resolved, high-quality de novo genome assemblies, and perform subsequent chromosome-level scaffolding with complementary approaches, including BioNano Genomics’ optical genome mapping, Dovetail’s proximity in vitro genome mapping, and Phase Genomics’ Hi-C mapping. To that end, Jarvis, who is now at The Rockefeller University and affiliated with the New York Genome Center, has ordered two Sequel Systems and plans to bring on three additional units. Several other global leaders of the G10K and B10K consortia will also contribute use of their recently acquired Sequel Systems toward the goal of creating de novo assembled vertebrate genomes, including Harris Lewin at UC Davis in the USA, Richard Durbin at the Sanger Institute in the UK, Gene Myers at the Max Planck Institute of Molecular Cell Biology & Genetics in Germany, and Guojie Zhang with affiliations at BGI in China and Denmark.
Jarvis and other members of the G10K and B10K consortia recently submitted a proposal to the MacArthur Foundation’s new 100andchange competition, hoping to secure $100 million to create a Digital Noah’s Ark Genome Library of all 8,000 endangered vertebrate species on Earth. In addition, the G10K and B10K consortia decided that their goals and the MacArthur proposal will be stages of a larger, longer-term effort to populate the Digital Noah’s Ark Genome Library with high-quality blueprint genomes of all ~66,000 vertebrate species in the world through an umbrella program called the Vertebrate Genomes Project. It’s an audacious goal, and we wish them luck in the competition!
Recent de novo assemblies of individual human genomes have uncovered thousands of structural variants, many of which are accessible only with PacBio long reads [1-3].
|Personal Genome||PacBio Coverage||Deletions ≥50 bp||Insertions ≥50 bp|
A similar increase in structural variant sensitivity relative to short-read methods has been demonstrated with low-fold coverage PacBio sequencing interpreted against the reference genome [4]. To demonstrate and evaluate the low-fold coverage approach on the PacBio Sequel System, we generated approximately 10-fold coverage of the well-studied human sample NA12878.
Purified DNA for NA12878 was obtained from Coriell, sheared to an average size of 25 kb, converted to SMRTbell templates, and size selected to 15 kb on the BluePippin system (Sage Science). The resulting library was loaded on 10 SMRT Cells. Each SMRT Cell was run for 6 hours on the Sequel System with chemistry v1.2. (The recently released Arabidopsis data used the newer chemistry v1.2.1, which yields about 5 Gb per SMRT Cell with a read length N50 of 16.4 kb.) In total, the runs generated 32.8 Gb of data contained in 3.4 million reads, with half of the bases in reads longer than 11.8 kb.
|Run Time||60 hrs|
|Number of Bases||32.8 Gb|
|Number of Reads||3.4 M|
|Read Length N50||11,823 bp|
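The two summary figures above, read length N50 and fold coverage, can be recomputed from a list of read lengths. The following is a minimal sketch, not the pipeline used for this dataset: the toy read lengths are made up, and the genome size of 3.1 Gb is an assumed round figure for a haploid human genome.

```python
def read_length_n50(lengths):
    """Return the length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    # Walk from the longest read down until half of the bases are accounted for.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length


def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Hypothetical toy read lengths (30,000 bases in total), for illustration only.
toy_lengths = [2_000, 4_000, 6_000, 8_000, 10_000]
toy_n50 = read_length_n50(toy_lengths)

# ASSUMPTION: ~3.1 Gb haploid human genome size; the post does not state the value used.
ASSUMED_GENOME_SIZE = 3.1e9
coverage = fold_coverage(32.8e9, ASSUMED_GENOME_SIZE)

print(f"toy N50: {toy_n50} bp")
print(f"estimated coverage: {coverage:.1f}-fold")
```

With the 32.8 Gb reported above and the assumed genome size, the estimate lands between 10- and 11-fold, consistent with the "approximately 10-fold" description.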
Reads were mapped to the GRCh37 human reference genome with NGM-LR, and structural variants were called with PBHoney [4]. A total of 7,386 deletions and 7,445 insertions of at least 50 bp were identified and comprise the “10-fold SV call set.”
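To make the 50 bp size threshold concrete, here is a deliberately simplified sketch that scans the CIGAR string of a single alignment for large insertions and deletions. It is an illustration only and does not reproduce PBHoney's method; a real caller combines evidence from many reads and from split alignments. The function name, the example CIGAR string, and the start coordinate are all hypothetical.

```python
import re

# One CIGAR operation: a count followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_SV_LENGTH = 50  # size threshold used for the call set in this post


def large_indels(ref_start, cigar, min_length=MIN_SV_LENGTH):
    """Return (kind, ref_pos, length) for each insertion or deletion >= min_length.

    ref_pos is reported in the same coordinate system as ref_start.
    """
    events = []
    ref_pos = ref_start
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "D" and n >= min_length:
            events.append(("DEL", ref_pos, n))
        elif op == "I" and n >= min_length:
            events.append(("INS", ref_pos, n))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += n
    return events


# Hypothetical alignment: a 315 bp deletion, a 3 bp insertion (ignored), a 328 bp insertion.
events = large_indels(1000, "500M315D20M3I400M328I100M")
for kind, pos, length in events:
    print(kind, pos, length)
```

A long read that spans the whole variant carries the event inside one alignment, which is why this kind of per-read inspection is possible at all.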
Visualizing Structural Variants
Ongoing improvements to the IGV browser [5] (available now in the development version) improve visualization for PacBio reads and structural variants. With these updates, IGV provides a clear representation of deletions, insertions, and trinucleotide repeats, and shows how long reads span structural variants.
Heterozygous 315 bp deletion at chrX:116,454,160-116,454,859
Homozygous 328 bp insertion at chr10:92,213,800-92,216,245
FMR1 trinucleotide repeat small expansion at chrX:146,993,200-146,993,950
Evaluation of 10-fold Call Set
To quantify sensitivity, the 10-fold SV call set was compared to a merged NA12878 “truth” set from the 1000 Genomes Project [7] and Genome in a Bottle [6].
|Set||Platform||Deletions ≥50 bp||Insertions ≥50 bp|
|truth: 1000 Genomes + GIAB [6,7]||Illumina||3,021||1,090|
|10-fold SV call set||PacBio Sequel||7,386||7,445|
The 10-fold SV call set recalls 86% of truth set deletions and 81% of insertions. Moreover, it includes thousands of deletions and insertions that are not in the truth sets, most of which are directly confirmed by a FALCON-Unzip de novo assembly from 60-fold PacBio RS II coverage.
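Recall here is the fraction of truth-set variants that also appear in the call set. The post does not say how calls were matched to truth variants, so the sketch below assumes a 50% reciprocal-overlap rule for deletions, a common convention rather than the confirmed method. Insertions have no reference span and would need a different rule, such as position and size tolerance. All intervals are made-up toy data.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for (chrom, start, end) half-open intervals."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def recall(truth, calls, min_overlap=0.5):
    """Fraction of truth intervals matched by at least one call.

    ASSUMPTION: the 50% reciprocal-overlap criterion is illustrative only.
    """
    if not truth:
        raise ValueError("truth set is empty")
    found = sum(
        1 for t in truth
        if any(reciprocal_overlap(t, c) >= min_overlap for c in calls)
    )
    return found / len(truth)


# Hypothetical toy deletions.
toy_truth = [("chr1", 100, 400), ("chr1", 1000, 1200), ("chr2", 50, 150)]
toy_calls = [("chr1", 110, 410), ("chr2", 500, 900)]
toy_recall = recall(toy_truth, toy_calls)
print(f"toy recall: {toy_recall:.2f}")

# Counts implied by the percentages reported in the post (rounded).
approx_recalled_deletions = round(0.86 * 3021)
approx_recalled_insertions = round(0.81 * 1090)
print(approx_recalled_deletions, approx_recalled_insertions)
```

Applied to the reported figures, 86% of 3,021 deletions and 81% of 1,090 insertions correspond to roughly 2,600 and 880 recalled truth-set variants, respectively.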
In summary, this 10-fold SV call set demonstrates that low-fold coverage sequencing on the PacBio Sequel System is an affordable, effective approach for identifying structural variants and provides much improved sensitivity compared to short-read approaches. We are excited to see how this approach will be extended and applied to study genetic variation in disease cohorts, in human populations, and in other organisms.
To illustrate the low-fold coverage structural variant calling workflow, the NA12878 Sequel data is available for analysis on DNAnexus.
[1] Chaisson MJ, et al. (2015). Nature, 517(7536):608-11.
[2] Shi L, et al. (2016). Nat Commun, 7:12065.
[3] Seo JS, et al. (2016). Nature, 538(7624):243-7.
[4] English AC, et al. (2015). BMC Genomics, 16:286.
[5] Robinson JT, et al. (2011). Nat Biotechnol, 29(1):24-6.
[6] Parikh H, et al. (2016). BMC Genomics, 17:64.
[7] Sudmant PH, et al. (2015). Nature, 526(7571):75-81.
We’re here in rainy but beautiful Vancouver for the American Society of Human Genetics annual meeting. ASHG 2016 promises to be every bit as fascinating as always, with great speakers, excellent sessions, and thought-provoking posters.
The PacBio team will be based in booth #718, and we encourage you to stop by to see the Sequel System and learn more about how SMRT Sequencing has already made a genuine difference in our understanding of human genetics. We’re impressed by the wide variety of ASHG posters citing PacBio data this year and hope you get a chance to peruse them.
We’ll be hosting a luncheon workshop at ASHG called “Discovering and Targeting Causative Variation Underlying Human Genetic Disease Using SMRT Sequencing.” The event will be held on Thursday, October 20th, at 1:00 pm PDT in the Crystal Pavilion Ballroom at the Pan Pacific Hotel (connected to the convention center). We have a great speaker lineup:
Euan Ashley, Stanford University
Towards Precision Medicine
Melissa Laird Smith, Icahn School of Medicine at Mount Sinai
SMRT Sequencing as a Translational Research Tool to Investigate Germline, Somatic and Infectious Diseases
Michael Lutz, Duke University Medical Center
Identification and Characterization of Informative Genetic Structural Variants for Neurodegenerative Diseases
Jonas Korlach, PacBio
A Future of High-Quality Genomes, Transcriptomes & Epigenomes
We hope to see you at ASHG!
In a Nature Methods paper released today, scientists describe new bioinformatics tools for producing diploid genome assemblies from SMRT Sequencing reads. FALCON (Fast ALignment and CONsensus for assembly) and FALCON-Unzip were developed by PacBio scientists in collaboration with researchers at Johns Hopkins University, Cold Spring Harbor Laboratory, the Joint Genome Institute, and other institutions.
“Phased diploid genome assembly with single-molecule real-time sequencing” comes from lead authors Chen-Shan Chin and Paul Peluso, senior author Michael Schatz, and collaborators. In the publication, the team details how FALCON and FALCON-Unzip work and presents data from several validation studies of organisms including Arabidopsis, the Cabernet Sauvignon grape, and the diploid fungus Clavicorona pyxidata.
“Currently available genome assemblies rarely capture the heterozygosity present within a diploid or polyploid species,” Chin et al. write. “Most assemblers output a mosaic genome sequence that arbitrarily alternates between parental alleles.” That leads to a loss of important information about differences between homologous chromosomes. To address this issue, the team developed the diploid-aware FALCON assembler and FALCON-Unzip, a tool for resolving haplotypes. Both tools are open-source.
As the authors describe it, “The FALCON assembler follows the design of the hierarchical genome assembly process (HGAP) but uses more computationally optimized components.” FALCON builds a string graph with bubbles representing differences between paired chromosomes. “FALCON-Unzip identifies read haplotypes using phasing information from heterozygous positions that it identifies,” they add. The phased reads are used to construct contigs for both haplotypes as well as the unique sequence for each chromosome, resulting in a “final diploid assembly with phased single-nucleotide polymorphisms (SNPs) and structural variants (SVs).”
The team assembled a trio of Arabidopsis plants to validate the accuracy of the haplotype separation, then applied the tools to the fungus and wine grape genomes. “In all three genomes that we studied, the FALCON/FALCON-Unzip assembly was two- to three-fold more contiguous than alternative long-read assemblers and 30- to >100-fold more contiguous than state-of-the-art short-read assemblers,” they report. In Arabidopsis, for instance, they were able to resolve haplotype chromosomes for almost the entire genome. In the V. vinifera grape, the diploid assembly revealed high variation rates in homologous regions, and in C. pyxidata it showed long stretches of much lower heterozygosity than expected.
This new view of genomes could have major implications for characterizing methylation, gene expression, and regulatory elements. “More systematic study of phased diploid references will expose the detailed cis-regulatory mechanisms of differential expression in diploid genomes to improve our general understanding of the biology beyond haploid genomes,” the scientists write. “Looking forward, we expect many new opportunities for understanding diploid and polyploid genomic diversity and its impact on genome annotation, gene regulation, and evolution.”
In a paper published today in Nature, scientists from Seoul National University, Macrogen, and other institutions present the de novo genome assembly for a Korean individual. The effort used SMRT Sequencing and other technologies to generate the assembly, fully phase all chromosomes, and perform detailed analyses of structural variation and other elements. In the process, the team generated novel sequence data that helps fill gaps in the human reference genome and continues the trend of developing important new population-specific resources.
The work, reported in “De novo assembly and phasing of a Korean human genome,” was contributed by lead authors Jeong-Sun Seo, Arang Rhie, Junsoo Kim, and Sangjin Lee, senior author Changhoon Kim, and collaborators. The authors note that standard NGS approaches could not have accomplished the high-quality genomic resource they required. “Simple alignment of short reads to a reference genome cannot be used to investigate the full range of structural variation and phased diploid architecture, which are important for precision medicine,” they write. “By contrast, the single-molecule real-time (SMRT) sequencing platform produces long reads that can resolve repetitive structures effectively.”
For this effort, the scientists performed genome sequencing with PacBio technology and then integrated data from orthogonal platforms such as BioNano Genomics. SMRT Sequencing alone produced a highly accurate de novo assembly with 3,128 contigs and a contig N50 length of nearly 18 Mb. Combined with BioNano data and polished with Illumina sequence, the final assembly “is characterized by marked contiguity that has not been achieved by non-reference assemblies of the human diploid genome so far, and improves on the previous best N50 length by 18 Mb,” the scientists note. Ninety percent of the genome is covered in the largest 91 scaffolds.
That assembly was compared to the human reference, GRCh38, where it closed 105 of 190 remaining euchromatic gaps and extended into 72 more, adding about 1 Mb of novel sequence. “These locations, previously intractable using only short reads, commonly contained simple tandem repeats,” the authors report.
The contiguity of the assembly allowed scientists to delve deeply into structural variation, identifying more than 18,000 variants—nearly 12,000 of which had never been reported before. “Of the new SVs, 86% were highly enriched for clusters of mobile and tandem repeats,” the team writes. A look at insertions found that almost half had significant variability in frequency across populations, while nearly 10 percent of them were specific to people of Asian descent. (This follows the pattern seen with other population-specific assemblies, such as the recently published Chinese genome.)
Finally, the scientists constructed separate assemblies for each haplotype to more accurately represent the diploid genome. To assess the results, they examined the HLA complex, finding that phasing had been successful despite a large amount of structural variation. “Our approach also allowed a clinically important duplication of CYP2D6 to be detected and assigned to one phase,” the scientists report. “This result demonstrates that de novo assembly-based phasing has advantages in resolving challenging hypervariable regions, and could be used further for pharmacogenomics.”
The scientists note that this work produced “the most contiguous diploid human genome assembly so far,” supporting the idea that integrating technologies leads to optimal results for detecting structural variants and other elements that have been impossible to resolve with short reads. They also remind the community that many more population-specific resources will be important for realizing the potential of genomics. “Our findings demonstrate the important genomic differences of Asian ancestral group from the others, and highlight the need for further genomic studies focused on individuals outside of European ancestry to describe the full range of functionally important variations in humans,” they write.
More than 150 SMRT Sequencing users gathered at Stanford University for our annual West Coast User Meeting & workshops earlier this month. Many thanks to all the scientists who attended and shared their research. For anyone who couldn’t make it, we’ve included some highlights from each talk below (and links to download the full presentations when possible):
The event began with Marty Badgett, our senior product manager for the Sequel System and the PacBio RS II, discussing recent technology updates. He presented the most recent results from the Sequel System, highlighting resequencing and small and large genome applications. Specifically, two metagenomic- and one immunology-targeted sequencing datasets demonstrated high single-molecule accuracy with over 225,000 reads each at >QV30. Next up were large insert libraries, showing a range of data from bacterial, plant, and animal projects. These showcased the benefits of a nearly 7-fold increase in the number of reads from the larger Sequel SMRT Cell, with half of the data in reads longer than 14,500 bp. Finally, Marty showed the history of our development efforts on the PacBio RS platform and how we are applying those lessons to future developments on the Sequel System, including improvements to read length and reductions in input library amounts.
Kicking off the user presentations, Yahya Anvar from Leiden University Medical Center presented results from using SMRT Sequencing to study drug metabolism, specifically variants in the CYP2D6 gene. Anvar’s team has developed a CYP2D6 genotyping approach that enables his group to obtain high-quality, full-length, phased CYP2D6 sequences. According to Anvar, this leads to accurate variant calling and haplotyping of the entire gene locus, including exonic, intronic, and upstream and downstream regions. In addition to accurate characterization of variants within this locus, they can reliably describe copy-number changes, rearrangements, and gene conversions that have been missed by standard genotyping assays. He concluded that this method provides a powerful framework to infer drug response phenotype.
Christine Beck, a postdoctoral fellow at Baylor College of Medicine, discussed the use of target capture for complex genomic loci. The team uses targeted large-insert capture of human chromosome region 17p11.2 combined with long-read PacBio sequencing, which has allowed them to identify novel breakpoint junctional sequences in previously intractable repetitive DNA at this locus. She detailed the use of genomic approaches to characterize additional rearrangements of this structurally complex region and described mechanistic insights into genomic rearrangement formation that have been gleaned from these data.
Aaron Wenger, a senior staff research scientist at PacBio, spoke about improved support for long reads in the Integrative Genomics Viewer (IGV). New features include a quick consensus mode that suppresses random base-pair errors, quick phasing to group reads based on the nucleotide at a selected heterozygous variant, and labels for large insertions and deletions to reveal structural variants. He also presented examples of the extended IGV to explore haplotype phasing and structural variants in a human whole genome sequence. He noted that the development build of the viewer is available to download.
In Euan Ashley’s lab at Stanford, researchers are studying cardiac disease genes using SMRT Sequencing. Graduate student Alexandra Dainis presented their use of targeted Iso-Seq to phase cardiac disease genes. They were interested in using PacBio long-read sequencing because, unlike short-read sequencing, it can capture multiple SNPs or mutations on a single sequencing read and provide phased genetic information without the need for familial sequencing or inferential phasing from population data. Dainis discussed their work in hypertrophic cardiomyopathy, an autosomal genetic disorder that remains a leading cause of sudden death in young adults. Phasing disease-causing mutations may reveal disease-associated haplotypes that could be targets for new genetic therapies. The team has phased two sarcomeric genes (MYH7 and MYBPC3) in 10 left-ventricular heart RNA samples, from both controls and diseased hearts, and used this data to phase exonic disease-causing mutations and common SNPs into haplotypes for each sample. Their goal is to proceed to the development of new, haplotype-specific therapeutics.
Continuing the theme of human studies, Tina Graves-Lindsay from the McDonnell Genome Institute at Washington University School of Medicine spoke about plans to provide additional allelic diversity to the current human reference sequence by generating high-quality, highly contiguous human genome assemblies of individuals representing diverse populations. To date, they have sequenced seven diploid genomes. Their strategy involves generating deep coverage of PacBio sequence and scaffolding using optical mapping or cross-linking technologies to give even larger, chromosome-level information. This strategy also involves the use of large insert clone sequencing in targeted regions, which are typically not resolved in the whole genome assemblies.
Jason Underwood, who is both a principal scientist at PacBio and a senior fellow at the University of Washington, talked about the challenges posed by segmental duplications. An important source of genetic instability, they are associated with both rare and common diseases and can provide seeds for evolutionary innovation. UW used the Iso-Seq method to yield full-length transcript information and distinguish between gene copies with more than 99% sequence homology. Their approach uses complementary biotinylated oligonucleotide probes to enrich for duplicate genes from cDNA. They designed probes to 20 gene families that underwent duplications specifically on the human lineage since divergence from chimpanzee. Sequence analysis of captured cDNA from fetal and adult brain revealed mean transcript sizes ranging from 1,200 bp to 2,300 bp with transcripts up to 4 kb identified with high confidence. Among the human-specific duplications, they observed new isoforms, including novel sites of transcription initiation and polyadenylation, as well as previously unannotated open-reading frames, indicating that potentially novel human-specific brain mRNAs have previously been missed by short-read profiling.
The talks also included several studies of plants and animals. Stephen Mondo of the Joint Genome Institute focused on epigenetics, specifically N6-methyldeoxyadenine (6mA), which has only been found in four species: the alga Chlamydomonas reinhardtii, Drosophila melanogaster, C. elegans, and Mus musculus. Despite appearing at low levels, 6mA is critical for proper development, as it plays an important role in regulating gene expression. JGI scientists conducted the first kingdom-wide exploration of 6mA in fungi, where they found abundant utilization of 6mA in early diverging fungi, with up to 2.8% of all adenines methylated, vastly exceeding the levels observed in other organisms. Their results demonstrated the importance of 6mA as a broadly conserved epigenomic mark in eukaryotes and implicated 6mA as an epigenomic mark transmissible across nuclear division.
Amanda Larracuente from the University of Rochester talked about work in Drosophila looking at satellite DNA (satDNA), large blocks of tandem repeats that accumulate in heterochromatic genomic regions with low recombination, such as near centromeres and on Y chromosomes. Using SMRT Sequencing and multiple algorithms and parameter combinations to determine the optimal assembly approaches for heterochromatic regions rich in satDNA, they revealed the structure of complex satDNA loci with unprecedented resolution. These assemblies are providing a platform for evolutionary and functional genomic studies of satDNAs and other repeat-rich regions of the genome.
We also heard about work with wine grapes from Dario Cantu at the University of California, Davis. The genomes of the grapes and their microbial communities can shed light on beneficial organisms and how to avoid infestations that can kill these high-value crops. Deep sequencing of rRNA and metagenomes has allowed UC Davis to characterize the microbial communities in the vineyard, while whole-genome shotgun sequencing provided them with the references necessary to apply metatranscriptomics and profile gene expression of all interacting organisms simultaneously, including the grapevine host. The highly heterozygous genome of Cabernet Sauvignon was sequenced at 140x coverage with the PacBio RS II using a combination of 20 kb and 30 kb DNA libraries, producing an assembly with a contig N50 of 2.17 Mb. SMRT Sequencing was also used to sequence the genomes of some of the most common and economically important grape pathogens. For most fungal species, entire chromosomes were reconstructed into single-contig, telomere-to-telomere assemblies.
Tim Smith from the USDA Agricultural Research Service gave an update on work with the goat genome as a model for chromosome-scale assemblies. He pointed out that highly fragmented short-read assemblies impede downstream applications. That’s why their work for de novo assembly of the domestic goat (Capra hircus) is based on PacBio long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. The assembly represents a better than 250-fold improvement in contiguity compared to the previously published C. hircus assembly, and resolves many repetitive structures, including the most complete repeat family and immune gene complex representation ever produced for a ruminant species.
Our chief scientific officer, Jonas Korlach, capped off the day with a talk about how SMRT Sequencing is enabling a future of high-quality genomes, transcriptomes, and epigenomes. He said that scientific papers using SMRT Sequencing technology are being published at a rate of 25-30 per week, with more than 650 so far this year. He also noted that, with SMRT Sequencing now established as the gold standard for closing bacterial genomes, there has been an explosion in its use for methylation detection in bacteria, unleashing a new era in bacterial methylomes. He congratulated PacBio users who belong to the “1 MB Contig Club,” which now extends to characterizing transcriptomes. He also highlighted recent work on maize and sorghum as well as human genomes, where we’ve seen a number of high-quality assemblies from various ethnic populations. Korlach highlighted the differences between contigs and scaffolds, and how long strings of unknown bases in assemblies dramatically alter their utility.
We’d like to thank our host, Jodi Puglisi from Stanford, as well as the partners present at the event: Advanced Analytical Technologies, Computomics, Covaris, Diagenode, DNAnexus, PerkinElmer, and Sage Science.
Today we are pleased to release the first Arabidopsis thaliana (Ler-0) dataset and de novo genome assembly generated with the Sequel System, using two SMRT Cells and 12 hours of runtime. Only three years ago, we released our first genome assembly [1] for Arabidopsis produced on the PacBio RS II using P4-C2 chemistry, 85 SMRT Cells, and 255 hours of runtime. Four months later, we released a second Arabidopsis dataset [1] using the improved P5-C3 chemistry, which reduced the number of SMRT Cells to 46 and runtime to 138 hours.
We produced this Sequel dataset using our latest chemistry enhancements which significantly reduce the amount of DNA required. Prior to these chemistry improvements, the amount of DNA needed to run many large genome projects on the Sequel System was prohibitive. These modifications enable the use of loading concentrations equivalent to PacBio RS II levels.
Details of the Library Protocol, Data Generation, and Assembly Process
Purified Arabidopsis (Ler-0) genomic DNA was sheared to an average size of 32 kb and converted to SMRTbell templates, followed by a 20 kb size selection performed on a BluePippin system (Sage Science). Each SMRT Cell was loaded at an on-plate concentration of 144 pM of library and run for 6 hours on the Sequel System using the modified chemistry. Collectively, the two SMRT Cells produced 10.8 Gb of data, contained in 1.1 million reads, with half of the data in reads greater than 16,400 bp in length. The data were assembled with HGAP4 in SMRT Link.
Results of Sequel System Arabidopsis genome assembly
| ||PacBio RS II (P4-C2)||PacBio RS II (P5-C3)||Sequel System|
|Release date||Sept 2013||Jan 2014||Sept 2016|
|Number of SMRT Cells||85||46||2|
|Run Time (hrs)||255||138||12|
|Number of Bases (Gb)||11.0||15.9||10.8|
|Number of Reads (M)||4.25||2.30||1.14|
|Read Length N50 (bp)||7,700||11,900||16,400|
| ||PacBio RS II (P4-C2)||PacBio RS II (P5-C3)||Sequel System|
|Release date||Sept 2013||Jan 2014||Sept 2016|
|Assembly Size (Mb)||121.7||124.5||122.9|
|Contig N50 (Mb)||6.2||6.7||10.4|
|Max Contig Length (Mb)||13.0||13.2||15.0|
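The scale of the improvement is easier to see as yield per SMRT Cell and per hour of run time. This short sketch only re-derives ratios from the numbers in the tables above; the dictionary layout and labels are illustrative choices.

```python
# Values copied from the data-generation table above.
datasets = {
    "RS II P4-C2 (Sept 2013)": {"cells": 85, "hours": 255, "gigabases": 11.0},
    "RS II P5-C3 (Jan 2014)": {"cells": 46, "hours": 138, "gigabases": 15.9},
    "Sequel (Sept 2016)": {"cells": 2, "hours": 12, "gigabases": 10.8},
}


def yield_rates(dataset):
    """Return (Gb per SMRT Cell, Gb per hour of run time)."""
    return (
        dataset["gigabases"] / dataset["cells"],
        dataset["gigabases"] / dataset["hours"],
    )


summary = {name: yield_rates(d) for name, d in datasets.items()}
for name, (gb_per_cell, gb_per_hour) in summary.items():
    print(f"{name}: {gb_per_cell:.2f} Gb/cell, {gb_per_hour:.2f} Gb/hour")
```

The Sequel run works out to 5.4 Gb per SMRT Cell, compared with roughly 0.13 Gb and 0.35 Gb per cell for the two RS II datasets.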
The raw and assembled data are publicly available for download.
De novo assembly of an Arabidopsis genome with SMRT Sequencing is not as groundbreaking as it was three years ago. However, this model organism data release demonstrates that, with these latest improvements, the Sequel System allows for the routine generation of high-quality assemblies of large, complex eukaryotic genomes. The modified chemistry is currently in testing and will be made available broadly once testing completes.
[1] Kim, K. E. et al. (2014). Long-read, whole-genome shotgun sequence data for five model organisms. Scientific Data, 1, 140045.
In recent interactions with the scientific community, we’ve seen a growing number of questions around scaffolding genome assemblies. We thought it might be useful to review the concepts behind contigs and scaffolds, as well as the circumstances in which one might want to scaffold a high-quality PacBio genome assembly.
Contigs vs. Scaffolds
Contigs are continuous stretches of sequence containing only A, C, G, or T bases without gaps. SMRT Sequencing has all of the necessary performance characteristics – long reads, lack of sequence-context bias, and high consensus accuracy – to generate contiguous genome assemblies with megabase-sized contigs. Ultra-long contigs provide complete and uninterrupted sequence information across full genes, and more recently even allow separation of the two chromosomes for diploid organisms [1]. The unprecedented quality of PacBio de novo genome assemblies has been described in many publications, such as the gorilla genome assembly with a contig N50 of 9.5 Mb recently featured on the cover of Science [2].
Scaffolds are created by chaining contigs together using additional information about the relative position and orientation of the contigs in the genome. Contigs in a scaffold are separated by gaps, which are designated by a variable number of ‘N’ letters. Scaffolding is often used for short-read assemblies to make sense of the fragmented genome assemblies containing short contigs. However, scaffolds have three principal deficiencies:
- Scaffolds miss critical information. Gaps represent missing genomic information and, in many cases, these gaps can coincide with important genomic loci. Many promoters and first exons are GC-rich in sequence, often resulting in missing or low-quality sequence reads from short-read or Sanger sequencing. Thus, genes are incompletely resolved, and their regulation cannot be understood. Another reason for gaps in scaffolded assemblies is large, repetitive elements which short-read sequencing methods struggle to bridge. Thus, duplicated genes, genes vs. pseudogenes, short tandem repeats, variable number tandem repeats, microsatellites, and many other structural genomic features are often unresolved in scaffolds.
- The length of a scaffold gap often has no relation to the true gap size. In several reference genomes, gaps are arbitrarily set to certain fixed lengths. For example, most gaps in the zebra finch reference are set to 100 Ns, while in the version 3 maize reference they are set to 1,000 Ns. This means that in most cases, the true length of sequence represented by the gap differs from the set gap size, and is sometimes off by thousands of bases. Uncertain gap sizes make it impossible to determine the true spatial relationships of functional elements in genomes, and the stated gap lengths underestimate the actual extent of missing information.
- Gap-flanking scaffold sequence can be low-quality, and is sometimes completely wrong. The sequences surrounding gaps often fall into areas where short-read technologies have deficiencies due to GC-bias or read-length limitations. This can result in sequence that is of lower quality and, in some cases, completely erroneous. For example, because of complex repeat structures in the human IGH locus, the right edge of a 50,000 N gap in the short-read assembly contains 1,836 bases of flanking sequence that has no support in the hg19 human genome reference or the PacBio assembly. In some ways, having incorrect flanking sequence in scaffolds is worse than having ‘N’ gaps, since that erroneous sequence is considered and included for downstream analyses.
Illustration of the difference between contigs and scaffolds in genome assemblies
The information missed by gapped scaffold assemblies complicates and may preclude downstream analysis and understanding related to functional and comparative genomics. Scaffolded short-read assemblies get nowhere near the quality of PacBio genome assemblies in terms of contiguity and completeness, and they often require labor-intensive follow-up work to close gaps, adding time and cost to projects.
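The distinction between contigs and scaffolds can be made concrete with a small sketch. The Python below is an illustration only, not part of any PacBio tool: the function names `split_scaffold` and `n50` and the toy sequence are assumptions made for this example. It splits a scaffold into its contigs at runs of ‘N’, lists the gap lengths, and compares the contig N50 with the scaffold N50.

```python
import re


def split_scaffold(scaffold):
    """Return (contigs, gap_lengths) for a scaffold string.

    Contigs are maximal runs of non-N bases; gaps are the runs of 'N'
    that separate them. Leading/trailing Ns are not counted as gaps.
    """
    seq = scaffold.upper()
    contigs = [c for c in re.split(r"N+", seq) if c]
    gap_lengths = [len(g) for g in re.findall(r"N+", seq.strip("N"))]
    return contigs, gap_lengths


def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy scaffold (hypothetical data): three contigs joined by two
# placeholder gaps of a fixed 100 Ns each.
scaffold = "ACGTACGTAC" + "N" * 100 + "GGCATT" + "N" * 100 + "TTGACCGA"

contigs, gaps = split_scaffold(scaffold)
contig_n50 = n50([len(c) for c in contigs])
scaffold_n50 = n50([len(scaffold)])

print("contig lengths:", [len(c) for c in contigs])
print("gap lengths:", gaps)
print("contig N50:", contig_n50)
print("scaffold N50:", scaffold_n50)
```

In this toy case the scaffold N50 is far larger than the contig N50 even though most of the scaffold consists of placeholder Ns, which is why the two statistics should not be compared directly.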
Scaffolding PacBio assemblies for chromosome-scale genome representations
For even longer-range genomic connectivity, e.g. to bridge the largest segmental duplications and repeat regions, researchers can go a step further by adding scaffolding information to a PacBio assembly, often resulting in telomere-to-telomere, chromosome-scale genome representations. Several methods have been demonstrated to work very well for this purpose, including optical mapping and crosslinking approaches. Check out examples of maize and human genome sequencing to see how chromosome-level scaffolding enables more comprehensive insights.
- Chin, CS et al. (2016) Phased diploid genome assembly with Single Molecule Real-Time Sequencing. bioRxiv.
- Gordon D. et al. (2016) Long-read sequence assembly of the gorilla genome. Science. 352 (6281), aae0344.
Alzheimer’s disease (AD) is a devastating neurodegenerative disease that affects ~44 million people worldwide, making it the most common form of dementia. Pathologically it is defined by severe neuronal loss, aggregation of amyloid β (Aβ) in extracellular senile plaques in the brain, and formation of intraneuronal neurofibrillary tangles consisting of hyperphosphorylated tau protein. Studies looking into disease mechanism have shown that changes in gene expression due to alternative splicing likely contribute to the initiation and progression of AD. Hence, efforts have been made to better understand the gene expression changes in the AD brain by sequencing the transcriptome of affected brain regions.
Most transcriptome studies conducted to date have used short-read sequencing technologies, which provide the abundance of transcript reads needed for evaluating expression profiles. However, the ability to accurately identify alternative splicing and the associated expression patterns for different splice isoforms is limited by the short read lengths. Given that the average human gene transcript is several kilobases long, a 150 bp to 300 bp read will fail to span the entire transcript, so assembly is required. In most cases this process is very difficult, if not impossible, given the high similarity between expressed gene isoforms.
Recent studies using the PacBio isoform sequencing (Iso-Seq) method demonstrated the advantages of obtaining high-quality, full-length transcript sequences for improving genome annotation [1], [2], identifying fusion cancer genes [3], and discovering novel alternative splicing patterns [4]. Here we apply the Iso-Seq method to an Alzheimer brain RNA sample. The purpose of releasing the dataset to the public is to provide researchers with a full-length transcriptome reference from which they can develop bioinformatics tools and validate their own findings.
In our final “confident” dataset, we obtained 21,742 high-quality, full-length isoforms covering 9,313 non-overlapping loci, ranging in length from 352 bp to 9,457 bp with an average of 3,400 bp (Fig. 1). Of the consensus bases, 0.036% are substitutions, 0.08% insertions, and 0.08% deletions relative to the hg38 genome, bringing the overall concordance with hg38 to 99.8%. More than half of the transcribed loci have one observed isoform, while most of the rest have about two to five isoforms (Fig. 2). When compared with the reference transcript annotation Gencode v25, more than a third of the isoforms match a reference transcript completely, while the majority of isoforms are possible novel splice forms of a known gene. In addition to the stringent “confident” dataset, we are also releasing a larger, less stringent “promiscuous” dataset. Details on the difference between the two versions can be found in the download section.
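As a quick check of the arithmetic behind these summary figures, the short Python sketch below recomputes the overall concordance and the average number of isoforms per locus from the numbers quoted above. The variable names are chosen for this illustration only.

```python
# Per-category disagreement with hg38, in percent of consensus bases
# (figures quoted in the post).
substitutions_pct = 0.036
insertions_pct = 0.08
deletions_pct = 0.08

discordance_pct = substitutions_pct + insertions_pct + deletions_pct
concordance_pct = 100.0 - discordance_pct

# Isoform and locus counts from the "confident" dataset.
isoforms = 21_742
loci = 9_313
isoforms_per_locus = isoforms / loci

print(f"discordant bases: {discordance_pct:.3f}%")   # 0.196%
print(f"concordance:      {concordance_pct:.1f}%")   # 99.8%
print(f"isoforms/locus:   {isoforms_per_locus:.1f}") # 2.3
```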
Library Preparation and Sequencing
An Alzheimer’s disease brain total RNA sample was purchased from BioChain. A first-strand cDNA library was generated using the Clontech SMARTer cDNA synthesis kit, followed by size selection using the SageELF™ device from Sage Science, with lanes combined to create five size libraries roughly corresponding to 1-2 kb, 2-3 kb, 3-5 kb, 5-7 kb, and > 7 kb. Sequencing was done using P6-C4 chemistry, with 3-hr movies for the 1-2 kb fraction and 4-hr movies for the remaining fractions. Sequencing was completed in 2015. Download details on the sample preparation procedure.
The standard Iso-Seq pipeline (ToFU version 2.2.3, equivalent to SMRT Analysis 3.1; for detailed methods see [5]) was used to process the data. Iso-Seq classify generated 1,107,889 full-length non-chimeric (FLNC) reads and 1,929,319 non-full-length (nFL) reads. The reads were then used to generate high-quality, full-length isoforms using ICE followed by Quiver polishing (HQ Quiver isoform consensus). By definition, an HQ Quiver consensus sequence must have at least two supporting full-length reads and predicted accuracy of >= 99%. The HQ Quiver consensus sequences were then aligned to human reference genome hg38 to create a final “confident” dataset of unique isoforms. To create the larger “promiscuous” dataset, additional consensus results that contained only one supporting full-length read were added. For details on the bioinformatics analysis, please see the README file on the Download Page.
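The selection rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the ToFU or SMRT Analysis code: the `Consensus` record, its field names, and the example values are hypothetical, and the sketch assumes that the 99% accuracy threshold also applies to the single-read consensus sequences added to the “promiscuous” set, which the description above does not state. The alignment to hg38 and the collapsing into unique isoforms are omitted.

```python
from dataclasses import dataclass


@dataclass
class Consensus:
    name: str
    full_length_reads: int      # number of supporting full-length reads
    predicted_accuracy: float   # fraction between 0 and 1


def is_hq(c, min_fl_reads=2, min_accuracy=0.99):
    """HQ rule from the text: >= 2 full-length reads and >= 99% predicted accuracy."""
    return c.full_length_reads >= min_fl_reads and c.predicted_accuracy >= min_accuracy


def is_promiscuous(c, min_accuracy=0.99):
    """HQ sequences plus single-read consensus results.

    Keeping the accuracy threshold for single-read results is an
    assumption of this sketch.
    """
    return c.full_length_reads >= 1 and c.predicted_accuracy >= min_accuracy


# Hypothetical consensus records for illustration.
candidates = [
    Consensus("c1", full_length_reads=5, predicted_accuracy=0.999),
    Consensus("c2", full_length_reads=1, predicted_accuracy=0.995),
    Consensus("c3", full_length_reads=3, predicted_accuracy=0.970),
]

confident = [c.name for c in candidates if is_hq(c)]
promiscuous = [c.name for c in candidates if is_promiscuous(c)]

print("confident:", confident)
print("promiscuous:", promiscuous)
```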
Figure 1. Length distribution of final, unique, full-length isoforms.
Number of isoforms: 21,742
Min-max length: 352 bp – 9457 bp
Average length: 3400 bp
Figure 2. Number of isoforms per locus. The 21,742 isoforms were grouped into 9,313 non-overlapping, strand-specific loci. The average number of isoforms per locus was 2.3.
We welcome researchers to download and use the dataset for their research. For citation of the dataset, please use:
The Alzheimer brain Iso-Seq dataset was generated by Pacific Biosciences, Menlo Park, California, and additional information about the sequencing and analysis is provided at https://downloads.pacbcloud.com/public/dataset/Alzheimer_IsoSeq_2016/. The data used in the present study was retrieved from PacBio’s online database at https://downloads.pacbcloud.com/public/dataset/Alzheimer_IsoSeq_2016/ (date of retrieval).
[1] B. Wang, E. Tseng, M. Regulski, T. A. Clark, T. Hon, Y. Jiao, Z. Lu, A. Olson, J. C. Stein, and D. Ware, “Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing,” Nat Comms, vol. 7, p. 11708, Jun. 2016.
[2] S. E. Abdel-Ghany, M. Hamilton, J. L. Jacobi, P. Ngam, N. Devitt, F. Schilkey, A. Ben-Hur, and A. S. N. Reddy, “A survey of the sorghum transcriptome using single-molecule long reads,” Nat Comms, vol. 7, p. 11706, Jun. 2016.
[3] J. L. Weirather, P. T. Afshar, T. A. Clark, E. Tseng, L. S. Powers, J. G. Underwood, J. Zabner, J. Korlach, W. H. Wong, and K. F. Au, “Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing,” Nucleic Acids Research, vol. 43, no. 18, pp. e116–e116, Oct. 2015.
[4] D. I. Pretto, J. S. Eid, C. M. Yrigollen, H.-T. Tang, E. W. Loomis, C. Raske, B. Durbin-Johnson, P. J. Hagerman, and F. Tassone, “Differential increases of specific FMR1 mRNA isoforms in premutation carriers,” J Med Genet, vol. 52, no. 1, pp. 42–52, Dec. 2014.
[5] S. P. Gordon, E. Tseng, A. Salamov, J. Zhang, X. Meng, Z. Zhao, D. Kang, J. Underwood, I. V. Grigoriev, M. Figueroa, J. S. Schilling, F. Chen, and Z. Wang, “Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing,” PLoS ONE, vol. 10, no. 7, p. e0132628, Jul. 2015.
A recent cover story in New Zealand Geographic vividly details the efforts to sequence not just the kākāpō genome, but the genomes of every single living kākāpō.
If you missed our earlier blog about this bird, the kākāpō is a member of the parrot family known for its unique attributes: it’s heavy, flightless, and mostly active at night. As author Rebekah White reports in “Decoding Kākāpō,” the remaining members of this species — about 125 of them — live on islands near New Zealand.
White recounts how scientist Jason Howard, a member of Erich Jarvis’s lab at Duke, first became interested in this unusual bird, and pushed to have its genome sequenced as part of the B10K project. After hitting obstacles with other sequencing technologies, Howard found the PacBio sequencing platform, which was finally able to get through the kākāpō genome.
Meanwhile, New Zealand native David Iorns had gotten involved, launching a crowdfunding campaign to resequence every living kākāpō, relying on the PacBio reference assembly to streamline the process. The campaign was part of our Genome Galaxy Initiative and was successfully funded earlier this year. The wealth of genomic information will be used to help save the birds, which are so inbred due to a recent population bottleneck that they struggle to reproduce naturally.
According to White, delivery of the PacBio genome data to the conservation geneticist in charge of the final kākāpō assembly “was as though all his Christmases had come at once.” The genome data will be publicly available, allowing scientists around the world to use this information to better understand evolution, traits like vocal learning, and much more.
We’re lucky to have such a vibrant community of SMRT Sequencing users, and there’s nothing better than getting them together to share tips, exchange ideas, and develop new applications. These upcoming events will facilitate just that — and we hope you can join us!
Frances C. Arrillaga Alumni Center at Stanford University, Palo Alto, Calif.
Our annual West Coast event taking place next week will include the usual day-long user group meeting as well as half-day workshops on bioinformatics and sample preparation, which will take place on September 7th at PacBio headquarters in Menlo Park. In addition to PacBio speakers who will update attendees on the technology roadmap and new applications of SMRT Sequencing, confirmed customer presentations include:
- Yahya Anvar (Leiden University Medical Center)
- Christine Beck (Baylor College of Medicine)
- Dario Cantu (UC Davis)
- Alexandra Dainis (Stanford University)
- Timothy Smith (USDA-ARS)
- Tina Graves-Lindsay (McDonnell Genome Institute)
- Stephen Mondo (Joint Genome Institute)
- Amanda Larracuente (University of Rochester)
- Jason Underwood (PacBio/University of Washington)
Hilton Washington DC North/Gaithersburg
This one-day event will take place right before the Genome in a Bottle Workshop hosted at the NIST campus. With a focus on collaborative informatics for developing and improving data analysis tools for SMRT Sequencing, topics will range from de novo assembly and genome phasing to structural variation, Iso-Seq analysis, and much more. Confirmed speakers include:
- Gene Myers (Max Planck Institute)
- Kin Fai Au (University of Iowa)
- Ali Bashir (Mount Sinai School of Medicine)
- Brett Bowman (PacBio)
- Andrew Carroll (DNAnexus)
- Jason Chin (PacBio)
- Richard Hall (PacBio)
- Sergey Koren (NHGRI)
- Maria Nattestad (CSHL)
- Nik Putnam (Dovetail Genomics)
- Mike Schatz (Johns Hopkins University)
- Yuta Suzuki (University of Tokyo)
- Elizabeth Tseng (PacBio)
- Aleksey Zimin (University of Maryland)
EMEA User Group Meeting
Hesperia Tower Hotel & Convention Center in Barcelona
Our Annual EMEA User Group Meeting will be returning to beautiful Barcelona this winter. Through a combination of presentations and breakout sessions, the meeting will provide a unique opportunity for our users to present how they are applying SMRT Sequencing to their research needs and to share best practices.
On Thursday, December 1st, the event will kick off at 14:00 with presentations and workshops, followed by an evening reception. Presentations and workshops by both users and PacBio staff will resume the following day.
More information to come soon!
A case study produced by QRIScloud, an Australia-based cloud computing service, offers interesting insight into a recent project that is using SMRT Sequencing to generate a reference-quality de novo genome assembly for the grape used to make Chardonnay wine.
The sequencing effort was conducted by collaborating scientists at the Australian Wine Research Institute (AWRI) and the BC Genome Sciences Centre in Canada. This new assembly, which is still undergoing polishing and in-depth analysis, adds to very sparse genome resources for wine grapes. Until recently, the only genome assemblies available were draft-quality ones for the Pinot Noir varietal.
With PacBio long-read sequencing, scientists were able to create an assembly of dramatically higher quality, despite the complex, highly heterozygous genome. “Unraveling the underlying genetic complexities of grapevine and how genetic variation shapes wine quality is critical,” said Simon Schmidt, a project leader and senior research scientist at AWRI. “It can facilitate vine selection and enable the tailoring of wines by winemakers, allowing them to meet ever-changing consumer demands and access new markets.”
Schmidt added, “The combination of PacBio long-read sequencing, QRIScloud bioinformatics infrastructure and newly developed haplotype aware assembly software has already provided a high-quality draft genome.” The team is working on incorporating clone sequence data to improve quality even further.
Stanford’s Euan Ashley wrote a terrific review about the clinical use of genome sequencing for Nature Reviews Genetics. “Towards Precision Medicine” is well worth a read, covering topics from the ethnic background of the human reference genome to public interest in precision medicine. He also covers technical angles such as mapping of sequence reads for variant calling across challenging regions of the genome with known clinical significance.
Ashley’s premise is that many of the current standards in genomics — from sequencers to analysis tools and more — were developed for use in basic research, where the consequences of inaccurate information are less severe than they would be in a clinical setting. Throughout the review, he considers what challenges need to be overcome “to bring genomics up to clinical grade.”
What caught our attention was Ashley’s excellent description of the genomic elements that make the human genome so difficult to interpret accurately: repetitive sequence, structural variants, segmental duplications, and so on. “Much of this genomic complexity is only challenging because of the prevailing technology used to assess it: short-read sequencing,” he writes. “With extensive paralogy, originating in gene families, segmental duplication or pseudogenes, the genomic location of many short reads cannot be determined with confidence.” Repeat expansion disorders, such as Huntington disease, are marked by a long series of simple repeats that are much longer than a short read, making it all but impossible to reconstruct these regions accurately with short-read sequencers.
In another example, he cites regions like the famously polymorphic major histocompatibility complex (MHC) as stumbling blocks for short-read sequencers. “The MHC is challenging to resolve using only short-read approaches because of the lack of a comprehensive catalogue of haplotypes and the intrinsic lack of phase information — that is, knowledge of the parental chromosome of origin — in short reads,” he notes, adding that phasing data is important for a variety of clinical applications, including phasing of the HLA genes housed in this region, which are associated with more than 100 diseases and many drug reactions.
Ashley sees long-read sequencing as a potential solution to many of these problems. “Long-read sequencing facilitates de novo assembly that automatically provides phase information,” he writes. “Such sequencing provides a more complete picture of the genome.” Long reads can easily span structural variants and even long stretches of repeats, making it possible to fully reconstruct these clinically relevant regions. Ashley notes that these larger structural variants have much lower variant calling accuracy with short-read sequencing methods due to their size and issues related to mapping ambiguity. He also points out that “variants that are more disruptive of the open reading frame, such as structural variants (SVs), are generally more likely to cause disease,” and highlights over 25 clinical disorders that are caused by pathogenic structural variants as an example.
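To illustrate why read length matters for phasing, here is a minimal Python sketch. It is a toy model, not a description of any method in Ashley’s review: the function name `phase_two_sites`, the positions, and the reads are all hypothetical. A read can only link the alleles at two heterozygous sites if it covers both of them, so short reads that each cover a single site carry no phase information.

```python
from collections import Counter


def phase_two_sites(reads, site_a, site_b):
    """Count allele pairs observed on reads that cover both sites.

    `reads` is a list of dicts mapping genomic position -> observed base.
    Reads covering only one of the two sites are uninformative for phasing.
    """
    pairs = Counter()
    for read in reads:
        if site_a in read and site_b in read:
            pairs[(read[site_a], read[site_b])] += 1
    return pairs


# Two heterozygous sites 8 kb apart (hypothetical coordinates).
long_reads = [
    {1000: "A", 9000: "G"},
    {1000: "A", 9000: "G"},
    {1000: "C", 9000: "T"},
]
short_reads = [{1000: "A"}, {9000: "G"}, {1000: "C"}, {9000: "T"}]

long_pairs = phase_two_sites(long_reads, 1000, 9000)
short_pairs = phase_two_sites(short_reads, 1000, 9000)

print("long reads:", dict(long_pairs))    # haplotypes A-G and C-T are linked
print("short reads:", dict(short_pairs))  # no read links the two sites
```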
Ashley ends by providing a path forward for improved accuracy in clinical genomics through “Reducing reliance on reference sequences, making phasing routine, improving calling of indels and structural variants, characterizing complex areas of the genome through long-read sequencing and maximizing the cost effectiveness of genomic coverage.” He also reminds us of how far we’ve come, and what the future holds when we get there. “Fueled by technological advancement, fundamental discovery of genetic elements related to health and disease has been the engine of human genetics for decades,” Ashley concludes. “Building on this foundation, precision medicine will use the knowledge gained to redefine disease, to realize new therapies and to provide hope for generations of patients to come.”
Scientists from Rutgers University and the University of California, Davis, used SMRT Sequencing to study structural variation in maize. They found that this approach delivered more complete information at lower cost than standard methods and generated new findings that could be important for crop breeding.
From lead author Jiaqiang Dong, senior author Jo Messing, and collaborators, “Analysis of tandem gene copies in maize chromosomal regions reconstructed from long sequence reads” was published in PNAS recently. They chose to evaluate SMRT Sequencing for copy number detection as an alternative to short-read sequencing, which doesn’t span long repeats, and BAC cloning, which is prohibitively expensive. “The single most critical parameter is the length of each sequence read to establish overlaps without the need of genomic clone libraries,” the authors write. “Therefore, we tested the new SMRT technology to determine whether we could assemble chromosomal regions from one shotgun DNA sequencing dataset that would comprise large tandem gene copies.”
They chose maize because of its high proportion of repetitive sequence — repeats make up a remarkable 85% of its genome — and focused on the alpha zein gene family. Spread across six chromosomes, the gene family is important because it “acts as a sink for reduced nitrogen in the seed,” the authors explain. In other maize strains, as many as 48 copies of these genes were found.
The team notes that the average read length generated by the PacBio System was “26 times longer than Illumina and 8 times longer than ABI3730, providing us with significantly more contiguous information for shotgun DNA sequence assemblies.” This long-read data enabled the comprehensive genomic picture that the scientists were hoping for: “Based on this high-quality single shotgun DNA sequencing dataset, we were able to use zein gene sequences as digital probes to assemble the entire collection of orthologous regions from [the W22 strain],” they report. A detailed analysis demonstrated that the self-corrected SMRT Sequencing data had an error rate of less than 0.1%.
The use of SMRT Sequencing proved useful “for resolution of large complex repeats or tandem/dispersed gene family clusters,” the scientists conclude. “Given the effectiveness of this approach in maize, we anticipate that it will be of general use with any complex genome including human and, in particular, cancer genomics, where structural changes can be dramatic.”
PacBio customer HistoGenetics was just awarded a major, multi-year contract to perform HLA typing on up to thousands of samples per week using SMRT Sequencing. The company is a pioneer and global leader in high-resolution sequence-based HLA typing services.
As blog readers know, HLA typing involves analysis of the highly polymorphic human leukocyte antigen (HLA) genes located within the major histocompatibility complex (MHC) on chromosome 6. Accurate HLA typing is essential for research on donor-recipient tissue matching during transplantation, autoimmune disease-association studies, drug hypersensitivity research, and several other applications. But the complexity of the region, which contains thousands of possible alleles, has made it challenging to resolve with short-read sequencing technologies.
With its long reads and high consensus accuracy, SMRT Sequencing has been a natural fit for scientists trying to analyze the HLA region.
In an announcement about this news, HistoGenetics CEO Nezih Cereb said, “In HLA typing, there is no room for errors. The combination of PacBio’s high accuracy and long read lengths to accurately sequence and phase HLA genes is the new gold standard in the field.”
For more, check out this video of Dr. Cereb explaining the utility of SMRT Sequencing for HLA.
A recent publication from scientists at the University of Florida and the University of Missouri used SMRT Sequencing to analyze epigenomic changes that occur when free-living bacteria associate with a host and become symbiotic instead.
Published in Frontiers in Microbiology, “Integrating DNA Methylation and Gene Expression Data in the Development of the Soybean-Bradyrhizobium N2-Fixing Symbiosis” comes from a team of collaborators including lead author Austin Davis-Richardson and senior author Eric Triplett. The scientists aimed to assess the role of epigenetics in bacterial evolution from free-living to symbiont and chose SMRT Sequencing because it generates base-specific modification information as it sequences DNA.
In its symbiotic state, the nitrogen-fixing Bradyrhizobium diazoefficiens colonizes the roots of soybean plants. Previous studies had reported significant differences in gene expression as the bacterium underwent the transition from free-living to symbiont, leading this team to investigate how much of that could be explained by changes in methylation. They sequenced a free-living bacterium and an endosymbiont, generating a single finished 9 Mb contig representing the genome of the free-living organism. This was used as a reference genome for assembly of the symbiotic strain and made it possible for scientists to identify the 681 kb symbiosis island, which contains genes known to be involved in the infection process.
The methylation analysis found 3,276 changes in five DNA motifs between the strains; 768 of them were associated with differentiation from free-living to symbiotic state, representing more than 9% of all genes and more than 35% of genes with differential expression between the strains. Of the altered genes, 80 were located in the symbiosis island but currently have no known function in the differentiation process. Of the five DNA motifs found, four had increased methylation in the free-living bacterium, and intriguingly, the fifth — the 5’-CCTTGm6AG-3’ motif — was only seen in the symbiont.
“These associations between methylation and expression changes in many B. diazoefficiens genes suggest an important role of the epigenome in bacterial differentiation to the symbiotic state,” the scientists report, noting that follow-up studies with more replicates will be needed to fully test the hypothesis.
Scientists from Yale University and Memorial Sloan Kettering Cancer Center used SMRT Sequencing to determine whether antiretroviral therapies were triggering mitochondrial genome mutations in HIV patients. The results were recently published in HIV Medicine (“High frequency of mitochondrial DNA mutations in HIV-infected treatment-experienced individuals”).
The publication, from lead author Min Li, senior author Elijah Paintsil, and collaborators, reports results from an analysis of 71 people, including 47 HIV patients who had received antiretroviral therapy (about half had mitochondrial toxicity) and 24 healthy controls. DNA was isolated from peripheral blood mononuclear cells and mitochondrial genome sequencing performed on a PacBio System.
SMRT Sequencing was chosen for its long reads and accuracy, according to the authors. “PacBio technology overcomes some of the limitations of current next-generation sequencing platforms by providing significantly longer reads (> 1 kb), single molecule sequencing, and a single-pass error rate of < 15%,” Li et al. write. “Moreover, SMRT sequencing exceeds the consensus accuracy achieved by other sequencing methods because of the random nature of the errors. The SMRT sequencing achieves results with > 99.999% accuracy.”
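The link the authors draw between random errors and high consensus accuracy can be illustrated with a deliberately simplified model. The Python sketch below is not the consensus algorithm used by PacBio software; it assumes that each pass over a base is wrong independently with probability `p` and that the consensus is a simple majority vote, and the function name `majority_error` is made up for this example. Under those assumptions, the chance that the majority is wrong falls rapidly as the number of passes grows.

```python
from math import comb


def majority_error(p, n):
    """Probability that more than half of n independent passes are wrong.

    Assumes each pass errs independently with probability p and that all
    erroneous passes agree with each other (a pessimistic simplification).
    Intended for odd n, where a strict majority always exists.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


single_pass_error = 0.15  # upper bound quoted by Li et al.
for passes in (1, 5, 15, 31):
    err = majority_error(single_pass_error, passes)
    print(f"{passes:>2} passes: consensus error ~ {err:.2e}")
```

Systematic, context-dependent errors would not average out this way, which is why the random nature of the errors is the key property in the quoted statement.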
The team found that HIV-infected patients, regardless of whether they experienced mitochondrial toxicity or not, had a higher frequency of mitochondrial mutations — in particular, large-scale deletions — than their healthy counterparts. However, they did not find statistically significant differences in mutation load between HIV patients with and without mitochondrial toxicity. “We did not observe mtDNA depletion in HIV-infected treatment-experienced patients with toxicity as expected,” the scientists report. “To our surprise, HIV-infected treatment-experienced patients with toxicity had significantly higher mRNA expression of Pol-γ in comparison with HIV-infected treatment-experienced patients without toxicity (P < 0.05) and HIV-uninfected controls (P < 0.01). This contradicts the Pol-γ inhibition theory.”
The authors attribute some of these novel findings to sequencing and analyzing the whole mitochondrial genome, whereas previous studies have taken less comprehensive approaches. The work also paves the way for increased use of peripheral blood mononuclear cells to characterize the biology of HIV patients, compared to the more commonly used invasive tissue biopsies.
Additional studies will be needed to follow up on these discoveries, the scientists note. “There are currently two proposed mechanisms to explain the association between mtDNA mutations and [antiretroviral therapy (ART)],” they write, “ART directly or indirectly provides a permissive environment for sporadic mtDNA mutagenesis; and ART pressure leads to clonal expansion of existing mutations.”
A paper from scientists at Colorado State University and the National Center for Genome Resources provides an in-depth view of the transcriptome of sorghum, a crop that’s important for human and animal food and also shows potential as a biofuel. Through this project, the team produced a new isoform analysis pipeline for community use and identified novel genes, as well as far more alternative splicing than had been expected for this plant.
The publication, “A survey of the sorghum transcriptome using single-molecule long reads,” comes from lead author Salah Abdel-Ghany, senior author Anireddy Reddy, and collaborators. The researchers were particularly interested in alternative splicing and alternative polyadenylation, two mechanisms that increase transcript diversity and may help plants adapt to stress. “Despite the fact that several large-scale RNA-seq studies have been performed in plants to analyse [alternative splicing], currently it is not known how many distinct splice isoforms are produced,” the team writes. “This is primarily due to challenges associated with short-read sequencing in accurately reconstructing full-length splice variants.”
To directly observe full-length splice isoforms, they turned to long-read SMRT Sequencing to characterize the transcriptome of sorghum seedlings. The scientists also developed the Transcriptome Analysis Pipeline for Isoform Sequencing, or TAPIS, to identify alternative splicing events and evidence of alternative polyadenylation. “The analysis of sorghum Iso-Seq data uncovered over 7,000 novel [alternative splicing] events, ~11,000 novel splice isoforms, over 2,100 novel genes and several thousand transcripts that differ in 3′ untranslated regions due to [alternative polyadenylation],” the team reports, noting that many of the novel genes are putative long non-coding transcripts. The total number of unique transcripts was nearly 28,000, covering more than 14,000 genes.
The scientists discovered a significantly higher rate of alternative splicing than had been expected for sorghum. “Previously, it was reported that pre-mRNAs of ~1,500 genes undergo [alternative splicing] in sorghum,” they note. “In this work, we demonstrate that this number is much higher.” Indeed, they found more than 10,000 alternative splicing events, compared with fewer than 3,000 included in existing gene models. The scientists performed a number of validation studies to confirm these and other novel findings and have made their TAPIS tool available for use in other organisms.
“Pacific Biosciences single-molecule long reads obtained using the Iso-Seq protocol offer a considerable advantage in transcriptome-wide identification of full-length splice isoforms and other forms of post-transcriptional regulatory events such as [alternative polyadenylation],” the team writes.
A paper just out in Nature Communications reports the de novo genome assembly and transcriptome of a Chinese individual, generated from long-read SMRT Sequencing and other technologies. The effort revealed nearly 13 Mb of sequence not included in the GRCh38 reference genome as well as novel gene and alternative splicing content not annotated in GENCODE.
“Long-read sequencing and de novo assembly of a Chinese genome” comes from lead author Lingling Shi at Jinan University and senior author Kai Wang from the University of Southern California, as well as many other collaborators in China and the US. The team was particularly interested in finding population-specific variants, including structural variants, which required the use of long-read sequencing. Assemblies based on short-read sequence data “may have inherent technical limitations in characterizing repeat elements that span longer than the read length, yet repeats and segmental duplications are known to cover approximately half of the human genome,” the scientists write. Using SMRT Sequencing and mapping technology from BioNano Genomics, “we perform detailed characterization of the HX1 genome and demonstrate that long-read sequencing can detect functional elements in human genomes that are missed by short-read sequencing.”
For the genome assembly, the team sequenced DNA from an anonymous Chinese individual (HX1) to 103x coverage, producing a 2.93 Gb genome with a contig N50 of 8.3 Mb. Included in the results were 206 Mb of alternative haplotypes that “were constructed along with the primary contigs,” Shi et al. write. Consensus accuracy for the assembly was 99.73%, matching the accuracy of the well-known NA12878 genome assembly. In an analysis of structural variants, the team found about 20,000 insertions and deletions, half of which were classified as short tandem repeats or mobile elements. Nearly 50 exonic deletions or insertions were specific to the HX1 genome, including one previously characterized deletion that has been seen only in Asian populations.
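For readers less familiar with the contig N50 statistic quoted for this assembly, here is a minimal Python sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions; they are not HX1 data, and this is not the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half the
        # assembly length defines the N50.
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: the two longest contigs cover 150 of 300
```

Read this way, a contig N50 of 8.3 Mb means that at least half of the assembled sequence sits in contigs of 8.3 Mb or longer.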
The team also developed a new gap-filling method to make use of all this sequence data. They determined that nearly 30% of gaps in the GRCh38 reference genome could be addressed with data from HX1. “The total length of filled or shortened gaps amounts to 7.1 Mb,” they report. “We further evaluated the repeat contents within the gaps that can be closed by us, and found that simple repeats and satellite sequences were significantly enriched within the closed gaps compared with GRCh38.”
Using the Iso-Seq method, the scientists also analyzed the transcriptome of this individual and detected more than 58,000 isoforms, including “57 isoforms at 42 loci that do not overlap with any GENCODE transcript,” they write. Follow-up studies for some of the more complex data — such as “a novel transcribed element with at least five exons and six isoforms” — validated these predicted splicing events. They also found at least two genes that have never been identified with short-read data. The team looked for disease-causing variants, finding two that were classified in ClinVar as pathogenic. However, “manual review of the literature cited in the two ClinVar records indicated that both of them represented erroneous database records,” the scientists report. “This analysis highlights the need for extreme caution in interpreting ‘pathogenic’ variants documented in variant databases.”
“In summary, while short-read-based alignment and variant calling based on reference genome remain a common practice to assay personal genomes, de novo assembly by long-read sequencing may reveal novel and complementary biological insights,” Shi et al. conclude. “Furthermore, long-read RNA sequencing may identify novel transcripts that can be missed by short-read RNA sequencing.”