In recent years, human genomic research has focused on comparing short-read data sets to a single human reference genome. However, it is becoming increasingly clear that significant structural variations present in individual human genomes are missed or ignored by this approach. Additionally, remapping short-read data limits the phasing of variation among individual chromosomes. This reduces the newly sequenced genome to a table of single nucleotide polymorphisms (SNPs) with little to no information as to the co-linearity (phasing) of these variants, resulting in a “mosaic” reference representing neither of the parental chromosomes. The variation between the homologous chromosomes is lost in this representation, including allelic variations, structural variations, or even genes present in only one chromosome, leading to lost information regarding allele-specific gene expression and function. To address these limitations, we have made significant progress integrating haplotype information directly into the genome assembly process with long reads. The FALCON-Unzip algorithm leverages a string graph assembly approach to facilitate identification and separation of heterozygosity during the assembly process to produce a highly contiguous assembly with phased haplotypes representing the genome in its diploid state. The outputs of the assembler are pairs of sequences (haplotigs) containing the allelic differences, including SNPs and structural variations, present in the two sets of chromosomes. The development and testing of our de novo diploid assembler was facilitated and carefully validated using inbred reference model organisms and F1 progeny, which allowed us to ascertain the accuracy and concordance of haplotigs relative to the two inbred parental assemblies. Examination of the results confirmed that our haplotype-resolved assemblies are “Gold Level” reference genomes having a quality similar to that of Sanger-sequencing, BAC-based assembly approaches.
We further sequenced and assembled two well-characterized human samples into their respective phased diploid genomes with gap-free contig N50 sizes greater than 23 Mb and haplotig N50 sizes greater than 380 kb. Results of these assemblies and a comparison between the haplotype sets are presented.
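The contig and haplotig N50 values quoted above follow the usual definition of N50: the length of the contig at which, walking from longest to shortest, the running total first reaches half of the assembled bases. A minimal sketch of that definition is given here; the function name and the toy lengths are illustrative assumptions and are not taken from the assemblies described.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half of the
    # assembled bases have been accounted for.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (not data from the study): total is 100 bases,
# and the two longest contigs (40 + 30) are the first to reach half of it.
toy_contigs = [40, 30, 20, 5, 5]
print(n50(toy_contigs))  # prints 30
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstracts pair it with accuracy and completeness measures.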
Assessing diversity and clonal variation of Australia’s grapevine germplasm: Curating the FALCON-Unzip Chardonnay de novo genome assembly
Until recently, only two genome assemblies were publicly available for grapevine, both Vitis vinifera L. cv. Pinot Noir (PN). The best available PN genome assembly (Jaillon et al. 2007) is highly fragmented and is not representative of the genome complexity typical of wine-grape cultivars in the field. To assess the genetic complexities of Chardonnay grapevine, assembly of a new de novo reference genome was needed. Here we describe a draft assembly using PacBio SMRT Sequencing data and PacBio’s new phased diploid genome assembler FALCON-Unzip (Chin et al. 2016).
While genome assembly projects have been successful in many haploid and inbred species, the assembly of non-inbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
A high-quality genome assembly of SMRT Sequences reveals long-range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, chikungunya, and other diseases. The outbreak of Zika in the Americas, which can cause microcephaly in the fetus of infected women, adds urgency to the need for a high-quality reference genome in order to better understand the organism’s biology and its role in transmitting human disease. We describe the first diploid assembly of an insect genome, using SMRT sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
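The base-quality figure quoted above (mean base >QV30) can be read through the standard Phred relationship, QV = -10 · log10(p), where p is the per-base error probability; QV30 therefore corresponds to roughly one error per 1,000 bases. The sketch below illustrates the conversion only. The function names are assumptions made for this example, and it does not reproduce how the quality value was measured for the assembly.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# QV30 corresponds to about one error per thousand bases.
print(qv_to_error_rate(30))
print(error_rate_to_qv(0.001))
```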
De novo PacBio long-read assembled avian genomes correct and add to genes important in neuroscience and conservation research
To test the impact of high-quality genome assemblies on biological research, we applied PacBio long-read sequencing in conjunction with the new, diploid-aware FALCON-Unzip assembler to a number of bird species. These included: the zebra finch, for which a consortium-generated, Sanger-based reference exists, to determine how the FALCON-Unzip assembly would compare to the current best references available; Anna’s hummingbird genome, which had been assembled with short-read sequencing methods as part of the Avian Phylogenomics phase I initiative; and two critically endangered bird species (kakapo and ‘alala) of high importance for conservation efforts, whose genomes had not previously been sequenced and assembled.
A high-quality genome assembly of SMRT sequences reveals long range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, and chikungunya. We describe the first diploid assembly of an insect genome, using SMRT Sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30 after polishing). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence with dramatic differences in coding sequence content, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis and genome annotation
A single gene may encode a surprising number of proteins, each with a distinct biological function. This is especially true in complex eukaryotes. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. The PacBio Isoform Sequencing (Iso-Seq™) method employs long reads to sequence transcript isoforms from the 5’ end to their poly-A tails, eliminating the need for transcript reconstruction and inference. These long reads result in complete, unambiguous information about alternatively spliced exons, transcriptional start sites, and polyadenylation sites. This allows for the characterization of the full complement of isoforms within targeted genes, or across an entire transcriptome. Here we present improved genome annotations for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata), using the Iso-Seq method. We present graphical user interface and command line analysis workflows for the data sets. From brain total RNA, we characterize more than 15,000 isoforms in each species, 9% and 5% of which were previously unannotated in hummingbird and zebra finch, respectively. We highlight one example where capturing full-length transcripts identifies additional exons and UTRs.
A high-quality reference genome is an essential resource for plant and animal breeding and for functional and evolutionary studies. The common hop (Humulus lupulus, Cannabaceae) is an economically important crop plant used to flavor and preserve beer. Its genome is large (flow cytometry-based estimates of diploid length >5.4 Gb¹) and highly repetitive, and individual plants display high levels of heterozygosity, all of which make assembly of an accurate and contiguous reference genome challenging with conventional short-read methods. We present a contig assembly of Cascade Hops using PacBio long reads and the diploid genome assembler FALCON-Unzip². The assembly has dramatically improved contiguity and completeness over earlier short-read assemblies. The genome is primarily assembled as haplotypes due to the outbred nature of the organism. We explore patterns of haplotype divergence across the assembly and present strategies to deduplicate haplotypes prior to scaffolding.
High-quality de novo genome assembly and intra-individual mitochondrial instability in the critically endangered kakapo
The kakapo (Strigops habroptila) is a large, flightless parrot endemic to New Zealand. It is highly endangered with only ~150 individuals remaining, and intensive conservation efforts are underway to save this iconic species from extinction. These include genetic studies to understand critical genes relevant to fertility, adaptation and disease resistance, and genetic diversity across the remaining population for future breeding program decisions. To aid with these efforts, we have generated a high-quality de novo genome assembly using PacBio long-read sequencing. Using the new diploid-aware FALCON-Unzip assembler, the resulting genome of 1.06 Gb has a contig N50 of 5.6 Mb (largest contig 29.3 Mb), making it >350 times more contiguous than a recent short-read assembly of a closely related parrot species (kea). We highlight the benefits of the higher contiguity and greater completeness of the kakapo genome assembly through examples of fully resolved genes important in wildlife conservation (contrasted with fragmented and incomplete gene resolution in short-read assemblies), in some cases even providing sequence for regions orthologous to gaps of missing sequence in the chicken reference genome. We also highlight the complete resolution of the kakapo mitochondrial genome, fully containing the mitochondrial control region, which is missing from the previous dedicated kakapo mitochondrial genome NCBI entry. For this region, we observed a marked heterogeneity in the number of tandem repeats in different mtDNA molecules from a single bird tissue, highlighting the enhanced molecular resolution uniquely afforded by long-read, single-molecule PacBio sequencing.
Incomplete annotation of genomes represents a major impediment to understanding biological processes, functional differences between species, and evolutionary mechanisms. Often, genes that are large, embedded within duplicated genomic regions, or associated with repeats are difficult to study by short-read expression profiling and assembly. In addition, most genes in eukaryotic organisms produce alternatively spliced isoforms, broadening the diversity of proteins encoded by the genome, which are difficult to resolve with short-read methods. Short-read RNA sequencing (RNA-seq) works by physically shearing transcript isoforms into smaller pieces and bioinformatically reassembling them, leaving opportunity for misassembly or incomplete capture of the full diversity of isoforms from genes of interest. In contrast, Single Molecule, Real-Time (SMRT) Sequencing directly sequences full-length transcripts without the need for assembly and imputation. Here we apply the Iso-Seq method (long-read RNA sequencing) to detect full-length isoforms and the new IsoPhase algorithm to retrieve allele-specific isoform information for two avian models of vocal learning, Anna’s hummingbird (Calypte anna) and zebra finch (Taeniopygia guttata).
Plant and animal whole genome sequencing has proven to be challenging, particularly due to genome size, high density of repetitive elements and heterozygosity. The Sequel System delivers long reads, high consensus accuracy and uniform coverage, enabling more complete, accurate, and contiguous assemblies of these large complex genomes. The latest Sequel chemistry increases yield up to 8 Gb per SMRT Cell for long insert libraries >20 kb and up to 10 Gb per SMRT Cell for libraries >40 kb. In addition, the recently released SMRTbell Express Template Prep Kit reduces the time (~3 hours) and DNA input (~3 µg), making the workflow easy to use for multi-SMRT Cell projects. Here, we recommend the best practices for whole genome sequencing and de novo assembly of complex plant and animal genomes. Guidelines for constructing large-insert SMRTbell libraries (>30 kb) to generate optimal read lengths and yields using the latest Sequel chemistry are presented. We also describe ways to maximize library yield per preparation from as little as 3 µg of sheared genomic DNA. The combination of these advances makes plant and animal whole genome sequencing a practical application of the Sequel System.
FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)
Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. Studies of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach (FALCON-Unzip) does not require parental information and separates PacBio reads, during genome assembly, using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual, we scaffolded the FALCON-Phase contigs and re-phased them, creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
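The trio-based idea described above, partitioning the child's long reads into maternal and paternal bins using parent-specific sequences, can be illustrated with a toy k-mer sketch. This is not TrioCanu's implementation: the function names, the tiny value of k, the use of exact k-mer matches, and the toy sequences are all assumptions made for illustration, and practical concerns such as reverse complements, sequencing errors, and k-mer count thresholds are ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(reads, maternal_seqs, paternal_seqs, k=5):
    """Assign each read to the parent whose private k-mers it shares most."""
    maternal = set().union(*(kmers(s, k) for s in maternal_seqs))
    paternal = set().union(*(kmers(s, k) for s in paternal_seqs))
    # Keep only k-mers seen in exactly one parent (parent-specific).
    maternal_only = maternal - paternal
    paternal_only = paternal - maternal

    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in reads:
        read_kmers = kmers(read, k)
        m_hits = len(read_kmers & maternal_only)
        p_hits = len(read_kmers & paternal_only)
        if m_hits > p_hits:
            bins["maternal"].append(read)
        elif p_hits > m_hits:
            bins["paternal"].append(read)
        else:
            # No parent-specific evidence, or a tie.
            bins["unassigned"].append(read)
    return bins


# Hypothetical toy data: the parental sequences differ at a single base.
maternal_seqs = ["ACGTACGTTTGACCA"]
paternal_seqs = ["ACGTACGTTTCACCA"]
child_reads = ["GTTTGACCA", "GTTTCACCA", "ACGTACGT"]

print(bin_reads(child_reads, maternal_seqs, paternal_seqs))
```

In this toy case the third read falls in a region identical in both parents and so stays unassigned, which mirrors why homozygous regions carry no phasing signal.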
A low DNA input protocol for high-quality PacBio de novo genome assemblies from single invertebrate individuals
A high-quality reference genome is an essential tool for studies of plant and animal genomics. PacBio Single Molecule, Real-Time (SMRT) Sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. PacBio is the core technology for many large genome initiatives; however, relatively high DNA input requirements (5 µg for standard library protocol) have placed PacBio out of reach for many projects on small, non-inbred organisms that may have lower DNA content. Here we present high-quality de novo genome assemblies from single invertebrate individuals for two different species: the Anopheles coluzzii mosquito and the Schistosoma mansoni parasitic flatworm. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 50-100 ng of starting genomic DNA. The libraries were run on the Sequel System with chemistry v3.0 and software v6.0, generating 21-32 Gb of sequence per SMRT Cell with 20-hour movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting assemblies had high contiguity (contig N50s over 3 Mb for both species) and completeness (as determined by conserved BUSCO gene analysis). We were also able to resolve maternal and paternal haplotypes for one-third of the genome in both cases. By sequencing and assembling material from a single diploid individual, only two haplotypes are present, simplifying the assembly process compared to samples from multiple pooled individuals. This new low-input approach puts PacBio-based assemblies in reach for small, highly heterozygous organisms that comprise much of the diversity of life. The method presented here can be applied to samples with starting DNA amounts around 100 ng per 250 Mb to 1 Gb genome size.
A high-quality reference genome is an essential tool for studying the genetics of traits and disease, organismal, comparative and conservation biology, and population genomics. PacBio Single Molecule, Real-Time (SMRT) Sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives. However, relatively high DNA input requirements (3 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that may have lower DNA content or on projects with limited input DNA for other reasons. Here we present a modified SMRTbell library construction protocol without DNA shearing or size selection that can be used to generate a SMRTbell library from just 150 ng of starting genomic DNA. Remarkably, the protocol enables high-quality de novo assemblies from single invertebrate individuals and is applied to taxonomically diverse samples. By sequencing and assembling material from a single diploid individual, only two haplotypes are present, simplifying the assembly process compared to samples from multiple pooled individuals. The libraries were run on the Sequel System with chemistry v3.0 and software v6.0, generating ~11 Gb of sequence per SMRT Cell with 10-hour movies, followed by de novo genome assembly with FALCON. The resulting assemblies had high contiguity (contig N50s over 1 Mb) and completeness (as determined by conserved BUSCO gene analysis) when at least 30-fold unique molecular coverage was obtained. This new low-input approach now puts PacBio-based assemblies in reach for small, highly heterozygous organisms that comprise much of the diversity of life. The method presented here is scalable and can be applied to samples with starting DNA amounts of 150 ng per 300 Mb genome size.
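The figures quoted above (about 11 Gb per SMRT Cell, a 30-fold coverage target, and 150 ng of DNA per 300 Mb of genome) lend themselves to a back-of-the-envelope planning calculation. The sketch below assumes that fold coverage is simply sequence yield divided by genome size and that DNA input scales linearly with genome size; both are simplifying assumptions, the function names are hypothetical, and raw yield is only an upper bound on the unique molecular coverage the abstract refers to.

```python
import math


def fold_coverage(yield_gb, genome_size_gb):
    """Raw fold coverage: total sequenced bases divided by genome size."""
    return yield_gb / genome_size_gb


def cells_needed(target_coverage, genome_size_gb, yield_per_cell_gb):
    """Smallest whole number of SMRT Cells reaching the target raw coverage."""
    return math.ceil(target_coverage * genome_size_gb / yield_per_cell_gb)


def scaled_input_ng(genome_size_mb, ng_per_ref=150, ref_size_mb=300):
    """DNA input under an assumed linear scaling of 150 ng per 300 Mb."""
    return ng_per_ref * genome_size_mb / ref_size_mb


# A 300 Mb genome on one cell yielding about 11 Gb.
print(round(fold_coverage(11, 0.3), 1))
# Cells for 30-fold raw coverage of a hypothetical 1 Gb genome.
print(cells_needed(30, 1.0, 11))
# Input DNA for a hypothetical 600 Mb genome.
print(scaled_input_ng(600))
```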