AGBT 2013 Presentation Slides: Cold Spring Harbor Laboratory’s Michael Schatz presented strategies for de novo assembly of crop genomes with PacBio technology.
Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers in large genome complexities, such as long, highly repetitive, low-complexity regions and duplication events, and differentiating between transcript isoforms that are difficult to resolve with short-read technologies. We present solutions available for both reference genome improvement (>100 Mb) and transcriptome research to best leverage long reads that have exceeded 20 kb in length. Benefits for these applications are further realized with consistent use of size-selection of input sample using the BluePippin™ device from Sage Science. Highlights from our genome assembly projects using the latest P5-C3 chemistry on model organisms will be shared. Assembly contig N50 values have exceeded 6 Mb, and we observed a longest contig exceeding 12.5 Mb with an average base quality of QV50. Additionally, the value of long, intact reads to provide a no-assembly approach to investigate transcript isoforms using our Iso-Seq Application will be presented.
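For readers unfamiliar with the QV notation used above: a Phred-scaled quality value relates to the per-base error probability p as QV = -10 log10(p), so QV50 corresponds to roughly one expected error per 100,000 bases. The short Python sketch below shows the conversion in both directions; the function names are illustrative and do not come from any PacBio tool.

```python
import math


def qv_to_error_probability(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred-scaled QV."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # QV30 and QV50 are the two quality levels quoted in these abstracts.
    for qv in (30, 50):
        p = qv_to_error_probability(qv)
        print(f"QV{qv}: about 1 error per {round(1 / p):,} bases")
```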
Generating de novo reference genome assemblies for non-model organisms is a laborious task that often requires a large amount of data from several sequencing platforms and cytogenetic surveys. By using PacBio sequence data and new library creation techniques, we present a de novo, high quality reference assembly for the goat (Capra hircus) that demonstrates a primarily sequencing-based approach to efficiently create new reference assemblies for eukaryotic species. This goat reference genome was created using 38 million PacBio P5-C3 reads generated from a San Clemente goat using the Celera Assembler PBcR pipeline with PacBio read self-correction. To generate the assembly, corrected and filtered reads were pre-assembled into a consensus model using PBDAGCON and subsequently assembled using Celera Assembler version 8.2. We generated 5,902 contigs using this method with a contig N50 size of 2.56 megabases. To generate chromosome-sized scaffolds, we used the LACHESIS scaffolding method to identify cis-chromosome Hi-C interactions that link contigs together. We then compared our new assembly to the existing goat reference assembly to identify large-scale discrepancies. In our comparison, we identified 247 disagreements between the two assemblies consisting of 123 inversions and 124 chromosome-contig relocations. The high quality of these data illustrates how this methodology can be used to efficiently generate new reference genome assemblies without the use of expensive fluorescent cytometry or large quantities of data from multiple sequencing platforms.
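The contig N50 quoted in these abstracts is the length L such that contigs of length L or longer together contain at least half of all assembled bases. A minimal Python sketch of that calculation follows; the contig lengths are toy values chosen for illustration, not data from the goat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the
    assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical lengths
    print("N50 =", n50(toy_contigs))
```

Because N50 depends only on the length distribution, it measures contiguity, not correctness, which is why the abstracts pair it with comparisons against existing references and maps.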
The goat (Capra hircus) remains an important livestock species due to the species’ ability to forage and provide milk, meat and wool in arid environments. The current goat reference assembly and annotation borrows heavily from other loosely related livestock species, such as cattle, and may not reflect the unique structural and functional characteristics of the species. We present preliminary data from a new de novo reference assembly for goat that primarily utilizes 38 million PacBio P5-C3 reads generated from an inbred San Clemente goat. This assembly consists of only 5,902 contigs with a contig N50 size of 2.56 megabases which were grouped into scaffolds using cis-chromosome associations generated by the analysis of Hi-C sequence reads. To provide accurate functional genetic annotation, we utilized existing RNA-seq data and generated new data consisting of over 784 million reads from a combination of 27 different developmental timepoints/tissues. This dataset provides a tangible improvement over existing goat genomics resources by correcting over 247 misassemblies in the current goat reference genome and by annotating predicted gene models with actual expressed transcript data. Our goal is to provide a high quality resource to researchers to enable future genomic selection and functional prediction within the field of goat genomics.
Goat is an important source of milk, meat, and fiber, especially in developing countries. An advantage of goats as livestock is their low maintenance requirements and high adaptability compared to other milk producers. The global population of domestic goats exceeds 800 million. In Africa, goat production is characterized by low productivity levels, and attempts to introduce more productive breeds have met with poor success due in part to nutritional constraints. It has been suggested that incorporation of selective breeding within the herds adapted for survival could represent one approach to improving food security across Africa. A recently produced genome assembly of a Chinese Yunnan breed goat, based on 192 Gb of short reads across a range of insert sizes from 180 bp to 20 kb, reported a contig N50 of 18.7 kb. The scaffold N50 was improved from 2.2 Mb to 3.1 Mb by addition of fosmid end sequence, with an estimated 140 million Ns in gaps and 91% coverage. The assembly has proven somewhat problematic for pursuing genome-wide association analysis with SNP arrays, apparently due in part to errors in ordering of markers using the draft genome. In order to provide a higher quality assembly, we sequenced a highly inbred San Clemente breed goat genome using 458 SMRT cells on the Pacific Biosciences platform. These cells generated 193.5 Gb of sequence after processing into subreads, with a mean subread length of 5,110 bases and a maximum subread length of 40.5 kb. These sequence data were assembled using the recently reported MHAP error correction approach and Celera Assembler v8.2. The contig N50 was 2.5 Mb, with the largest contig spanning 19.5 Mb. Additional characteristics of the assembly will be presented.
2015 SMRT Informatics Developers Conference Presentation Slides: Sergey Koren of National Biodefense Analysis and Countermeasures Center (NBACC) provided an overview of the MHAP algorithm, a method for assembling large genomes with Single-Molecule Sequencing and locality-sensitive hashing. Using MHAP, Koren produced a human assembly (CHM1) with a contig N50 of >23 Mb.
Goats are specialized in dairy, meat and fiber production, being adapted to a wide range of environmental conditions and having a large economic impact in developing countries. In recent years, there have been dramatic advances in the knowledge of the structure and diversity of the goat genome/transcriptome and in the development of genomic tools, rapidly narrowing the gap between goat and related species such as cattle and sheep. Major advances are: 1) publication of a de novo goat genome reference sequence; 2) development of whole genome high density RH maps; and 3) design of a commercial 50K SNP array. Moreover, there are currently several projects aiming at improving current genomic tools and resources. An improved assembly of the goat genome using PacBio reads is being produced, and the design of new SNP arrays is being studied to accommodate the specific needs of this species in the context of very large scale genotyping projects (i.e. breed characterization at an international scale and genomic selection) and parentage analysis. As in other species, the focus has now turned to the identification of causative mutations underlying the phenotypic variation of traits. In addition, since 2014, the ADAPTmap project (www.goatadaptmap.org) has gathered data to explore the diversity of caprine populations at a worldwide scale by using a wide variety of approaches and data.
Reference genome assemblies provide important context in genetics by standardizing the order of genes and providing a universal set of coordinates for individual nucleotides. Because of the high complexity of genic regions and the higher copy number of genes involved in immune function, immunity-related genes are often misassembled in current reference assemblies. This problem is particularly common in the reference genomes of non-model organisms, as they often do not receive the years of curation necessary to resolve annotation and assembly errors. In this study, we reassemble a reference genome of the goat (Capra hircus) using modern PacBio technology in tandem with BioNano Genomics Irys optical maps and Lachesis clustering in order to provide a high quality reference assembly without the need for extensive filtering. Initial PacBio assemblies using P5C4 chemistry achieved contig N50s of 4 megabases and a BUSCO completion score of 84.0%, which is comparable to several finished model organism reference assemblies. We used BioNano Genomics’ Irys platform to generate 336 scaffolds from this data with a scaffold N50 of 24 megabases and total genome coverage of 98%. Lachesis interaction maps were used with a clustering algorithm to associate Irys scaffolds into the expected 30 chromosome physical maps. Comparisons of the initial hybrid scaffolds generated from the long read contigs and optical map information to a previously generated RH map revealed that the entirety of the goat autosome 20 physical map was contained within one scaffold. Additionally, the BioNano scaffolding resolved several difficult regions that contained genes related to innate immunity, which were problem regions in previous reference genome assemblies.
From Sequencing to Chromosomes: New de novo assembly and scaffolding methods improve the goat reference genome
Single-molecule sequencing is now routinely used to assemble complete, high-quality microbial genomes, but these assembly methods have not scaled well to large genomes. To address this problem, we previously introduced the MinHash Alignment Process (MHAP) for overlapping single-molecule reads using probabilistic, locality-sensitive hashing. Integrating MHAP with Celera Assembler (CA) has enabled reference-grade assemblies of model organisms, revealing novel heterochromatic sequences and filling low-complexity gap sequences in the GRCh38 human reference genome. We have applied our methods to assemble the San Clemente goat genome. Combining single-molecule sequencing from Pacific Biosciences and BioNano Genomics generates an assembly that is over 150-fold more contiguous than the latest Capra hircus reference. In combination with Hi-C sequencing, the assembly surpasses reference assemblies, de novo, with minimal manual intervention. The autosomes are each assembled into a single scaffold. Our assembly provides a more complete gene reconstruction, better alignments with the Goat 52k chip, and improved allosome reconstruction. In addition to providing increased continuity of sequence, our assembly achieves a higher BUSCO completion score (84%) than the existing goat reference assembly, suggesting better quality annotation of gene models. Our results demonstrate that single-molecule sequencing can produce near-complete eukaryotic genomes at modest cost and minimal manual effort.
Reference quality de novo genome assemblies were once solely the domain of large, well-funded genome projects. While next-generation short read technology removed some of the cost barriers, accurate chromosome-scale assembly remains a real challenge. Here we present efforts to de novo assemble the goat (Capra hircus) genome. Through the combination of single-molecule technologies from Pacific Biosciences (sequencing) and BioNano Genomics (optical mapping) coupled with high-throughput chromosome conformation capture sequencing (Hi-C), an inbred San Clemente goat genome has been sequenced and assembled to a high degree of completeness at a relatively modest cost. Starting with 38 million PacBio reads, we integrated the MinHash Alignment Process (MHAP) with the Celera Assembler (CA) to produce an assembly composed of 3110 contigs with a contig N50 size of 4.7 Mb. This assembly was scaffolded with BioNano genome maps derived from a single IrysChip into 333 scaffolds with an N50 of 23.1 Mb including the complete scaffolding of chromosome 20. Finally, cis-chromosome associations were determined by Hi-C, yielding complete reconstruction of all autosomes into single scaffolds with a final N50 of 91.7 Mb. We hope to demonstrate that our methods are not only cost effective, but improve our ability to annotate challenging genomic regions such as highly repetitive immune gene clusters.
Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all 55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g. to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mis-tracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license at https://github.com/marbl/mash
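The sketching idea described above can be illustrated with a small Python example. It builds a bottom-s MinHash sketch from the k-mers of each sequence, estimates the Jaccard index j from the two sketches, and converts it to a Mash-style distance D = -(1/k) ln(2j / (1 + j)). This is a simplified sketch under stated assumptions, not the Mash implementation: it hashes forward-strand k-mers only (no canonical k-mers), uses blake2b from the Python standard library in place of Mash's own hash function, and the toy sequences and parameter values are made up.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of all k-mers in seq (forward strand only, a simplification)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is an assumption here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted(hash64(km) for km in kmers(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many of
    them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate j and k-mer size k.

    By convention in this sketch, j == 0 maps to the maximum distance 1.
    """
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    k, s = 5, 16  # toy parameters; real analyses use larger values
    seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGGTTCAGT"
    seq_b = "ACGTTGCATGCCGATAGCTTGGCTAACGGTTCAGT"  # one substitution
    j = jaccard_estimate(sketch(seq_a, k, s), sketch(seq_b, k, s), s)
    print(f"Jaccard estimate: {j:.3f}, Mash-style distance: {mash_distance(j, k):.4f}")
```

Because only the fixed-size sketches are compared, the cost of each pairwise comparison depends on s and not on genome length, which is what makes clustering tens of thousands of genomes tractable.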
A high-quality genome assembly of SMRT Sequences reveals long-range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, chikungunya, and other diseases. The outbreak of Zika in the Americas, which can cause microcephaly in the fetus of infected women, adds urgency to the need for a high-quality reference genome in order to better understand the organism’s biology and its role in transmitting human disease. We describe the first diploid assembly of an insect genome, using SMRT sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
A high-quality genome assembly of SMRT sequences reveals long range haplotype structure in the diploid mosquito Aedes aegypti
Aedes aegypti is a tropical and subtropical mosquito vector for Zika, yellow fever, dengue fever, and chikungunya. We describe the first diploid assembly of an insect genome, using SMRT Sequencing and the open-source assembler FALCON-Unzip. This assembly has high contiguity (contig N50 1.3 Mb), is more complete than previous assemblies (Length 1.45 Gb with 87% BUSCO genes complete), and is high quality (mean base >QV30 after polishing). Long-range haplotype structure, in some cases encompassing more than 4 Mb of extremely divergent homologous sequence with dramatic differences in coding sequence content, is resolved using a combination of the FALCON-Unzip assembler, genome annotation, coverage depth, and pairwise nucleotide alignments.
FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)
Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. Studies of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach (FALCON-Unzip) does not require parental information and separates PacBio reads, during genome assembly, using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to a human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual we scaffolded the FALCON-Phase contigs, and re-phased the contigs creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
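The trio-based read partitioning described in this abstract (binning the child's long reads using parent-specific sequences) can be illustrated with a toy Python sketch. It is a deliberately simplified illustration and not TrioCanu's actual implementation: parental k-mers are taken from short made-up sequences instead of Illumina read sets, there is no error filtering or normalization, and every name and threshold is hypothetical.

```python
def kmers(seq, k):
    """Set of all k-mers in seq (forward strand only, a simplification)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seq, paternal_seq, k):
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat = kmers(maternal_seq, k)
    pat = kmers(paternal_seq, k)
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a read to the parent whose specific k-mers it matches more often."""
    read_kmers = kmers(read, k)
    mat_hits = len(read_kmers & mat_only)
    pat_hits = len(read_kmers & pat_only)
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


if __name__ == "__main__":
    k = 5
    maternal = "AAACCCGGGTTT"  # toy haplotypes differing at one site
    paternal = "AAACCCGAGTTT"
    mat_only, pat_only = parent_specific_kmers(maternal, paternal, k)
    for read in ("CCCGGGT", "CCCGAGT", "AAACCC"):
        print(read, "->", bin_read(read, mat_only, pat_only, k))
```

Reads that fall entirely in regions where the parents are identical carry no parent-specific k-mers and stay unassigned, which is one reason long reads help: a longer read is more likely to span at least one distinguishing site.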
HiFi sequencing on the PacBio Sequel II System enables complete microbial community profiling of complex metagenomic samples using whole genome shotgun sequences. With HiFi sequencing, highly accurate long reads overcome the challenges posed by the presence of intergenic and extragenic repeat elements in microbial genomes, thus greatly improving phylogenetic profiling and sequence assembly. Recent improvements in library construction protocols enable HiFi sequencing starting from as little as 5 ng of input DNA. Here, we demonstrate comparative analyses of a control sample of known composition and a human fecal sample from varying amounts of input genomic DNA (1 µg, 200 ng, 5 ng), and present the corresponding library preparation workflows for standard, low input, and Ultra-Low methods. We demonstrate that the metagenome assembly, taxonomic assignment, and gene finding analyses are comparable across all methods for both samples, providing access to HiFi sequencing even for DNA-limited sample types.