Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods which provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA Sequencing with short reads (SMRT CCS (circular consensus) or second-generation reads), wherein the short reads are used to error-correct the long reads which are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which SMRT sequencing reads from a single long insert library are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT Sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. With relatively short sequencing run times and automated analysis pipelines, it is possible to go from an unknown DNA sample to its complete de novo genome and epigenome in about a day.
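The correspondence between QV50 and >99.999% accuracy quoted above follows from the standard Phred scale, in which the error probability is 10^(-QV/10). A minimal sketch of that arithmetic follows; the function names and the 5 Mb genome size in the usage note are illustrative assumptions, not values from the abstract.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Phred-scaled quality value implied by a per-base accuracy below 1."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(qv: float, genome_length: int) -> float:
    """Expected number of erroneous bases in a consensus of the given length."""
    return genome_length * 10.0 ** (-qv / 10.0)
```

For example, `qv_to_accuracy(50)` gives 0.99999, and for a hypothetical 5 Mb bacterial genome `expected_errors(50, 5_000_000)` corresponds to roughly 50 erroneous bases across the whole assembly.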
Understanding the genetic basis of infectious diseases is critical to enacting effective treatments, and several large-scale sequencing initiatives are underway to collect this information. Sequencing bacterial samples is typically performed by mapping sequence reads against genomes of known reference strains. While such resequencing informs on the spectrum of single-nucleotide differences relative to the chosen reference, it can miss numerous other forms of variation known to influence pathogenicity: structural variations (duplications, inversions), acquisition of mobile elements (phages, plasmids), homonucleotide length variation causing phase variation, and epigenetic marks (methylation, phosphorothioation) that influence gene expression to switch bacteria from non-pathogenic to pathogenic states. Therefore, sequencing methods which provide complete, de novo genome assemblies and epigenomes are necessary to fully characterize infectious disease agents in an unbiased, hypothesis-free manner. Hybrid assembly methods have been described that combine long sequence reads from SMRT DNA sequencing with short, high-accuracy reads (SMRT circular consensus sequencing (CCS) or second-generation reads) to generate long, highly accurate reads that are then used for assembly. We have developed a new paradigm for microbial de novo assemblies in which long SMRT sequencing reads (average read lengths >5,000 bases) are used exclusively to close the genome through a hierarchical genome assembly process, thereby obviating the need for a second sample preparation, sequencing run, and data set. We have applied this method to achieve closed de novo genomes with accuracies exceeding QV50 (>99.999%) for numerous disease outbreak samples, including E. coli, Salmonella, Campylobacter, Listeria, Neisseria, and H. pylori. The kinetic information from the same SMRT sequencing reads is utilized to determine epigenomes.
Approximately 70% of all methyltransferase specificities we have determined to date represent previously unknown bacterial epigenetic signatures. The process has been automated and requires less than 1 day from an unknown DNA sample to its complete de novo genome and epigenome.
Second-generation sequencing has brought about tremendous insights into the genetic underpinnings of biology. However, there are many functionally important and medically relevant regions of genomes that are currently difficult or impossible to sequence, resulting in incomplete and fragmented views of genomes. Two main causes are (i) limitations in reading DNA of extreme sequence content (GC-rich or AT-rich regions, low-complexity sequence contexts) and (ii) insufficient read lengths, which leave various forms of structural variation unresolved and result in mapping ambiguities.
Generating de novo reference genome assemblies for non-model organisms is a laborious task that often requires a large amount of data from several sequencing platforms and cytogenetic surveys. By using PacBio sequence data and new library creation techniques, we present a de novo, high-quality reference assembly for the goat (Capra hircus) that demonstrates a primarily sequencing-based approach to efficiently create new reference assemblies for eukaryotic species. This goat reference genome was created using 38 million PacBio P5-C3 reads generated from a San Clemente goat using the Celera Assembler PBcR pipeline with PacBio read self-correction. To generate the assembly, corrected and filtered reads were pre-assembled into a consensus model using PBDAGCON, and subsequently assembled using the Celera Assembler version 8.2. We generated 5,902 contigs using this method with a contig N50 size of 2.56 megabases. To generate chromosome-sized scaffolds, we used the LACHESIS scaffolding method to identify cis-chromosome Hi-C interactions and link contigs together. We then compared our new assembly to the existing goat reference assembly to identify large-scale discrepancies. In our comparison, we identified 247 disagreements between the two assemblies consisting of 123 inversions and 124 chromosome-contig relocations. The high quality of these data illustrates how this methodology can be used to efficiently generate new reference genome assemblies without the use of expensive fluorescent cytometry or large quantities of data from multiple sequencing platforms.
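The contig N50 reported above is the contig length at which, summing contigs from longest to shortest, at least half of the total assembled bases have been covered. A minimal sketch of that calculation is shown here; the function name and the toy contig lengths in the usage note are illustrative and are not taken from the goat assembly.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the sum")
```

With toy contigs of lengths 2 through 10, `contig_n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8, because the three longest contigs (10 + 9 + 8 = 27) already cover half of the 54 total bases.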
The goat (Capra hircus) remains an important livestock species due to its ability to forage and provide milk, meat and wool in arid environments. The current goat reference assembly and annotation borrow heavily from other loosely related livestock species, such as cattle, and may not reflect the unique structural and functional characteristics of the species. We present preliminary data from a new de novo reference assembly for goat that primarily utilizes 38 million PacBio P5-C3 reads generated from an inbred San Clemente goat. This assembly consists of only 5,902 contigs with a contig N50 size of 2.56 megabases, which were grouped into scaffolds using cis-chromosome associations generated by the analysis of Hi-C sequence reads. To provide accurate functional genetic annotation, we utilized existing RNA-seq data and generated new data consisting of over 784 million reads from a combination of 27 different developmental timepoints/tissues. This dataset provides a tangible improvement over existing goat genomics resources by correcting over 247 misassemblies in the current goat reference genome and by annotating predicted gene models with actual expressed transcript data. Our goal is to provide a high-quality resource to researchers to enable future genomic selection and functional prediction within the field of goat genomics.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enable high-quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
2015 SMRT Informatics Developers Conference Presentation Slides: Shinichi Morishita of the University of Tokyo presented on how his team has been using SMRT Sequencing to better understand methylomes, metagenomes and structural variation of various eukaryotic genomes.
Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
The human reference sequence has provided a foundation for studies of genome structure, human variation, evolutionary biology, and disease. At the time the reference was originally completed there were some loci recalcitrant to closure; however, the degree to which structural variation and diversity affected our ability to produce a representative genome sequence at these loci was still unknown. Many of these regions in the genome are associated with large, repetitive sequences and exhibit complex allelic diversity such that producing a single, haploid representation is not possible. To overcome this challenge, we have sequenced DNA from two hydatidiform moles (CHM1 and CHM13), which are essentially haploid. CHM13 was sequenced with the latest PacBio technology (P6-C5) to 52X genome coverage and assembled using Daligner and Falcon v0.2 (GCA_000983455.1, CHM13_1.1). Compared to the first mole (CHM1) PacBio assembly (GCA_001007805.1, 54X) contig N50 of 4.5Mb, the contig N50 of CHM13_1.1 is almost 13Mb, and there is a 13-fold reduction in the number of contigs. This demonstrates the improved contiguity of sequence generated with the new chemistry. We annotated 50,188 RefSeq transcripts of which only 0.63% were split transcripts, and the repetitive and segmental duplication content was within the expected range. These data all indicate an extremely high-quality assembly. Additionally, we sequenced CHM13 DNA using Illumina SBS technology to 60X coverage, aligned these reads to the GRCh37, GRCh38, and CHM13_1.1 assemblies and performed variant calling using the SpeedSeq pipeline. The number of single nucleotide variants (SNV) and indels was comparable between GRCh37 and GRCh38. Regions that showed increased SNV density in GRCh38 compared to GRCh37 could be attributed to the addition of centromeric alpha satellite sequence to the reference assembly.
Alternatively, regions of decreased SNV density in GRCh38 were concentrated in regions that were improved from BAC-based sequencing of CHM1, such as 1p12 and 1q21 containing the SRGAP2 gene family. The alignment of PacBio reads to GRCh37 and GRCh38 assemblies allowed us to resolve complex loci such as the MHC region, where the best alignment was to the DBB (A2-B57-DR7) haplotype. Finally, we will discuss how combining the two high-quality mole assemblies can be used for benchmarking and novel bioinformatics tool development.
Highly accurate read mapping of third generation sequencing reads for improved structural variation analysis
Characterizing genomic structural variations (SV) is vital for understanding how genomes evolve. Furthermore, SVs are known for playing a role in a wide range of diseases including cancer, autism, and schizophrenia. Nevertheless, due to their complexity they remain harder to detect and less understood than single nucleotide variations. Recently, third-generation sequencing has proven to be an invaluable tool for detecting SVs. The markedly higher read length not only allows single reads to span an SV, it also enables reliable mapping to repetitive regions of the genome. These regions often contain SVs and are inaccessible to short-read mapping. However, current sequencing technologies like PacBio show a raw read error rate of 10% or more consisting mostly of insertions and deletions. Especially in repetitive regions, the high error rate causes current mapping methods to fail to find exact borders for SVs, to split up large deletions and insertions into several small ones, or in some cases, like inversions, to fail to report them at all. Furthermore, for complex SVs it is not possible to find one end-to-end alignment for a given read. The decision of when to split a read into two or more separate alignments without knowledge of the underlying SV poses an even bigger challenge to current read mappers. Here we present NextGenMap-LR for long single molecule PacBio reads, which addresses these issues. NextGenMap-LR uses a fast k-mer search to quickly find anchor regions between parts of a read and the reference and evaluates them using a vectorized implementation of the Smith-Waterman (SW) algorithm. The resulting high-quality anchors are then used to determine whether a read spans an SV and has to be split or can be aligned contiguously. Finally, NextGenMap-LR uses a banded SW algorithm to compute the final alignment(s).
In this last step, to account for both the sequencing error and real genomic variations, we employ a non-affine gap model that penalizes gap extensions for longer gaps less than for shorter ones. Based on simulated as well as verified human breast cancer SV data we show how our approach significantly improves mapping of long reads around SVs. The non-affine gap model is especially effective at more precisely identifying the position of the breakpoint, and the enhanced scoring scheme enables subsequent variation callers to identify SVs that would have been missed otherwise.
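To make the contrast with an affine gap model concrete, the sketch below compares a standard affine penalty with one possible non-affine penalty in which each additional gap base costs less than the previous one. The logarithmic form and the parameter values are assumptions chosen purely for illustration; the abstract does not specify the function NextGenMap-LR actually uses.

```python
import math


def affine_gap_cost(length: int, gap_open: float = 5.0, gap_extend: float = 1.0) -> float:
    """Affine model: every gap base adds the same extension penalty."""
    return gap_open + gap_extend * length


def concave_gap_cost(length: int, gap_open: float = 5.0, gap_extend: float = 1.0) -> float:
    """Illustrative non-affine model (an assumption, not the published scoring).

    The penalty grows with log2(1 + length), so the marginal cost of
    extending a gap shrinks as the gap gets longer.
    """
    return gap_open + gap_extend * math.log2(1 + length)


def marginal_cost(cost, length: int) -> float:
    """Extra penalty for growing a gap from `length` to `length + 1` bases."""
    return cost(length + 1) - cost(length)
```

Under these assumed parameters both models charge the same for a 1 bp gap, which is typical of sequencing error, but a 1,000 bp gap costs 1,005 under the affine model and about 15 under the logarithmic one, so a single long gap representing a real deletion is no longer outscored by many short gaps.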
Structural variants (SVs), which include deletions, insertions, duplications, inversions and chromosomal rearrangements, have been shown to affect organism phenotypes, including changing gene expression, increasing disease risk, and playing an important role in cancer development. Still, it remains challenging to detect all types of SVs from high-throughput sequencing data, and it is even harder to detect more complex SVs such as a duplication nested within an inversion. To overcome these challenges we developed algorithms for SV analysis using longer third-generation sequencing reads. The increased read lengths allow us to span more complex SVs and accurately assess SVs in repetitive regions, two of the major limitations when using short Illumina data. Our enhanced open-source analysis method Sniffles accurately detects structural variants based on split read mapping and assessment of the alignments. Sniffles uses a self-balancing interval tree in combination with a plane sweep algorithm to manage and assess the identified SVs. Central to its high accuracy is its advanced scoring model that can distinguish erroneous alignments from true breakpoints flanking SVs. In experiments with simulated and real genomes (e.g., human breast cancer), we find that Sniffles outperforms all other SV analysis approaches in both the sensitivity of finding events and the specificity of those events. Sniffles is available at: https://github.com/fritzsedlazeck/Sniffles
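As a rough illustration of the plane sweep idea mentioned above, the sketch below sorts candidate SV intervals by start coordinate and sweeps across them once, grouping intervals that overlap into clusters. It is a simplified stand-in that omits the interval tree and the scoring model, and it is not Sniffles's implementation; the function name and the `max_gap` parameter are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, start <= end


def cluster_overlapping(intervals: List[Interval], max_gap: int = 0) -> List[List[Interval]]:
    """Group intervals into clusters with a single left-to-right sweep.

    An interval joins the current cluster when its start lies within
    `max_gap` bases of the furthest end seen so far in that cluster.
    """
    clusters: List[List[Interval]] = []
    current: List[Interval] = []
    current_end = 0
    for start, end in sorted(intervals):
        if current and start <= current_end + max_gap:
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current = [(start, end)]
            current_end = end
    if current:
        clusters.append(current)
    return clusters
```

For instance, `cluster_overlapping([(100, 200), (150, 250), (1000, 1100), (240, 300)])` yields two clusters: the three chained intervals between 100 and 300, and the isolated interval at 1000 to 1100. The number of intervals in a cluster could then serve as a crude measure of read support for a candidate event.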
Goats are specialized in dairy, meat and fiber production, being adapted to a wide range of environmental conditions and having a large economic impact in developing countries. In recent years, there have been dramatic advances in the knowledge of the structure and diversity of the goat genome/transcriptome and in the development of genomic tools, rapidly narrowing the gap between goat and related species such as cattle and sheep. Major advances are: 1) publication of a de novo goat genome reference sequence; 2) development of whole-genome high-density RH maps; and 3) design of a commercial 50K SNP array. Moreover, there are currently several projects aiming at improving current genomic tools and resources. An improved assembly of the goat genome using PacBio reads is being produced, and the design of new SNP arrays is being studied to accommodate the specific needs of this species in the context of very large-scale genotyping projects (i.e., breed characterization at an international scale and genomic selection) and parentage analysis. As in other species, the focus has now turned to the identification of causative mutations underlying the phenotypic variation of traits. In addition, since 2014, the ADAPTmap project (www.goatadaptmap.org) has gathered data to explore the diversity of caprine populations at a worldwide scale by using a wide variety of approaches and data.
Highly contiguous de novo human genome assembly and long-range haplotype phasing using SMRT Sequencing
The long reads, random error, and unbiased sampling of SMRT Sequencing enable high-quality, de novo assembly of the human genome. PacBio long reads are capable of resolving genomic variations at all size scales, including SNPs, insertions, deletions, inversions, translocations, and repeat expansions, all of which are both important in understanding the genetic basis for human disease and difficult to access via other technologies. In demonstration of this, we report a new high-quality, diploid-aware de novo assembly of Craig Venter’s well-studied genome.
Over the past decades, neurological disorders have been extensively studied, producing a large number of candidate genomic regions and candidate genes. The SNPs identified in these studies rarely represent the true disease-related functional variants. However, more recently a shift in focus from SNPs to larger structural variants has yielded breakthroughs in our understanding of neurological disorders. Here we have developed candidate gene screening methods that combine enrichment of long DNA fragments with long-read sequencing that is optimized for structural variation discovery. We have also developed a novel, amplification-free enrichment technique using the CRISPR/Cas9 system to target genomic regions. We sequenced gDNA and full-length cDNA extracted from the temporal lobe of two Alzheimer’s patients for 35 GWAS candidate genes. The multi-kilobase long reads allowed for phasing across the genes and detection of a broad range of genomic variants, from SNPs to multi-kilobase insertions, deletions and inversions. In the full-length cDNA data we detected differential allelic isoform complexity, novel exons as well as transcript isoforms. Combining the gDNA data with full-length isoform characterization allows us to build a more comprehensive view of the underlying biological disease mechanisms in Alzheimer’s disease. Using the novel PCR-free CRISPR-Cas9 enrichment method, we screened several genes including C9ORF72, whose hexanucleotide repeat expansion is associated with 40% of familial ALS cases. This method excludes any PCR bias or errors from an otherwise hard-to-amplify region and preserves base modifications in a single-molecule fashion, which makes it possible to capture mosaicism present in the sample.
Fast and effective variant calling algorithms have been crucial to the successful application of DNA sequencing in human genetics. In particular, joint calling – in which reads from multiple individuals are pooled to increase power for shared variants – is an important tool for population surveys of variation. Joint calling was applied by the 1000 Genomes Project to identify variants across many individuals each sequenced to low coverage (about 5-fold). This approach successfully found common small variants, but broadly missed structural variants and large indels for which short-read sequencing has limited sensitivity. To support use of large variants in rare disease and common trait association studies, it is necessary to perform population-scale surveys with a technology effective at detecting indels and structural variants, such as PacBio SMRT Sequencing. For these studies, it is important to have a joint calling workflow that works with PacBio reads. We have developed pbsv, an indel and structural variant caller for PacBio reads, that provides a two-step joint calling workflow similar to that used to build the ExAC database. The first stage, discovery, is performed separately for each sample and consolidates whole genome alignments into a sparse representation of potentially variant loci. The second stage, calling, is performed on all samples together and considers only the signatures identified in the discovery stage. We applied the pbsv joint calling workflow to PacBio reads from twenty human genomes, with coverage ranging from 5-fold to 80-fold per sample for a total of 460-fold. The analysis required only 102 CPU hours, and identified over 800,000 indels and structural variants, including hundreds of inversions and translocations, many times more than discovered with short-read sequencing. The workflow is scalable to thousands of samples. 
The ongoing application of this workflow to thousands of samples will provide insight into the evolution and functional importance of large variants in human evolution and disease.
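The two-stage structure described above can be illustrated with a toy sketch: a per-sample discovery step that reduces read-level evidence to counts per candidate locus, followed by a joint calling step that pools those counts across samples. Everything here is a hypothetical simplification for illustration (fixed-width position bins, a single pooled support threshold, and the names `discover` and `call`); it does not reflect pbsv's actual data structures, interface, or statistics.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (chromosome, binned position, variant type); a hypothetical signature format
Signature = Tuple[str, int, str]


def discover(read_evidence: Iterable[Tuple[str, int, str]], bin_size: int = 100) -> Counter:
    """Stage 1, run per sample: collapse per-read evidence into sparse locus counts."""
    counts: Counter = Counter()
    for chrom, pos, svtype in read_evidence:
        # Crude binning so nearby breakpoints from noisy reads share a signature.
        counts[(chrom, pos // bin_size * bin_size, svtype)] += 1
    return counts


def call(per_sample: Dict[str, Counter], min_total_support: int = 5) -> Dict[Signature, Dict[str, int]]:
    """Stage 2, run on all samples: keep loci whose pooled support is sufficient."""
    pooled: Counter = Counter()
    for counts in per_sample.values():
        pooled.update(counts)
    calls: Dict[Signature, Dict[str, int]] = {}
    for signature, total in pooled.items():
        if total >= min_total_support:
            # Report how many supporting reads each sample contributed.
            calls[signature] = {
                sample: counts[signature]
                for sample, counts in per_sample.items()
                if counts[signature] > 0
            }
    return calls
```

In this toy setting, a deletion supported by only one or two reads in each of three low-coverage samples would fail a per-sample threshold of five reads but passes once the evidence is pooled, which is the gain in power for shared variants that joint calling is meant to provide.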
Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we sequenced a diversity panel of nine human genomes to high depth using long-read, single-molecule, real-time sequencing. Systematically identifying and merging SVs ≥50 bp in length for these nine and one public genome yielded 83,909 sequence-resolved insertions, deletions, and inversions. Among these, 2,839 (2.0 Mbp) are shared among all discovery genomes with an additional 13,349 (6.9 Mbp) present in the majority of humans, indicating minor alleles or errors in the reference, which is partially explained by an enrichment for GC-content and repetitive DNA. Genotyping 83% of these in 290 additional genomes confirms that at least one allele of the most common SVs in unique euchromatin is now sequence-resolved. We observe a 9-fold increase within 5 Mbp of chromosome telomeric ends and correlation with de novo single-nucleotide variant mutations, showing that such variation is nonrandomly distributed, defining potential hotspots of mutation. We identify SVs affecting coding and noncoding regulatory loci, improving annotation and interpretation of functional variation. To illustrate the utility of sequence-resolved SVs in resequencing experiments, we mapped 30 diverse high-coverage Illumina-sequenced samples to GRCh38 with and without contigs containing SV insertions as alternate sequences, and we found these additional sequences recover 6.4% of unmapped reads. For reads mapped within the SV insertion, 25.7% have a better mapping quality, and 18.7% improved by 1,000-fold or more.
We reveal 72,964 occurrences of 15,814 unique variants that were not discoverable with the reference sequence alone, and we note that 7% of the insertions contain an SV in at least one sample indicating that there are additional alleles in the population that remain to be discovered. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. We present a summary of our findings and discuss ideas for revealing variation that was once difficult to ascertain.