Single Molecule, Real-Time (SMRT) Sequencing holds promise for addressing new frontiers in understanding molecular mechanisms in evolution and gaining insight into adaptive strategies. With read lengths exceeding 10 kb, we are able to sequence high-quality, closed microbial genomes with associated plasmids, and investigate large genome complexities, such as long, highly repetitive, low-complexity regions and multiple tandem-duplication events. Improved genome quality, observed at 99.9999% (QV60) consensus accuracy, and significant reduction of gap regions in reference genomes (up to and beyond 50%) allow researchers to better understand coding sequences with high confidence, investigate potential regulatory mechanisms in noncoding regions, and make inferences about evolutionary strategies that are otherwise missed because of the coverage biases associated with short-read sequencing technologies. Additional benefits afforded by SMRT Sequencing include the simultaneous capability to detect epigenomic modifications and to obtain full-length cDNA transcripts, which removes the need for assembly. Direct sequencing of DNA in real time has resulted in the identification of numerous base modifications and motifs, which genome-wide profiles have linked to specific methyltransferase activities. Our new offering, the Iso-Seq Application, allows for the accurate differentiation between transcript isoforms that are difficult to resolve with short-read technologies. PacBio reads easily span transcripts such that both 5’/3’ primers for cDNA library generation and the poly-A tail are observed. As such, exon configuration and intron retention events can be analyzed without ambiguity. This technological advance is useful for characterizing transcript diversity and improving gene structure annotations in reference genomes. We review solutions available with SMRT Sequencing, from targeted sequencing efforts to obtaining reference genomes (>100 Mb).
This includes strategies for identifying microsatellites and conducting phylogenetic comparisons with targeted gene families. We highlight how best to leverage our long reads, which have exceeded 20 kb in length, for research investigations, as well as currently available bioinformatics strategies for analysis. The benefits for these applications are further realized with consistent size selection of the input sample using the BluePippin™ device from Sage Science, as demonstrated in our genome improvement projects. Using the latest P5-C3 chemistry on model organisms, these efforts have yielded an observed contig N50 of ~6 Mb, with the longest contig exceeding 12.5 Mb and an average base quality of QV50.
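The two assembly metrics quoted above, Phred-scaled quality values (QV) and contig N50, can be made concrete with a short sketch. The function names and the toy contig lengths below are illustrative assumptions, not data from the projects described; only the standard definitions (QV = -10 log10 of the error probability, and N50 as the contig length at which the cumulative sum of length-sorted contigs reaches half the assembly size) are relied on.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (strictly below 1.0) to a Phred-scaled QV."""
    return -10.0 * math.log10(1.0 - accuracy)


def n50(contig_lengths):
    """Return the contig N50: walking contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    half of the total assembly size."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig")


# Hypothetical toy assembly (lengths in bp), not taken from the abstract.
toy_contigs = [8_000_000, 6_000_000, 4_000_000, 3_000_000, 2_000_000, 1_000_000]

if __name__ == "__main__":
    print(f"QV60 accuracy: {qv_to_accuracy(60):.6%}")
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    print(f"Toy N50: {n50(toy_contigs)} bp")
```

Under these definitions QV60 corresponds to one expected error per million bases, which is the 99.9999% figure given in the abstract, and QV50 to one per hundred thousand.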
PacBio bioinformatician Elizabeth Tseng reviews bioinformatics strategies that use PacBio long-read data for isoform sequencing, which yields full-length transcript sequences without assembly.
The Human Leukocyte Antigen (HLA) genes located on chromosome 6 are responsible for regulating immune function via antigen presentation and are one of the determining factors for stem cell and organ transplantation compatibility. Additionally, various alleles within this region have been implicated in autoimmune disorders, cancer, vaccine response and both non-infectious and infectious disease risk. The HLA region is highly variable, contains repetitive regions, and includes co-dominantly expressed genes. This complicates short-read mapping and means that assessing the effect of variation within a gene requires full phase information to resolve haplotypes. One solution to the problem of HLA identification is the use of statistical inference to suggest the most likely diploid alleles given the genotypes observed. This approach assumes the availability of an extensive reference panel. Whilst there exists good population genetics data for imputing European populations, there remains a paucity of information about variation in African populations. Filling this gap is one of the aims of the Genome Diversity in Africa Project, and as a first step we are performing a pilot study to identify the optimal method for determining HLA type information for large numbers of samples from African populations. To that end we have obtained samples from 125 consented African participants selected from 5 populations across Africa (Moroccan, Ashanti, Igbo, Kalenjin, and Zulu). The methods included in our pilot study are Sanger sequencing (ABI); NGS on the HiSeq X Ten platform (Illumina); long-range PCR combined with single molecule real-time (SMRT) sequencing (PacBio); and, for a subset of samples, library preparation on the GemCode Platform (10x Genomics), which delivers valuable long-range contextual information, combined with Illumina NGS sequencing. Results from capillary sequencing suggest the presence of a minimum of two novel alleles.
Long-range PCR has been performed initially on a subset of samples using both primers sourced from GenDX and primers designed as described in Shiina et al. (2012). Initial results from both primer sets were promising on Promega DNA test samples, but only the GenDX primers proved effective on the African samples, consistently producing PCR products of the expected size in the Igbo, Ashanti, Moroccan and Zulu samples. We will present early results from our evaluation of the different sequencing technologies.
Highly accurate read mapping of third generation sequencing reads for improved structural variation analysis
Characterizing genomic structural variations (SVs) is vital for understanding how genomes evolve. Furthermore, SVs are known to play a role in a wide range of diseases including cancer, autism, and schizophrenia. Nevertheless, due to their complexity they remain harder to detect and less well understood than single nucleotide variations. Recently, third-generation sequencing has proven to be an invaluable tool for detecting SVs. The markedly higher read length not only allows single reads to span an SV, it also enables reliable mapping to repetitive regions of the genome. These regions often contain SVs and are inaccessible to short-read mapping. However, current sequencing technologies like PacBio show a raw read error rate of 10% or more, consisting mostly of insertions and deletions. Especially in repetitive regions, the high error rate causes current mapping methods to fail to find exact borders for SVs, to split up large deletions and insertions into several small ones, or in some cases, such as inversions, to fail to report them at all. Furthermore, for complex SVs it is not possible to find one end-to-end alignment for a given read. The decision of when to split a read into two or more separate alignments without knowledge of the underlying SV poses an even bigger challenge to current read mappers. Here we present NextGenMap-LR for long single molecule PacBio reads, which addresses these issues. NextGenMap-LR uses a fast k-mer search to quickly find anchor regions between parts of a read and the reference and evaluates them using a vectorized implementation of the Smith-Waterman (SW) algorithm. The resulting high-quality anchors are then used to determine whether a read spans an SV and has to be split or can be aligned contiguously. Finally, NextGenMap-LR uses a banded SW algorithm to compute the final alignment(s).
In this last step, to account for both the sequencing error and real genomic variations, we employ a non-affine gap model that penalizes gap extensions for longer gaps less than for shorter ones. Based on simulated as well as verified human breast cancer SV data we show how our approach significantly improves mapping of long reads around SVs. The non-affine gap model is especially effective at more precisely identifying the position of the breakpoint, and the enhanced scoring scheme enables subsequent variation callers to identify SVs that would have been missed otherwise.
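The idea of a non-affine gap model can be illustrated with a small sketch. The abstract does not state the functional form or the parameters used by NextGenMap-LR, so everything below is an assumption made for illustration: a per-base extension penalty that decays linearly with gap length down to a floor, compared against a conventional affine penalty. The function names and all numeric defaults are hypothetical.

```python
def affine_gap_cost(length: int, gap_open: float = 5.0,
                    gap_extend: float = 4.0) -> float:
    """Conventional affine model: every gap base costs the same."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def decaying_gap_cost(length: int, gap_open: float = 5.0,
                      extend_max: float = 4.0, extend_min: float = 1.0,
                      decay: float = 0.15) -> float:
    """Illustrative non-affine model (an assumption, not the published one):
    the i-th base of a gap costs extend_max - decay * i, but never less
    than extend_min, so long gaps are cheaper per base than short ones."""
    if length <= 0:
        return 0.0
    cost = gap_open
    for i in range(length):
        cost += max(extend_min, extend_max - decay * i)
    return cost


if __name__ == "__main__":
    # Short gaps (typical of sequencing error) cost about the same in both
    # models; long gaps (typical of real SVs) are far cheaper when the
    # extension penalty decays.
    for gap_length in (1, 5, 25, 100):
        print(gap_length,
              affine_gap_cost(gap_length),
              round(decaying_gap_cost(gap_length), 2))
```

With a scheme of this kind, short indels from sequencing error are still penalized at close to the full rate, while a single long gap that spans a real insertion or deletion stays affordable for the aligner.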
A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set.
In addition to the BAC-based reference sequence of the accession Columbia-0 from the year 2000, several short-read assemblies of the plant model organism Arabidopsis thaliana have been published in recent years. Also, a SMRT-based assembly of Landsberg erecta has been generated that identified translocation and inversion polymorphisms between two genotypes of the species. Here we provide a chromosome-arm-level assembly of the A. thaliana accession Niederzenz-1 (AthNd-1_v2c) based on SMRT sequencing data. The best assembly comprises 69 nucleome sequences and displays a contig length of up to 16 Mbp. Compared to an earlier Illumina short-read-based NGS assembly (AthNd-1_v1), a 75-fold increase in contiguity was observed for AthNd-1_v2c. To assign contig locations independently of the Col-0 gold standard reference sequence, we used genetic anchoring to generate a de novo assembly. In addition, we assembled the chondrome and plastome sequences. Detailed analyses of AthNd-1_v2c allowed reliable identification of large genomic rearrangements between A. thaliana accessions that contribute to differences in the gene sets distinguishing the genotypes. One of the differences detected identified a gene that is lacking from the Col-0 gold standard sequence. This de novo assembly extends the known proportion of the A. thaliana pan-genome.
In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity. Copyright © 2018 Elsevier Inc. All rights reserved.
WGS of 1058 Enterococcus faecium from Copenhagen, Denmark, reveals rapid clonal expansion of vancomycin-resistant clone ST80 combined with widespread dissemination of a vanA-containing plasmid and acquisition of a heterogeneous accessory genome.
From 2012 to 2015, a sudden significant increase in vancomycin-resistant (vanA) Enterococcus faecium (VREfm) was observed in the Capital Region of Denmark. Clonal relatedness of VREfm and vancomycin-susceptible E. faecium (VSEfm) was investigated, transmission events between hospitals were identified and the pan-genome and plasmids from the largest VREfm clonal group were characterized. WGS of 1058 E. faecium isolates was carried out on the Illumina platform to perform SNP analysis and to identify the pan-genome. One isolate was also sequenced on the PacBio platform to close the genome. Epidemiological data were collected from laboratory information systems. Phylogeny of 892 VREfm and 166 VSEfm revealed a polyclonal structure, with a single clonal group (ST80) accounting for 40% of the VREfm isolates. VREfm and VSEfm co-occurred within many clonal groups; however, no VSEfm were related to the dominant VREfm group. A similar vanA plasmid was identified in ≥99% of isolates belonging to the dominant group and 69% of the remaining VREfm. Ten plasmids were identified in the completed genome, and ~29% of this genome consisted of dispensable accessory genes. The size of the pan-genome among isolates in the dominant group was 5905 genes. Most probably, VREfm emerged owing to importation of a successful VREfm clone which rapidly transmitted to the majority of hospitals in the region whilst simultaneously disseminating a vanA plasmid to pre-existing VSEfm. Acquisition of a heterogeneous accessory genome may account for the success of this clone by facilitating adaptation to new environmental challenges. © The Author(s) 2019. Published by Oxford University Press on behalf of the British Society for Antimicrobial Chemotherapy. All rights reserved.
Normalization of cDNA is widely used to improve the coverage of rare transcripts in analysis of transcriptomes employing next-generation sequencing. Recently, long-read technology has been emerging as a powerful tool for sequencing and construction of transcriptomes, especially for complex genomes containing highly similar transcripts and spliced transcript isoforms. Here, we analyzed the transcriptome of sugarcane, which has a highly polyploid plant genome, by PacBio isoform sequencing (Iso-Seq) of two different cDNA library preparations, with and without a normalization step. The results demonstrated that, while the two libraries included many of the same transcripts, many longer transcripts were removed and many new, generally shorter transcripts were detected by normalization. For the same input cDNA and the same data yield, the normalized library recovered more total transcript isoforms, predicted gene families and orthologous groups, resulting in a higher representation of the sugarcane transcriptome compared to the non-normalized library. The non-normalized library, on the other hand, included a wider transcript length range with more long transcripts above ~1.25 kb, more transcript isoforms per gene family and more gene ontology terms per transcript. A large proportion of the unique transcripts, comprising ~52% of the normalized library, were expressed at a lower level than the unique transcripts from the non-normalized library across the three tissue types tested (leaf, stalk and root). About 83% of the total 5,348 predicted long noncoding transcripts were derived from the normalized library, of which ~80% were derived from the lowly expressed fraction. Functional annotation of the unique transcripts suggested that each library enriched different functional transcript fractions. This demonstrated the complementarity of the two approaches in obtaining a complete transcriptome of a complex genome at the sequencing depth used in this study.