June 1, 2021

Making the most of long reads: towards efficient assemblers for reference quality, de novo reconstructions

2015 SMRT Informatics Developers Conference Presentation Slides: Gene Myers, Ph.D., Founding Director, Systems Biology Center, Max Planck Institute delivered the keynote presentation. He talked about building efficient assemblers, the importance of random error distribution in sequencing data, and resolving tricky repeats with very long reads. He also encouraged developers to release assembly modules openly, and noted that data should be straightforward to parse since sharing data interfaces is easier than sharing software interfaces.


June 1, 2021

High-quality de novo genome assembly and intra-individual mitochondrial instability in the critically endangered kakapo

The kakapo (Strigops habroptila) is a large, flightless parrot endemic to New Zealand. It is highly endangered with only ~150 individuals remaining, and intensive conservation efforts are underway to save this iconic species from extinction. These include genetic studies to understand critical genes relevant to fertility, adaptation and disease resistance, and genetic diversity across the remaining population for future breeding program decisions. To aid with these efforts, we have generated a high-quality de novo genome assembly using PacBio long-read sequencing. Using the new diploid-aware FALCON-Unzip assembler, the resulting genome of 1.06 Gb has a contig N50 of 5.6 Mb (largest contig 29.3 Mb), >350-times more contiguous compared to a recent short-read assembly of a closely related parrot (kea) species. We highlight the benefits of the higher contiguity and greater completeness of the kakapo genome assembly through examples of fully resolved genes important in wildlife conservation (contrasted with fragmented and incomplete gene resolution in short-read assemblies), in some cases even providing sequence for regions orthologous to gaps of missing sequence in the chicken reference genome. We also highlight the complete resolution of the kakapo mitochondrial genome, fully containing the mitochondrial control region which is missing from the previous dedicated kakapo mitochondrial genome NCBI entry. For this region, we observed a marked heterogeneity in the number of tandem repeats in different mtDNA molecules from a single bird tissue, highlighting the enhanced molecular resolution uniquely afforded by long-read, single-molecule PacBio sequencing.
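
For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions, not part of the kakapo study.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Toy example (hypothetical contig lengths, in arbitrary units).
toy_contigs = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_contigs)
```

With the toy lengths above, the total is 20, and the two longest contigs (6 and 5) already cover more than half of it, so the N50 is 5.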


June 1, 2021

Single cell isoform sequencing (scIso-Seq) identifies novel full-length mRNAs and cell type-specific expression

Single cell RNA-seq (scRNA-seq) is an emerging field for characterizing cell heterogeneity in complex tissues. However, most scRNA-seq methodologies are limited to gene count information due to short read lengths. Here, we combine the microfluidics scRNA-seq technique, Drop-Seq, with PacBio Single Molecule, Real-Time (SMRT) Sequencing to generate full-length transcript isoforms that can be confidently assigned to individual cells. We generated single cell Iso-Seq (scIso-Seq) libraries for chimp and human cerebral organoid samples on the Dolomite Nadia platform and sequenced each library with two SMRT Cells 8M on the PacBio Sequel II System. We developed a bioinformatics pipeline to identify, classify, and filter full-length isoforms at the single-cell level. We show that scIso-Seq reveals full-length isoform information not accessible using short reads that can reveal differences between cell types and amongst different species.


June 1, 2021

TLA & long-read sequencing: Efficient targeted sequencing and phasing of the CFTR gene

Background: The sequencing and haplotype phasing of entire gene sequences improves the understanding of the genetic basis of disease and drug response. One example is cystic fibrosis (CF). Cystic fibrosis transmembrane conductance regulator (CFTR) modulator therapies have revolutionized CF treatment, but only in a minority of CF subjects. Observed heterogeneity in CFTR modulator efficacy is related to the range of CFTR mutations; revertant mutations can modify the response to CFTR modulators, and other intronic variations in the ~200 kb CFTR gene have been linked to disease severity. Heterogeneity in the CFTR gene may also be linked to differential responses to CFTR modulators. The Targeted Locus Amplification (TLA) technology from Cergentis can be used to selectively amplify, sequence and phase the entire CFTR gene. With PacBio long-read SMRT Sequencing, TLA amplicons are sequenced intact and long-range phasing information of all fragments in entire amplicons is retrieved. Experimental Design and Methods: The TLA process produces amplicons consisting of 5-10 proximity ligated DNA fragments. TLA was performed on cell line and genomic DNA from Coriell GM12878, which has few heterozygous SNVs in CFTR, and the IB3 cell line, with known haplotypes but heterozygous for the delta508 mutation. All sample types were prepared with high and low density TLA primer sets, targeting coverage of >100 kb of the CFTR gene. Conclusion: We have demonstrated the power and utility of TLA with long-read SMRT Sequencing as a valuable research tool in sequencing and phasing across very long regions of the human genome. This process can be done in an efficient manner, multiplexing multiple genes and samples per SMRT Cell in a process amenable to high-throughput sequencing.


April 21, 2020

Characterization of Reference Materials for Genetic Testing of CYP2D6 Alleles: A GeT-RM Collaborative Project.

Pharmacogenetic testing increasingly is available from clinical and research laboratories. However, only a limited number of quality control and other reference materials currently are available for the complex rearrangements and rare variants that occur in the CYP2D6 gene. To address this need, the Division of Laboratory Systems, CDC-based Genetic Testing Reference Material Coordination Program, in collaboration with members of the pharmacogenetic testing and research communities and the Coriell Cell Repositories (Camden, NJ), has characterized 179 DNA samples derived from Coriell cell lines. Testing included the recharacterization of 137 genomic DNAs that were genotyped in previous Genetic Testing Reference Material Coordination Program studies and 42 additional samples that had not been characterized previously. DNA samples were distributed to volunteer testing laboratories for genotyping using a variety of commercially available and laboratory-developed tests. These publicly available samples will support the quality-assurance and quality-control programs of clinical laboratories performing CYP2D6 testing. Published by Elsevier Inc.


April 21, 2020

Chromosome-length haplotigs for yak and cattle from trio binning assembly of an F1 hybrid

Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods. Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs. Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates. Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
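
The read-classification idea behind trio binning can be sketched roughly as follows. This is a deliberately simplified illustration, not the authors' implementation: it assumes error-free parental reads, ignores reverse complements and k-mer count thresholds, and assigns each offspring read by a plain majority of parent-specific k-mers. All function names and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets.

    A k-mer is parent-specific if it occurs in one parent's reads
    but not in the other's.
    """
    maternal = set().union(*(kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(kmers(r, k) for r in paternal_reads))
    return maternal - paternal, paternal - maternal


def bin_read(read, maternal_only, paternal_only, k):
    """Assign an offspring read to the parent whose specific k-mers it shares most."""
    read_kmers = kmers(read, k)
    m_hits = len(read_kmers & maternal_only)
    p_hits = len(read_kmers & paternal_only)
    if m_hits > p_hits:
        return "maternal"
    if p_hits > m_hits:
        return "paternal"
    return "unassigned"  # no evidence, or a tie


# Toy data (hypothetical sequences, far shorter than real reads).
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGT"], ["ACGTTTGT"], K)
bins = [bin_read(r, mat_only, pat_only, K) for r in ["CGTACG", "GTTTGT", "ACGT"]]
```

In this toy case the first read carries only maternal-specific k-mers, the second only paternal-specific ones, and the third consists of a k-mer shared by both parents, so it stays unassigned. Higher heterozygosity yields more parent-specific k-mers, which is why the approach benefits from an interspecies cross.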


April 21, 2020

Acquired N-Linked Glycosylation Motifs in B-Cell Receptors of Primary Cutaneous B-Cell Lymphoma and the Normal B-Cell Repertoire.

Primary cutaneous follicle center lymphoma (PCFCL) is a rare mature B-cell lymphoma with an unknown etiology. PCFCL resembles follicular lymphoma (FL) by cytomorphologic and microarchitectural criteria. FL B cells are selected for N-linked glycosylation motifs in their B-cell receptors (BCRs) that are acquired during continuous somatic hypermutation. The stimulation of mannosylated BCR by lectins on the tumor microenvironment is therefore a candidate driver in FL pathogenesis. We investigated whether the same mechanism could play a role in PCFCL pathogenesis. Full-length functional variable, diversity, and joining gene sequences of 18 PCFCL and 8 primary cutaneous diffuse large B-cell lymphoma, leg-type were identified by unbiased Anchoring Reverse Transcription of Immunoglobulin Sequences and Amplification by Nested PCR and BCR reconstruction from RNA sequencing data. Low BCR variation demonstrated negligible ongoing somatic hypermutation in PCFCL and primary cutaneous diffuse large B-cell lymphoma, leg-type, and indicated that the PCFCL microarchitecture does not act as a functional germinal center. Similar to FL but in contrast to primary cutaneous diffuse large B-cell lymphoma, leg-type, BCR genes of 15 PCFCLs (83%) had acquired N-linked glycosylation motifs. These motifs were located at the BCR positions converted to N-linked glycosylation motifs in normal B-cell repertoires with low prevalence but mostly at different positions than those found in FL. The cutaneous localization of PCFCL might suggest a role for lectins from commensal skin bacteria in PCFCL lymphomagenesis. Copyright © 2019 The Authors. Published by Elsevier Inc. All rights reserved.


April 21, 2020

The Chinese chestnut genome: a reference for species restoration

Forest tree species are increasingly subject to severe mortalities from exotic pests, diseases, and invasive organisms, accelerated by climate change. Forest health issues are threatening multiple species and ecosystem sustainability globally. While sources of resistance may be available in related species, or among surviving trees, introgression of resistance genes into threatened tree species in reasonable time frames requires genome-wide breeding tools. Asian species of chestnut (Castanea spp.) are being employed as donors of disease resistance genes to restore native chestnut species in North America and Europe. To aid in the restoration of threatened chestnut species, we present the assembly of a reference genome with chromosome-scale sequences for Chinese chestnut (C. mollissima), the disease-resistance donor for American chestnut restoration. We also demonstrate the value of the genome as a platform for research and species restoration, including new insights into the evolution of blight resistance in Asian chestnut species, the locations in the genome of ecologically important signatures of selection differentiating American chestnut from Chinese chestnut, the identification of candidate genes for disease resistance, and preliminary comparisons of genome organization with related species.


April 21, 2020

The use of Online Tools for Antimicrobial Resistance Prediction by Whole Genome Sequencing in MRSA and VRE.

The antimicrobial resistance (AMR) crisis represents a serious threat to public health and has resulted in concentrated efforts to accelerate development of rapid molecular diagnostics for AMR. In combination with publicly-available web-based AMR databases, whole genome sequencing (WGS) offers the capacity for rapid detection of antibiotic resistance genes. Here we studied the concordance between WGS-based resistance prediction and phenotypic susceptibility testing results for methicillin-resistant Staphylococcus aureus (MRSA) and vancomycin resistant Enterococcus (VRE) clinical isolates using publicly-available tools and databases. Clinical isolates prospectively collected at the University of Pittsburgh Medical Center between December 2016 and December 2017 underwent WGS. Antibiotic resistance gene content was assessed from assembled genomes by BLASTn search of online databases. Concordance between WGS-predicted resistance profile and phenotypic susceptibility as well as sensitivity, specificity, positive and negative predictive values (PPV, NPV) were calculated for each antibiotic/organism combination, using the phenotypic results as the gold standard. Phenotypic susceptibility testing and WGS results were available for 1242 isolate/antibiotic combinations. Overall concordance was 99.3% with a sensitivity, specificity, PPV, NPV of 98.7% (95% CI, 97.2-99.5%), 99.6% (95% CI, 98.8-99.9%), 99.3% (95% CI, 98.0-99.8%), 99.2% (95% CI, 98.3-99.7%), respectively. Additional identification of point mutations in housekeeping genes increased the concordance to 99.4% and the sensitivity to 99.3% (95% CI, 98.2-99.8%) and NPV to 99.4% (95% CI, 98.4-99.8%). WGS can be used as a reliable predictor of phenotypic resistance for both MRSA and VRE using readily-available online tools. Copyright © 2019. Published by Elsevier Ltd.
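
The performance measures named in this abstract follow from a standard 2x2 comparison of predicted versus phenotypic resistance. The sketch below shows the usual definitions, treating the phenotypic result as the gold standard. The function name and the counts in the example are made up for illustration and are not taken from the study; confidence intervals are not computed here.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic measures from 2x2 counts.

    tp: predicted resistant, phenotypically resistant
    fp: predicted resistant, phenotypically susceptible
    tn: predicted susceptible, phenotypically susceptible
    fn: predicted susceptible, phenotypically resistant
    """
    return {
        "sensitivity": tp / (tp + fn),   # resistant isolates correctly predicted
        "specificity": tn / (tn + fp),   # susceptible isolates correctly predicted
        "ppv": tp / (tp + fp),           # resistant predictions that are correct
        "npv": tn / (tn + fn),           # susceptible predictions that are correct
        "concordance": (tp + tn) / (tp + fp + tn + fn),  # overall agreement
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = diagnostic_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these hypothetical counts, sensitivity is 90/100, specificity is 95/100, and concordance is 185/200. The function raises ZeroDivisionError if any denominator is zero, for example when no isolate is phenotypically resistant.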


April 21, 2020

High satellite repeat turnover in great apes studied with short- and long-read technologies.

Satellite repeats are a structural component of centromeres and telomeres, and in some instances their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: (1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and (2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males vs. females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions. © The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.


April 21, 2020

The landscape of SNCA transcripts across synucleinopathies: New insights from long reads sequencing analysis

Dysregulation of alpha-synuclein expression has been implicated in the pathogenesis of synucleinopathies, in particular Parkinson's Disease (PD) and Dementia with Lewy bodies (DLB). Previous studies have shown that the alternatively spliced isoforms of the SNCA gene are differentially expressed in different parts of the brain for PD and DLB patients. Similarly, SNCA isoforms with skipped exons can have a functional impact on the protein domains. The large intronic region of the SNCA gene was also shown to harbor structural variants that affect transcriptional levels. Here we present the first study using long-read sequencing with targeted capture of both the gDNA and cDNA of the SNCA gene in brain tissues of PD, DLB, and control samples using the PacBio Sequel system. The targeted full-length cDNA (Iso-Seq) data confirmed complex usage of known alternative start sites and variable 3' UTR lengths, as well as novel 5' starts and 3' ends not previously described. The targeted gDNA data allowed phasing of up to 81% of the ~114 kb SNCA region, with the longest phased block exceeding 54 kb. We demonstrate that long gDNA and cDNA reads have the potential to reveal long-range information not previously accessible using traditional sequencing methods. This approach has a potential impact in studying disease risk genes such as SNCA, providing new insights into the genetic etiologies, including perturbations to the landscape of the gene transcripts, of human complex diseases such as synucleinopathies.


April 21, 2020

Schizophrenia risk variants influence multiple classes of transcripts of sorting nexin 19 (SNX19).

Genome-wide association studies (GWAS) have identified many genomic loci associated with risk for schizophrenia, but unambiguous identification of the relationship between disease-associated variants and specific genes, and in particular their effect on risk conferring transcripts, has proven difficult. To better understand the specific molecular mechanism(s) at the schizophrenia locus in 11q25, we undertook cis expression quantitative trait loci (cis-eQTL) mapping for this 2 megabase genomic region using postmortem human brain samples. To comprehensively assess the effects of genetic risk upon local expression, we evaluated multiple transcript features (genes, exons, and exon-exon junctions) in multiple brain regions: dorsolateral prefrontal cortex (DLPFC), hippocampus, and caudate. Genetic risk variants strongly associated with expression of SNX19 transcript features that tag multiple rare classes of SNX19 transcripts, whereas they only weakly affected expression of an exon-exon junction that tags the majority of abundant transcripts. The most prominent class of SNX19 risk-associated transcripts is predicted to be overexpressed; it is defined by an exon-exon splice junction between exons 8 and 10 (junc8.10) and is predicted to encode proteins that lack the characteristic nexin C terminal domain. Risk alleles were also associated with either increased or decreased expression of multiple additional classes of transcripts. With RACE, molecular cloning, and long read sequencing, we found a number of novel SNX19 transcripts that further define the set of potential etiological transcripts. We explored epigenetic regulation of SNX19 expression and found that DNA methylation at CpG sites near the primary transcription start site and within exon 2 partially mediates the effects of risk variants on risk-associated expression. ATAC sequencing revealed that some of the most strongly risk-associated SNPs are located within a region of open chromatin, suggesting a nearby regulatory element is involved. These findings indicate a potentially complex molecular etiology, in which risk alleles for schizophrenia generate epigenetic alterations and dysregulation of multiple classes of SNX19 transcripts.

