July 7, 2019

Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory.

Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing. We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective. The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mappability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.


July 7, 2019

Medulloblastoma exome sequencing uncovers subtype-specific somatic mutations.

Medulloblastomas are the most common malignant brain tumours in children. Identifying and understanding the genetic events that drive these tumours is critical for the development of more effective diagnostic, prognostic and therapeutic strategies. Recently, our group and others described distinct molecular subtypes of medulloblastoma on the basis of transcriptional and copy number profiles. Here we use whole-exome hybrid capture and deep sequencing to identify somatic mutations across the coding regions of 92 primary medulloblastoma/normal pairs. Overall, medulloblastomas have low mutation rates consistent with other paediatric tumours, with a median of 0.35 non-silent mutations per megabase. We identified twelve genes mutated at statistically significant frequencies, including previously known mutated genes in medulloblastoma such as CTNNB1, PTCH1, MLL2, SMARCA4 and TP53. Recurrent somatic mutations were newly identified in an RNA helicase gene, DDX3X, often concurrent with CTNNB1 mutations, and in the nuclear co-repressor (N-CoR) complex genes GPS2, BCOR and LDB1. We show that mutant DDX3X potentiates transactivation of a TCF promoter and enhances cell viability in combination with mutant, but not wild-type, β-catenin. Together, our study reveals the alteration of WNT, hedgehog, histone methyltransferase and now N-CoR pathways across medulloblastomas and within specific subtypes of this disease, and nominates the RNA helicase DDX3X as a component of pathogenic β-catenin signalling in medulloblastoma.


July 7, 2019

An Inv(16)(p13.3q24.3)-encoded CBFA2T3-GLIS2 fusion protein defines an aggressive subtype of pediatric acute megakaryoblastic leukemia.

To define the mutation spectrum in non-Down syndrome acute megakaryoblastic leukemia (non-DS-AMKL), we performed transcriptome sequencing on diagnostic blasts from 14 pediatric patients and validated our findings in a recurrency/validation cohort consisting of 34 pediatric and 28 adult AMKL samples. Our analysis identified a cryptic chromosome 16 inversion (inv(16)(p13.3q24.3)) in 27% of pediatric cases, which encodes a CBFA2T3-GLIS2 fusion protein. Expression of CBFA2T3-GLIS2 in Drosophila and murine hematopoietic cells induced bone morphogenic protein (BMP) signaling and resulted in a marked increase in the self-renewal capacity of hematopoietic progenitors. These data suggest that expression of CBFA2T3-GLIS2 directly contributes to leukemogenesis. Copyright © 2012 Elsevier Inc. All rights reserved.


July 7, 2019

Structural variation analysis with strobe reads.

Structural variation, including deletions, duplications and rearrangements of DNA sequence, is an important contributor to genome variation in many organisms. In human, many structural variants are found in complex and highly repetitive regions of the genome, making their identification difficult. A new sequencing technology called strobe sequencing generates strobe reads containing multiple subreads from a single contiguous fragment of DNA. Strobe reads thus generalize the concept of paired reads, or mate pairs, that have been routinely used for structural variant detection. Strobe sequencing holds promise for unraveling complex variants that have been difficult to characterize with current sequencing technologies. We introduce an algorithm for identification of structural variants using strobe sequencing data. We consider strobe reads from a test genome that have multiple possible alignments to a reference genome due to sequencing errors and/or repetitive sequences in the reference. We formulate the combinatorial optimization problem of finding the minimum number of structural variants in the test genome that are consistent with these alignments. We solve this problem using an integer linear program. Using simulated strobe sequencing data, we show that our algorithm has better sensitivity and specificity than paired read approaches for structural variation identification. Contact: braphael@brown.edu
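
As an illustration only, the optimization problem described in this abstract can be sketched on a toy instance. The sketch below is not the authors' integer linear program: it assumes each read comes with a small list of candidate alignments, that each alignment implies a set of structural variant labels, and it simply enumerates every combination to find the one that needs the fewest distinct variants. The function name, read names and variant labels are all hypothetical.

```python
from itertools import product


def min_variant_explanation(read_alignments):
    """Pick one alignment per read so that the union of implied variants is smallest.

    read_alignments maps a read id to a list of candidate alignments, where each
    candidate is a frozenset of (hypothetical) structural variant labels.
    Brute-force enumeration: only suitable for toy inputs, unlike an ILP solver.
    """
    reads = sorted(read_alignments)
    best_choice, best_variants = None, None
    index_ranges = (range(len(read_alignments[r])) for r in reads)
    for choice in product(*index_ranges):
        variants = set()
        for read, idx in zip(reads, choice):
            variants |= read_alignments[read][idx]
        if best_variants is None or len(variants) < len(best_variants):
            best_choice, best_variants = dict(zip(reads, choice)), variants
    return best_choice, best_variants


# Toy instance: three reads, each with two ambiguous alignments.
toy = {
    "read1": [frozenset({"del_A"}), frozenset({"inv_B"})],
    "read2": [frozenset({"del_A"}), frozenset({"dup_C"})],
    "read3": [frozenset({"inv_B", "dup_C"}), frozenset({"del_A"})],
}
choice, variants = min_variant_explanation(toy)
print(choice, variants)
```

In this toy case a single deletion explains all three reads, so the ambiguous alignments that would imply an inversion or duplication are discarded. Enumeration grows exponentially with the number of reads, which is why a formulation such as an integer linear program is needed at realistic scale.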


July 7, 2019

LoRTE: Detecting transposon-induced genomic variants using low coverage PacBio long read sequences.

Population genomic analysis of transposable elements has greatly benefited from recent advances in sequencing technologies. However, the short size of the reads and the propensity of transposable elements to nest in highly repeated regions of genomes limit the efficiency of bioinformatic tools when Illumina or 454 technologies are used. Fortunately, long read sequencing technologies generating reads that may span the entire length of full transposons are now available. However, existing TE population genomic software was not designed to handle long reads, and the development of new dedicated tools is needed. LoRTE is the first tool able to use PacBio long read sequences to identify transposon deletions and insertions between a reference genome and genomes of different strains or populations. Tested against simulated and genuine Drosophila melanogaster PacBio datasets, LoRTE appears to be a reliable and broadly applicable tool to study the dynamic and evolutionary impact of transposable elements using low coverage, long read sequences. LoRTE is an efficient and accurate tool to identify structural genomic variants caused by TE insertion or deletion. LoRTE is available for download at http://www.egce.cnrs-gif.fr/?p=6422.


July 7, 2019

The comparative landscape of duplications in Heliconius melpomene and Heliconius cydno.

Gene duplications can facilitate adaptation and may lead to interpopulation divergence, causing reproductive isolation. We used whole-genome resequencing data from 34 butterflies to detect duplications in two Heliconius species, Heliconius cydno and Heliconius melpomene. Taking advantage of three distinctive signals of duplication in short-read sequencing data, we identified 744 duplicated loci in H. cydno and H. melpomene and evaluated the accuracy of our approach using single-molecule sequencing. We have found that duplications overlap genes significantly less than expected at random in H. melpomene, consistent with the action of background selection against duplicates in functional regions of the genome. Duplicate loci that are highly differentiated between H. melpomene and H. cydno map to four different chromosomes. Four duplications were identified with a strong signal of divergent selection, including an odorant binding protein and another in close proximity with a known wing colour pattern locus that differs between the two species. Heredity advance online publication, 7 December 2016; doi:10.1038/hdy.2016.107.


July 7, 2019

Structure and evolution of the filaggrin gene repeated region in primates

The evolutionary dynamics of repeat sequences is quite complex, with some duplicates never having differentiated from each other. Two models can explain the complex evolutionary process for repeated genes—concerted and birth-and-death, of which the latter is driven by duplications maintained by selection. Copy number variations caused by random duplications and losses in repeat regions may modulate molecular pathways and therefore affect phenotypic characteristics in a population, resulting in individuals that are able to adapt to new environments. In this study, we investigated the filaggrin gene (FLG), which codes for filaggrin—an important component of the outer layers of mammalian skin—and contains tandem repeats that exhibit copy number variation between and within species. To examine which model best fits the evolutionary pathway for the complete tandem repeats within a single exon of FLG, we determined the repeat sequences in crab-eating macaque (Macaca fascicularis), orangutan (Pongo abelii), gorilla (Gorilla gorilla), and chimpanzee (Pan troglodytes) and compared these with the sequence in human (Homo sapiens).


July 7, 2019

A vast genomic deletion in the C56BL/6 genome affects different genes within the Ifi200 cluster on chromosome 1 and mediates obesity and insulin resistance.

Obesity, the excessive accumulation of body fat, is a highly heritable and genetically heterogeneous disorder. The complex, polygenic basis for the disease, consisting of a network of different gene variants, is still not completely known. In the current study we generated a BAC library of the obesity-prone NZO strain to clarify the genomic alteration within the gene cluster Ifi200 on chr.1 including Ifi202b, an obesity gene that, in contrast to NZO, is not expressed in the lean B6 mouse. With the PacBio sequencing data of NZO BAC clones we identified a deletion spanning approximately 261.8 kb in the B6 reference genome. The deletion affects different members of the Ifi200 gene family, including the original first exon and 5′-regulatory parts of the Ifi202b gene, and appears to be the relevant cause of its expression deficiency in B6. In addition, the generation and characterization of congenic mice carrying the critical fragment on the B6 background demonstrate its crucial role for obesity and insulin resistance. Our data reveal the reconstruction of a complex genomic region on mouse chr.1 resulting from deletions and duplications of Ifi200 genes and suggest that it is relevant for the development of obesity. The results further demonstrate the complexity of the disease and highlight the importance of studying rare genetic variants, as they can be causal for large effects.


July 7, 2019

Variant tolerant read mapping using min-hashing

DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, there are two trends that are changing the landscape of read mapping. First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is not considered as a single string over ACGT, but as a complex object containing these variants. Most existing read mappers do not handle these new circumstances appropriately.
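
The abstract above does not describe the mapper itself, so the following is only a generic sketch of the min-hashing idea named in the title: two sequences are reduced to short signatures of minimum k-mer hash values, and the fraction of matching signature positions estimates the Jaccard similarity of their k-mer sets. The function names, the k-mer length, the number of hash functions and the use of seeded BLAKE2b hashes are all assumptions chosen for illustration, not details taken from the paper.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (one 'hash function' per seed)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=8, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates k-mer Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Usage: compare a read against a candidate reference window.
window = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCGTTAGCAAGT"
read = window  # an error-free read, for illustration
print(estimated_jaccard(minhash_signature(read), minhash_signature(window)))
```

Because signatures are much shorter than the sequences they summarize, candidate reference windows can be compared cheaply before any base-level alignment is attempted. How sequencing errors and known variants are accommodated is specific to the published method and is not modeled here.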


July 7, 2019

A genomic view of short tandem repeats.

Short tandem repeats (STRs) are some of the fastest mutating loci in the genome. Tools for accurately profiling STRs from high-throughput sequencing data have enabled genome-wide interrogation of more than a million STRs across hundreds of individuals. These catalogs have revealed that STRs are highly multiallelic and may contribute more de novo mutations than any other variant class. Recent studies have leveraged these catalogs to show that STRs play a widespread role in regulating gene expression and other molecular phenotypes. These analyses suggest that STRs are an underappreciated but rich reservoir of variation that likely make significant contributions to Mendelian diseases, complex traits, and cancer. Copyright © 2017 Elsevier Ltd. All rights reserved.


July 7, 2019

Efficient CNV breakpoint analysis reveals unexpected structural complexity and correlation of dosage-sensitive genes with clinical severity in genomic disorders.

Genomic disorders are the clinical conditions manifested by submicroscopic genomic rearrangements including copy number variants (CNVs). The CNVs can be identified by array-based comparative genomic hybridization (aCGH), the most commonly used technology for molecular diagnostics of genomic disorders. However, clinical aCGH only informs CNVs in the probe-interrogated regions. Neither orientational information nor the resulting genomic rearrangement structure is provided, which is a key to uncovering mutational and pathogenic mechanisms underlying genomic disorders. Long-range polymerase chain reaction (PCR) is a traditional approach to obtain CNV breakpoint junction, but this method is inefficient when challenged by structural complexity such as often found at the PLP1 locus in association with Pelizaeus-Merzbacher disease (PMD). Here we introduced ‘capture and single-molecule real-time sequencing’ (cap-SMRT-seq) and newly developed ‘asymmetry linker-mediated nested PCR walking’ (ALN-walking) for CNV breakpoint sequencing in 49 subjects with PMD-associated CNVs. Remarkably, 29 (94%) of the 31 CNV breakpoint junctions unobtainable by conventional long-range PCR were resolved by cap-SMRT-seq and ALN-walking. Notably, unexpected CNV complexities, including inter-chromosomal rearrangements that cannot be resolved by aCGH, were revealed by efficient breakpoint sequencing. These sequence-based structures of PMD-associated CNVs further support the role of DNA replicative mechanisms in CNV mutagenesis, and facilitate genotype-phenotype correlation studies. Intriguingly, the lengths of gained segments by CNVs are strongly correlated with clinical severity in PMD, potentially reflecting the functional contribution of other dosage-sensitive genes besides PLP1. 
Our study provides new efficient experimental approaches (especially ALN-walking) for CNV breakpoint sequencing and highlights their importance in uncovering CNV mutagenesis and pathogenesis in genomic disorders. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.


July 7, 2019

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.

The human reference genome assembly plays a central role in nearly all aspects of today’s basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health. © 2017 Schneider et al.; Published by Cold Spring Harbor Laboratory Press.


July 7, 2019

HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies.

Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10-50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery. © 2017 Fan et al.; Published by Cold Spring Harbor Laboratory Press.


July 7, 2019

Resequencing array for gene variant detection in malignant hyperthermia and butyrylcholinestherase deficiency.

Malignant hyperthermia (MH) and butyrylcholinestherase (BCHE) deficiency are two relevant pharmacogenetic disorders in anesthetic practice linked with sequence variants, the former in the RyR1 and CACNA1S genes, the latter in the BCHE gene. Genotyping for known pathogenic variants in these genes is useful to help identify susceptible individuals, and others may exist but remain unknown, because full-length sequence of these genes is, in general, not investigated. To facilitate this task, we developed a resequencing DNA array, the perioperative patient safety (POPS) array, to be able to screen the entire coding sequences of the RyR1, CACNA1S and BCHE genes. MH-susceptible individuals (n = 121) identified with the in vitro contracture test, the standard diagnostic tool for MH susceptibility, were genotyped with the arrays. Compared with capillary sequencing, call rates with the arrays could achieve 100% at maximal sensitivity, although to reduce false positive rates, sensitivity was adjusted to 0.85, 0.87 and 0.66 for RyR1, CACNA1S and BCHE respectively, with overall base call specificity exceeding 99%. Detection of 29 predetermined RyR1 variants in 44 individuals was successful in 97% of the cases, among them all 16 variants of established diagnostic value. In a trial application of the arrays, 21 MH-susceptible subjects with no known RyR1 or CACNA1S variants were screened, resulting in the discovery of new variants, all confirmed by capillary sequencing. In conclusion, arrays offer an efficient high-throughput alternative for diagnostic genotyping of candidate genes affecting MH susceptibility, BCHE deficiency and other neuromuscular disorders, simultaneously enabling a comprehensive search for rare variants in these genes. Copyright © 2017 Elsevier B.V. All rights reserved.


July 7, 2019

Detection and assessment of copy number variation using PacBio long-read and Illumina sequencing in New Zealand dairy cattle.

Single nucleotide polymorphisms have been the DNA variant of choice for genomic prediction, largely because of the ease of single nucleotide polymorphism genotype collection. In contrast, structural variants (SV), which include copy number variants (CNV), translocations, insertions, and inversions, have eluded easy detection and characterization, particularly in nonhuman species. However, evidence increasingly shows that SV not only contribute a substantial proportion of genetic variation but also have significant influence on phenotypes. Here we present the discovery of CNV in a prominent New Zealand dairy bull using long-read PacBio (Pacific Biosciences, Menlo Park, CA) sequencing technology and the Sniffles SV discovery tool (version 0.0.1; https://github.com/fritzsedlazeck/Sniffles). The CNV identified from long reads were compared with CNV discovered in the same bull from Illumina sequencing using CNVnator (read depth-based tool; Illumina Inc., San Diego, CA) as a means of validation. Subsequently, further validation was undertaken using whole-genome Illumina sequencing of 556 cattle representing the wider New Zealand dairy cattle population. Very limited overlap was observed in CNV discovered from the 2 sequencing platforms, in part because of the differences in size of CNV detected. Only a few CNV were therefore able to be validated using this approach. However, the ability to use CNVnator to genotype the 557 cattle for copy number across all regions identified as putative CNV allowed a genome-wide assessment of transmission level of copy number based on pedigree. The more highly transmissible a putative CNV region was observed to be, the more likely the distribution of copy number was multimodal across the 557 sequenced animals. Furthermore, visual assessment of highly transmissible CNV regions provided evidence supporting the presence of CNV across the sequenced animals. 
This transmission-based approach was able to confirm a subset of CNV that segregates in the New Zealand dairy cattle population. Genome-wide identification and validation of CNV is an important step toward their inclusion in genomic selection strategies. The Authors. Published by the Federation of Animal Science Societies and Elsevier Inc. on behalf of the American Dairy Science Association®. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

