July 19, 2019

Resolving complex tandem repeats with long reads.

Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders like Huntington’s disease. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may potentially resolve a broader spectrum of TRs given their increased length, but require new approaches given their significantly higher raw error profiles. However, due to inherent error profiles of the single-molecule technologies, these reads present a unique challenge in terms of accurately identifying and estimating the TRs. Here we present PacmonSTR, a reference-based probabilistic approach, to identify the TR region and estimate the number of these TR elements in long DNA reads. We present a multistep approach that requires as input a reference region and the reference TR element. Initially, the TR region is identified from the long DNA reads via a 3-stage modified Smith-Waterman approach, and then the expected number of TR elements is calculated using a pair-hidden Markov model-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criterion and coverage expectations. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
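As a rough illustration of the first step named in the abstract, the sketch below implements the textbook Smith-Waterman local alignment score. It is not PacmonSTR's 3-stage modified variant, which the abstract does not specify; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman with a linear gap penalty.

    Returns (best_score, query_end, target_end), where the end positions are
    1-based indices of the last aligned base. Scoring values are illustrative
    assumptions, not parameters taken from PacmonSTR.
    """
    cols = len(target) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = (0, 0, 0)
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # Local alignment: scores are floored at zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            if score > best[0]:  # strict '>' keeps the first best hit
                best = (score, i, j)
        prev = curr
    return best


# Locate a reference repeat element inside a (toy) read.
print(smith_waterman("CAG", "TTCAGCAGTT"))
```

Only two rows of the matrix are kept in memory, which is enough to recover the best score and its end position; a traceback to recover the full alignment would need the whole matrix.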


July 19, 2019

Identification of a common risk haplotype for canine idiopathic epilepsy in the ADAM23 gene.

Idiopathic epilepsy is a common neurological disease in humans and domestic dogs, but relatively few risk genes have been identified to date. The seizure characteristics, including focal and generalised seizures, are similar between the two species, with gene discovery facilitated by the reduced genetic heterogeneity of purebred dogs. We have recently identified a risk locus for idiopathic epilepsy in the Belgian Shepherd breed on a 4.4 megabase region on CFA37. We have expanded a previous study replicating the association with a combined analysis of 157 cases and 179 controls in three additional breeds: Schipperke, Finnish Spitz and Beagle (pc = 2.9e-07, pGWAS = 1.74E-02). A targeted resequencing of the 4.4 megabase region in twelve Belgian Shepherd cases and twelve controls with opposite haplotypes identified 37 case-specific variants within the ADAM23 gene. Twenty-seven variants were validated in 285 cases and 355 controls from four breeds, resulting in a strong replication of the ADAM23 locus (praw = 2.76e-15) and the identification of a common 28 kb risk haplotype in all four breeds. The risk haplotype was present at frequencies of 0.49-0.7 in the breeds, suggesting that ADAM23 is a low-penetrance risk gene for canine epilepsy. These results implicate ADAM23 in common canine idiopathic epilepsy, although the causative variant remains to be identified. ADAM23 plays a role in synaptic transmission and interacts with known epilepsy genes, LGI1 and LGI2, and should be considered as a candidate gene for human epilepsies.


July 19, 2019

SMRT Sequencing of long tandem nucleotide repeats in SCA10 reveals unique insight of repeat expansion structure.

A large, non-coding ATTCT repeat expansion causes the neurodegenerative disorder, spinocerebellar ataxia type 10 (SCA10). In a subset of SCA10 patients, interruption motifs are present at the 5′ end of the expansion and strongly correlate with epileptic seizures. Thus, interruption motifs are a predictor of the epileptic phenotype and are hypothesized to act as a phenotypic modifier in SCA10. Yet, the exact internal sequence structure of SCA10 expansions remains unknown due to limitations in current technologies for sequencing across long extended tracts of tandem nucleotide repeats. We used the third generation sequencing technology, Single Molecule Real Time (SMRT) sequencing, to obtain full-length contiguous expansion sequences, ranging from 2.5 to 4.4 kb in length, from three SCA10 patients with different clinical presentations. We obtained sequence spanning the entire length of the expansion and identified the structure of known and novel interruption motifs within the SCA10 expansion. The exact interruption patterns in expanded SCA10 alleles will allow us to further investigate the potential contributions of these interrupting sequences to the pathogenic modification leading to the epilepsy phenotype in SCA10. Our results also demonstrate that SMRT sequencing is useful for deciphering long tandem repeats that pose as “gaps” in the human genome sequence.


July 19, 2019

Genetic variation and the de novo assembly of human genomes.

The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.


July 19, 2019

Towards precision medicine.

There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery – including DNA-sequencing technologies and analysis algorithms – need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized that will allow its targeting with much greater therapeutic precision.


July 19, 2019

Single-molecule sequencing revealing the presence of distinct JC polyomavirus populations in patients with progressive multifocal leukoencephalopathy.

Progressive multifocal leukoencephalopathy (PML) is a fatal disease caused by reactivation of JC polyomavirus (JCPyV) in immunosuppressed individuals and lytic infection by neurotropic JCPyV in glial cells. The exact content of neurotropic mutations within individual JCPyV strains has not been studied to our knowledge. We exploited the capacity of single-molecule real-time sequencing technology to determine the sequence of complete JCPyV genomes in single reads. The method was used to precisely characterize individual neurotropic JCPyV strains of 3 patients with PML without the bias caused by assembly of short sequence reads. In the cerebrospinal fluid sample of a 73-year-old woman with rapid PML onset, 3 distinct JCPyV populations could be identified. All viral populations were characterized by rearrangements within the noncoding regulatory region (NCCR) and 1 point mutation, S267L in the VP1 gene, suggestive of neurotropic strains. One patient with PML had a single neurotropic strain with rearranged NCCR, and 1 patient had a single strain with small NCCR alterations. We report here, for the first time, full characterization of individual neurotropic JCPyV strains in the cerebrospinal fluid of patients with PML. It remains to be established whether PML pathogenesis is driven by one or several neurotropic strains in an individual.


July 19, 2019

CGG repeat-induced FMR1 silencing depends on the expansion size in human iPSCs and neurons carrying unmethylated full mutations.

In fragile X syndrome (FXS), CGG repeat expansion greater than 200 triplets is believed to trigger FMR1 gene silencing and disease etiology. However, FXS siblings have been identified with more than 200 CGGs, termed unmethylated full mutation (UFM) carriers, without gene silencing and disease symptoms. Here, we show that hypomethylation of the FMR1 promoter is maintained in induced pluripotent stem cells (iPSCs) derived from two UFM individuals. However, a subset of iPSC clones with large CGG expansions carries silenced FMR1. Furthermore, we demonstrate de novo silencing upon expansion of the CGG repeat size. FMR1 does not undergo silencing during neuronal differentiation of UFM iPSCs, and expression of large unmethylated CGG repeats has phenotypic consequences resulting in neurodegenerative features. Our data suggest that UFM individuals do not lack the cell-intrinsic ability to silence FMR1 and that inter-individual variability in the CGG repeat size required for silencing exists in the FXS population. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.


July 19, 2019

Detecting AGG interruptions in male and female FMR1 premutation carriers by single-molecule sequencing.

The FMR1 gene contains an unstable CGG repeat in its 5′ untranslated region. Premutation alleles range between 55 and 200 repeat units and confer a risk for developing fragile X-associated tremor/ataxia syndrome or fragile X-associated primary ovarian insufficiency. Furthermore, the premutation allele often expands to a full mutation during female germline transmission giving rise to the fragile X syndrome. The risk for a premutation to expand depends mainly on the number of CGG units and the presence of AGG interruptions in the CGG repeat. Unfortunately, the detection of AGG interruptions is hampered by technical difficulties. Here, we demonstrate that single-molecule sequencing enables the determination of not only the repeat size, but also the complete repeat sequence including AGG interruptions in male and female alleles with repeats ranging from 45 to 100 CGG units. We envision this method will facilitate research and diagnostic analysis of the FMR1 repeat expansion. © 2016 WILEY PERIODICALS, INC.
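To make the idea of reading out "the complete repeat sequence including AGG interruptions" concrete, here is a minimal sketch that splits an already-extracted repeat tract into triplets and reports any unit that is not CGG. It assumes an error-free, in-frame tract; the function name `find_interruptions` and the toy allele are hypothetical and are not taken from the study.

```python
def find_interruptions(tract, unit="CGG"):
    """Split a repeat tract into units and list those that differ from `unit`.

    Returns (total_units, interruptions), where interruptions is a list of
    (1-based unit position, observed unit). Assumes the tract starts in frame
    and contains no sequencing errors (a simplifying assumption).
    """
    k = len(unit)
    units = [tract[i:i + k] for i in range(0, len(tract) - k + 1, k)]
    interruptions = [(pos + 1, u) for pos, u in enumerate(units) if u != unit]
    return len(units), interruptions


# Toy allele: 30 units with AGG interruptions at positions 10 and 20.
allele = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10
print(find_interruptions(allele))
```

Real reads would first need the tract boundaries located and indel errors handled, for example by building a consensus from multiple passes or reads, before a per-unit comparison like this is meaningful.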


July 19, 2019

Parkinson’s disease associated with pure ATXN10 repeat

Large, non-coding pentanucleotide repeat expansions of ATTCT in intron 9 of the ATXN10 gene typically cause progressive spinocerebellar ataxia with or without seizures and present neuropathologically with Purkinje cell loss resulting in symmetrical cerebellar atrophy. These ATXN10 repeat expansions can be interrupted by sequence motifs which have been attributed to seizures and are likely to act as genetic modifiers. We identified a Mexican kindred with multiple affected family members with ATXN10 expansions. Four affected family members showed clinical features of spinocerebellar ataxia type 10 (SCA10). However, one affected individual presented with early-onset levodopa-responsive parkinsonism, and one family member carried a large ATXN10 repeat expansion but was clinically unaffected. To characterize the ATXN10 repeat, we used a novel technology of single-molecule real-time (SMRT) sequencing and CRISPR/Cas9-based capture. We sequenced the entire span of ~5.3–7.0 kb repeat expansions. The Parkinson’s patient and an unaffected sister both carried an ATXN10 expansion with no repeat interruption motifs. In the siblings with typical SCA10, we found a repeat pattern of ATTCC repeat motifs that have not been associated with seizures previously. Our data suggest that the absence of repeat interruptions is likely a genetic modifier for the clinical presentation of L-Dopa responsive parkinsonism, whereas repeat interruption motifs contribute clinically to epilepsy. Repeat interruptions are important genetic modifiers of the clinical phenotype in SCA10. Advanced sequencing techniques now allow better characterization of the underlying genetic architecture for determining accurate phenotype–genotype correlations.


July 19, 2019

De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads.

Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna’s hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution. © The Authors 2017. Published by Oxford University Press.


July 19, 2019

Dissecting the causal mechanism of X-linked Dystonia-Parkinsonism by integrating genome and transcriptome assembly.

X-linked Dystonia-Parkinsonism (XDP) is a Mendelian neurodegenerative disease that is endemic to the Philippines and is associated with a founder haplotype. We integrated multiple genome and transcriptome assembly technologies to narrow the causal mutation to the TAF1 locus, which included a SINE-VNTR-Alu (SVA) retrotransposition into intron 32 of the gene. Transcriptome analyses identified decreased expression of the canonical cTAF1 transcript among XDP probands, and de novo assembly across multiple pluripotent stem-cell-derived neuronal lineages discovered aberrant TAF1 transcription that involved alternative splicing and intron retention (IR) in proximity to the SVA that was anti-correlated with overall TAF1 expression. CRISPR/Cas9 excision of the SVA rescued this XDP-specific transcriptional signature and normalized TAF1 expression in probands. These data suggest an SVA-mediated aberrant transcriptional mechanism associated with XDP and may provide a roadmap for layered technologies and integrated assembly-based analyses for other unsolved Mendelian disorders. Copyright © 2018 Elsevier Inc. All rights reserved.


July 19, 2019

Detailed analysis of HTT repeat elements in human blood using targeted amplification-free long-read sequencing.

Amplification of DNA is required as a mandatory step during library preparation in most targeted sequencing protocols. This can be a critical limitation when targeting regions that are highly repetitive or with extreme guanine-cytosine (GC) content, including repeat expansions associated with human disease. Here, we used an amplification-free protocol for targeted enrichment utilizing the CRISPR/Cas9 system (No-Amp Targeted sequencing) in combination with single molecule, real-time (SMRT) sequencing for studying repeat elements in the huntingtin (HTT) gene, where an expanded CAG repeat is causative for Huntington disease. We also developed a robust data analysis pipeline for repeat element analysis that is independent of alignment of reads to a reference genome. The method was applied to 11 diagnostic blood samples, and for all 22 alleles the resulting CAG repeat count agreed with previous results based on fragment analysis. The amplification-free protocol also allowed for studying somatic variability of repeat elements in our samples, without the interference of PCR stutter. In summary, with No-Amp Targeted sequencing in combination with our analysis pipeline, we could accurately study repeat elements that are difficult to investigate using PCR-based methods. © 2018 The Authors. Human Mutation published by Wiley Periodicals, Inc.
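One simple way to count repeat units without aligning a read to a reference is to scan the read for its longest uninterrupted run of the motif. The sketch below does this with a regular expression; it is an illustrative assumption about how an alignment-free count could work, not the authors' pipeline, and the name `longest_repeat_run` is hypothetical.

```python
import re


def longest_repeat_run(read, motif="CAG"):
    """Return the number of motif copies in the longest perfect tandem run.

    Works directly on the read sequence, with no reference alignment.
    Interrupted or error-containing repeats are split into separate runs,
    so this is a lower bound on the true repeat count for noisy reads.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    longest = max((m.group(0) for m in pattern.finditer(read)), key=len, default="")
    return len(longest) // len(motif)


# Toy read: a run of three CAG units, a CAA interruption, then two more CAG units.
print(longest_repeat_run("TTCAGCAGCAGCAACAGCAGTT"))
```

Because a single mismatch ends a run, a practical pipeline would more likely work on consensus reads or tolerate interruptions explicitly; the perfect-run count is kept here only for clarity.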


July 19, 2019

De novo repeat interruptions are associated with reduced somatic instability and mild or absent clinical features in myotonic dystrophy type 1.

Myotonic dystrophy type 1 (DM1) is a multisystem disorder, caused by expansion of a CTG trinucleotide repeat in the 3′-untranslated region of the DMPK gene. The repeat expansion is somatically unstable and tends to increase in length with time, contributing to disease progression. In some individuals, the repeat array is interrupted by variant repeats such as CCG and CGG, stabilising the expansion and often leading to milder symptoms. We have characterised three families, each including one person with variant repeats that had arisen de novo on paternal transmission of the repeat expansion. Two individuals were identified for screening due to an unusual result in the laboratory diagnostic test, and the third due to exceptionally mild symptoms. The presence of variant repeats in all three expanded alleles was confirmed by restriction digestion of small pool PCR products, and allele structures were determined by PacBio sequencing. Each was different, but all contained CCG repeats close to the 3′-end of the repeat expansion. All other family members had inherited pure CTG repeats. The variant repeat-containing alleles were more stable in the blood than pure alleles of similar length, which may in part account for the mild symptoms observed in all three individuals. This emphasises the importance of somatic instability as a disease mechanism in DM1. Further, since patients with variant repeats may have unusually mild symptoms, identification of these individuals has important implications for genetic counselling and for patient stratification in DM1 clinical trials.


July 19, 2019

Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease.

Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9. Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinION was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8× coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained >800× coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was >99% G4C2 content, though we cannot rule out small interruptions. Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.


July 19, 2019

Characterization of a human-specific tandem repeat associated with bipolar disorder and schizophrenia.

Bipolar disorder (BD) and schizophrenia (SCZ) are highly heritable diseases that affect more than 3% of individuals worldwide. Genome-wide association studies have strongly and repeatedly linked risk for both of these neuropsychiatric diseases to a 100 kb interval in the third intron of the human calcium channel gene CACNA1C. However, the causative mutation is not yet known. We have identified a human-specific tandem repeat in this region that is composed of 30 bp units, often repeated hundreds of times. This large tandem repeat is unstable using standard polymerase chain reaction and bacterial cloning techniques, which may have resulted in its incorrect size in the human reference genome. The large 30-mer repeat region is polymorphic in both size and sequence in human populations. Particular sequence variants of the 30-mer are associated with risk status at several flanking single-nucleotide polymorphisms in the third intron of CACNA1C that have previously been linked to BD and SCZ. The tandem repeat arrays function as enhancers that increase reporter gene expression in a human neural progenitor cell line. Different human arrays vary in the magnitude of enhancer activity, and the 30-mer arrays associated with increased psychiatric disease risk status have decreased enhancer activity. Changes in the structure and sequence of these arrays likely contribute to changes in CACNA1C function during human evolution and may modulate neuropsychiatric disease risk in modern human populations. Copyright © 2018. Published by Elsevier Inc.

