Long expansions of short tandem repeats (STRs), i.e. DNA repeats of 2-6 nt, are associated with some genetic diseases. Cost-efficient high-throughput sequencing can quickly produce billions of short reads that would be useful for uncovering disease-associated STRs. However, enumerating STRs in short reads remains largely unexplored because of the difficulty in elucidating STRs much longer than 100 bp, the typical length of short reads. We propose ab initio procedures for sensing and locating long STRs promptly by using the frequency distribution of all STRs and paired-end read information. We validated the reproducibility of this method using biological replicates and used it to locate an STR associated with a brain disease (SCA31). Subsequently, we sequenced this STR site in 11 SCA31 samples using SMRT(TM) sequencing (Pacific Biosciences), determined 2.3-3.1 kb sequences at nucleotide resolution and revealed that (TGGAA)- and (TAAAATAGAA)-repeat expansions determined the instability of the repeat expansions associated with SCA31. Our method could also identify common STRs, (AAAG)- and (AAAAG)-repeat expansions, which are remarkably expanded at four positions in an SCA31 sample. This is the first proposed method for rapidly finding disease-associated long STRs in personal genomes using hybrid sequencing of short and long reads. Our TRhist software is available at http://email@example.com. Supplementary data are available at Bioinformatics online.
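The idea of a frequency distribution of STRs in short reads can be illustrated with a minimal Python sketch. This is not the TRhist implementation, whose algorithm the abstract does not describe. The function names, the regular expression, the three-copy minimum and the rotation-based canonical unit are all assumptions made here for illustration only.

```python
import re
from collections import Counter

# Assumption: an STR is a 2-6 nt unit repeated at least three times in a row.
# The lazy quantifier makes the regex prefer the shortest repeating unit.
STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def canonical_unit(unit):
    """Return the smallest rotation of a unit, so TGGAA and GGAAT are pooled."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def str_histogram(reads):
    """Count (canonical unit, run length in nt) pairs over a collection of reads."""
    hist = Counter()
    for read in reads:
        for match in STR_PATTERN.finditer(read.upper()):
            unit = match.group(1)
            if len(set(unit)) == 1:
                # Skip homopolymer runs such as AAAAAA, which are not 2-6 nt STRs.
                continue
            hist[(canonical_unit(unit), len(match.group(0)))] += 1
    return hist


if __name__ == "__main__":
    # Hypothetical toy reads, not real sequencing data.
    toy_reads = ["ACGT" + "TGGAA" * 4 + "CCG", "GGAAT" * 3]
    for (unit, length), count in sorted(str_histogram(toy_reads).items()):
        print(unit, length, count)
```

In such a histogram, runs that fill an entire read would hint at repeats longer than the read itself. Locating them would still require paired-end information, as the abstract notes.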
The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development.
Programmed DNA rearrangements in the single-celled eukaryote Oxytricha trifallax completely rewire its germline into a somatic nucleus during development. This elaborate, RNA-mediated pathway eliminates noncoding DNA sequences that interrupt gene loci and reorganizes the remaining fragments by inversions and permutations to produce functional genes. Here, we report the Oxytricha germline genome and compare it to the somatic genome to present a global view of its massive scale of genome rearrangements. The remarkably encrypted genome architecture contains >3,500 scrambled genes, as well as >800 predicted germline-limited genes expressed, and some posttranslationally modified, during genome rearrangements. Gene segments for different somatic loci often interweave with each other. Single gene segments can contribute to multiple, distinct somatic loci. Terminal precursor segments from neighboring somatic loci map extremely close to each other, often overlapping. This genome assembly provides a draft of a scrambled genome and a powerful model for studies of genome rearrangement. Copyright © 2014 Elsevier Inc. All rights reserved.
Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the successful utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.
PacBio-LITS: a large-insert targeted sequencing method for characterization of human disease-associated chromosomal structural variations.
Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high. We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki-Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants. The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and small insertion and deletion validation could also benefit from this technology.
Fcγ receptors (FcγRs) are key immune receptors responsible for the effective control of both humoral and innate immunity and are central to maintaining the balance between generating appropriate responses to infection and preventing autoimmunity. When this balance is lost, pathology results in increased susceptibility to cancer, autoimmunity, and infection. In contrast, optimal FcγR engagement facilitates effective disease resolution and response to monoclonal antibody immunotherapy. The underlying genetics of the FcγR gene family are a central component of this careful balance. The gene family is complex in humans and was generated through ancestral duplication events. Here we review its evolution in mammals, the potential importance of copy number, and functionally relevant single nucleotide polymorphisms, and we discuss current approaches and limitations when exploring genetic variation in this region. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Analysis of the complete Mycoplasma hominis LBD-4 genome sequence reveals strain-variable prophage insertion and distinctive repeat-containing surface protein arrangements.
The complete genome sequence of Mycoplasma hominis LBD-4 has been determined and the gene content ascribed. The 715,165-bp chromosome contains 620 genes, including 14 carried by a strain-variable prophage genome related to Mycoplasma fermentans MFV-1 and Mycoplasma arthritidis MAV-1. Comparative analysis with the genome of M. hominis PG21(T) reveals distinctive arrangements of repeat-containing surface proteins. Copyright © 2015 Calcutt and Foecking.
The intractability of homogeneous α-satellite arrays has impeded understanding of human centromeres. Artificial centromeres are produced from higher-order repeats (HORs) present at centromere edges, although the exact sequences and chromatin conformations of centromere cores remain unknown. We use high-resolution chromatin immunoprecipitation (ChIP) of centromere components followed by clustering of sequence data as an unbiased approach to identify functional centromere sequences. We find that specific dimeric α-satellite units shared by multiple individuals dominate functional human centromeres. We identify two recently homogenized α-satellite dimers that are occupied by precisely positioned CENP-A (cenH3) nucleosomes with two ~100-base pair (bp) DNA wraps in tandem separated by a CENP-B/CENP-C-containing linker, whereas pericentromeric HORs show diffuse positioning. Precise positioning is largely maintained, whereas abundance decreases exponentially with divergence, which suggests that young α-satellite dimers with paired ~100-bp particles mediate evolution of functional human centromeres. Our unbiased strategy for identifying functional centromeric sequences should be generally applicable to tandem repeat arrays that dominate the centromeres of most eukaryotes.
Complete sequences of organelle genomes from the medicinal plant Rhazya stricta (Apocynaceae) and contrasting patterns of mitochondrial genome evolution across asterids.
Rhazya stricta is native to arid regions in South Asia and the Middle East and is used extensively in folk medicine to treat a wide range of diseases. In addition to generating genomic resources for this medicinally important plant, analyses of the complete plastid and mitochondrial genomes and a nuclear transcriptome from Rhazya provide insights into inter-compartmental transfers between genomes and the patterns of evolution among eight asterid mitochondrial genomes. The 154,841 bp plastid genome is highly conserved with gene content and order identical to the ancestral organization of angiosperms. The 548,608 bp mitochondrial genome exhibits a number of phenomena including the presence of recombinogenic repeats that generate a multipartite organization, transferred DNA from the plastid and nuclear genomes, and bidirectional DNA transfers between the mitochondrion and the nucleus. The mitochondrial genes sdh3 and rps14 have been transferred to the nucleus and have acquired targeting presequences. In the case of rps14, two copies are present in the nucleus; only one has a mitochondrial targeting presequence and may be functional. Phylogenetic analyses of both nuclear and mitochondrial copies of rps14 across angiosperms suggest Rhazya has experienced a single transfer of this gene to the nucleus, followed by a duplication event. Furthermore, the phylogenetic distribution of gene losses and the high level of sequence divergence in targeting presequences suggest multiple, independent transfers of both sdh3 and rps14 across asterids.
Comparative analyses of mitochondrial genomes of eight sequenced asterids indicate a complicated evolutionary history in this large angiosperm clade with considerable diversity in genome organization and size, repeat, gene and intron content, and amount of foreign DNA from the plastid and nuclear genomes. Organelle genomes of Rhazya stricta provide valuable information for improving the understanding of mitochondrial genome evolution among angiosperms. The genomic data have enabled a rigorous examination of the gene transfer events. Rhazya is unique among the eight sequenced asterids in the types of events that have shaped the evolution of its mitochondrial genome. Furthermore, the organelle genomes of R. stricta provide valuable genomic resources for utilizing this important medicinal plant in biotechnology applications.
Sex chromosomes harbour a primary sex-determining signal that triggers sexual development of the organism. However, diverse sex chromosome systems have evolved in vertebrates. Here we use positional cloning to identify the sex-determining locus of a medaka-related fish, Oryzias dancena, and find that the locus on the Y chromosome contains a cis-regulatory element that upregulates neighbouring Sox3 expression in developing gonad. Sex-reversed phenotypes in Sox3(Y) transgenic fish and Sox3(Y) loss-of-function mutants all point to its critical role in sex determination. Furthermore, we demonstrate that Sox3 initiates testicular differentiation by upregulating expression of downstream Gsdf, which is highly conserved in fish sex differentiation pathways. Our results not only provide strong evidence for the independent recruitment of Sox3 to male determination in distantly related vertebrates, but also provide direct evidence that a novel sex determination pathway has evolved through co-option of a transcriptional regulator that potentially interacts with a conserved downstream component.
Potential impact on kidney infection: a whole-genome analysis of Leptospira santarosai serovar Shermani.
Leptospira santarosai serovar Shermani is the most frequently encountered serovar, and it causes leptospirosis and tubulointerstitial nephritis in Taiwan. This study aims to complete the genome sequence of L. santarosai serovar Shermani and analyze the transcriptional responses of L. santarosai serovar Shermani to renal tubular cells. To assemble this highly repetitive genome, we combined reads generated from four next-generation sequencing platforms using hybrid assembly approaches, and finished two gap-free contiguous chromosome sequences by validating the data with optical restriction maps and Sanger sequencing. Whole-genome comparison studies revealed a 28-kb region containing genes that encode transposases and hypothetical proteins in L. santarosai serovar Shermani, but this region is absent in other pathogenic Leptospira spp. We found that lipoprotein gene expression in both L. santarosai serovar Shermani and L. interrogans serovar Copenhageni was upregulated upon interaction with renal tubular cells, and LSS19962, a L. santarosai serovar Shermani-specific gene within a 28-kb region that encodes hypothetical proteins, was upregulated in L. santarosai serovar Shermani-infected renal tubular cells. Lipoprotein expression during leptospiral infection might facilitate the interactions of leptospires within kidneys. The availability of the whole-genome sequence of L. santarosai serovar Shermani would make it the first completed sequence of this species, and its comparison with that of other Leptospira spp. may provide invaluable information for further studies in leptospiral pathogenesis.
Filling in the gap of human chromosome 4: Single Molecule Real Time sequencing of macrosatellite repeats in the facioscapulohumeral muscular dystrophy locus.
A majority of facioscapulohumeral muscular dystrophy (FSHD) cases are caused by contraction of macrosatellite repeats called D4Z4 that are located in the subtelomeric region of human chromosome 4q35. Sequencing the FSHD locus has been technically challenging due to its large size and the nearly identical nature of its repeat elements. Here we report sequencing and partial assembly of a BAC clone carrying an entire FSHD locus by a single molecule real time (SMRT) sequencing technology, which could produce long reads of up to about 18 kb containing D4Z4 repeats. De novo assembly by Hierarchical Genome Assembly Process 1 (HGAP.1) yielded a contig of 41 kb containing all but a part of the most distal D4Z4 element. The validity of the sequence model was confirmed by an independent approach employing anchored multiple sequence alignment by Kalign using reads containing unique flanking sequences. Our data will provide a basis for further optimization of sequencing and assembly conditions of D4Z4.