Structural variants (SVs, differences >50 base pairs) account for most of the base pairs that differ between two human genomes, and are known to cause over 1,000 genetic disorders including…
In this presentation Fritz Sedlazeck describes his latest work to obtain comprehensive genomes leveraging long-read sequencing and linked reads.
In this PacBio User Group Meeting presentation, Jonas Korlach and Roberto Lleras share the latest updates to the structural variation application and analysis tools.
Hybrid sterility is one of the earliest postzygotic isolating mechanisms to evolve between two recently diverged species. Here we identify causes underlying hybrid infertility of two recently diverged fission yeast species Schizosaccharomyces pombe and S. kambucha, which mate to form viable hybrid diploids that efficiently complete meiosis, but generate few viable gametes. We find that chromosomal rearrangements and related recombination defects are major but not sole causes of hybrid infertility. At least three distinct meiotic drive alleles, one on each S. kambucha chromosome, independently contribute to hybrid infertility by causing nonrandom spore death. Two of these driving loci are linked by a chromosomal translocation and thus constitute a novel type of paired meiotic drive complex. Our study reveals how quickly multiple barriers to fertility can arise. In addition, it provides further support for models in which genetic conflicts, such as those caused by meiotic drive alleles, can drive speciation.DOI: http://dx.doi.org/10.7554/eLife.02630.001. Copyright © 2014, Zanders et al.
As resequencing projects become more prevalent across a larger number of species, accurate variant identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies provide high-resolution access to structural variants often inaccessible to shorter reads.We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long reads (>10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two genomic features in a strain of Escherichia coli with PBHoney and validate them via de novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/.Implementing two variant-identification approaches that exploit the high mappability of long reads, PBHoney is demonstrated as being effective at detecting larger structural variants using whole-genome Pacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the existence of Rac-Phage in isolate; evidence of E. coli’s circular genome.
Palindromic GOLGA8 core duplicons promote chromosome 15q13.3 microdeletion and evolutionary instability.
Recurrent deletions of chromosome 15q13.3 associate with intellectual disability, schizophrenia, autism and epilepsy. To gain insight into the instability of this region, we sequenced it in affected individuals, normal individuals and nonhuman primates. We discovered five structural configurations of the human chromosome 15q13.3 region ranging in size from 2 to 3 Mb. These configurations arose recently (~0.5-0.9 million years ago) as a result of human-specific expansions of segmental duplications and two independent inversion events. All inversion breakpoints map near GOLGA8 core duplicons-a ~14-kb primate-specific chromosome 15 repeat that became organized into larger palindromic structures. GOLGA8-flanked palindromes also demarcate the breakpoints of recurrent 15q13.3 microdeletions, the expansion of chromosome 15 segmental duplications in the human lineage and independent structural changes in apes. The significant clustering (P = 0.002) of breakpoints provides mechanistic evidence for the role of this core duplicon and its palindromic architecture in promoting the evolutionary and disease-related instability of chromosome 15.
Robertsonian translocations resulting in fusions between sex chromosomes and autosomes shape karyotype evolution by creating new sex chromosomes from autosomes. These translocations can also reverse sex chromosomes back into autosomes, which is especially intriguing given the dramatic differences between autosomes and sex chromosomes. To study the genomic events following a Y chromosome reversal, we investigated an autosome-Y translocation in Drosophila pseudoobscura. The ancestral Y chromosome fused to a small autosome (the dot chromosome) approximately 10–15 Mya. We used single molecule real-time sequencing reads to assemble the D. pseudoobscura dot chromosome, including this Y-to-dot translocation. We find that the intervening sequence between the ancestral Y and the rest of the dot chromosome is only ~78 Kb and is not repeat-dense, suggesting that the centromere now falls outside, rather than between, the fused chromosomes. The Y-to-dot region is 100 times smaller than the D. melanogaster Y chromosome, owing to changes in repeat landscape. However, we do not find a consistent reduction in intron sizes across the Y-to-dot region. Instead, deletions in intergenic regions and possibly a small ancestral Y chromosome size may explain the compact size of the Y-to-dot translocation.
Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms.
Resequencing or reference-based assemblies reveal large parts of the small-scale sequence variation. However, they typically fail to separate such local variation into colinear and rearranged variation, because they usually do not recover the complement of large-scale rearrangements, including transpositions and inversions. Besides the availability of hundreds of genomes of diverse Arabidopsis thaliana accessions, there is so far only one full-length assembled genome: the reference sequence. We have assembled 117 Mb of the A. thaliana Landsberg erecta (Ler) genome into five chromosome-equivalent sequences using a combination of short Illumina reads, long PacBio reads, and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inversions comprising ~3.6 Mb, in addition to 4.1 Mb of nonreference sequence, mostly originating from duplications. Although rearranged regions are not different in local divergence from colinear regions, they are drastically depleted for meiotic recombination in heterozygotes. Using a 1.2-Mb inversion as an example, we show that such rearrangement-mediated reduction of meiotic recombination can lead to genetically isolated haplotypes in the worldwide population of A. thaliana Moreover, we found 105 single-copy genes, which were only present in the reference sequence or the Ler assembly, and 334 single-copy orthologs, which showed an additional copy in only one of the genomes. To our knowledge, this work gives first insights into the degree and type of variation, which will be revealed once complete assemblies will replace resequencing or other reference-dependent methods.
Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63.
Asian cultivated rice consists of two subspecies: Oryza sativa subsp. indica and O. sativa subsp. japonica Despite the fact that indica rice accounts for over 70% of total rice production worldwide and is genetically much more diverse, a high-quality reference genome for indica rice has yet to be published. We conducted map-based sequencing of two indica rice lines, Zhenshan 97 (ZS97) and Minghui 63 (MH63), which represent the two major varietal groups of the indica subspecies and are the parents of an elite Chinese hybrid. The genome sequences were assembled into 237 (ZS97) and 181 (MH63) contigs, with an accuracy >99.99%, and covered 90.6% and 93.2% of their estimated genome sizes. Comparative analyses of these two indica genomes uncovered surprising structural differences, especially with respect to inversions, translocations, presence/absence variations, and segmental duplications. Approximately 42% of nontransposable element related genes were identical between the two genomes. Transcriptome analysis of three tissues showed that 1,059-2,217 more genes were expressed in the hybrid than in the parents and that the expressed genes in the hybrid were much more diverse due to their divergence between the parental genomes. The public availability of two high-quality reference genomes for the indica subspecies of rice will have large-ranging implications for plant biology and crop genetic improvement.
An Inv(16)(p13.3q24.3)-encoded CBFA2T3-GLIS2 fusion protein defines an aggressive subtype of pediatric acute megakaryoblastic leukemia.
To define the mutation spectrum in non-Down syndrome acute megakaryoblastic leukemia (non-DS-AMKL), we performed transcriptome sequencing on diagnostic blasts from 14 pediatric patients and validated our findings in a recurrency/validation cohort consisting of 34 pediatric and 28 adult AMKL samples. Our analysis identified a cryptic chromosome 16 inversion (inv(16)(p13.3q24.3)) in 27% of pediatric cases, which encodes a CBFA2T3-GLIS2 fusion protein. Expression of CBFA2T3-GLIS2 in Drosophila and murine hematopoietic cells induced bone morphogenic protein (BMP) signaling and resulted in a marked increase in the self-renewal capacity of hematopoietic progenitors. These data suggest that expression of CBFA2T3-GLIS2 directly contributes to leukemogenesis. Copyright © 2012 Elsevier Inc. All rights reserved.
Following completion of the Human Genome Project, most studies of human genetic variation have centered on single nucleotide polymorphisms (SNPs). SNPs are numerous in individual genomes and serve as useful genetic markers in association studies across a population. These markers have been leveraged to identify genetic loci for disease risk and draw associations with numerous traits of interest. Despite their usefulness, SNPs do not tell the whole story. For example, most SNPs are associated with only a small increased risk of disease, and they usually cannot identify on their own which genes are causal. This has resulted in what many researchers have referred to as missing or hidden heritability.
Until recently, most population-scale genome sequencing studies have focused on identifying single nucleotide variants (SNVs) to explore genetic differences between individuals. Like so many SNV-based genome-wide association studies, however, these efforts have had difficulty identifying causative genetic mechanisms underlying most complex functions. More and more, the genomics community has realised that structural variation is likely responsible for many of the traits and phenotypes that scientists have not been able to attribute to SNVs. This class of variants, defined as genetic differences of 50 bp or larger, accounts for most of the DNA sequence differences between any two people. Structural variants (SVs) are also already known to cause many common and rare diseases including ALS, schizophrenia, leukemia, Carney complex, and Huntington’s disease. Despite the importance of SVs, these larger variants have been understudied and underreported compared to their single-nucleotide counterparts. One reason is that they remain difficult to detect. Their length often means they cannot be fully spanned using short sequencing reads. They also often occur in highly repetitive or GC-rich regions of the genome, making them challenging targets. As such, this class of human genetic variation has remained vastly under-explored in global populations and is now ripe for discovery.
Several unrelated lineages such as plastids, viruses and plasmids, have converged on quadripartite genomes of similar size with large and small single copy regions and a large inverted repeat (IR). Except for Erodium (Geraniaceae), saguaro cactus and some legumes, the plastomes of all photosynthetic angiosperms display this structure. The functional significance of the IR is not understood and Erodium provides a system to examine the role of the IR in the long-term stability of these genomes. We compared the degree of genomic rearrangement in plastomes of Erodium that differ in the presence and absence of the IR.We sequenced 17 new Erodium plastomes. Using 454, Illumina, PacBio and Sanger sequences, 16 genomes were assembled and categorized along with one incomplete and two previously published Erodium plastomes. We conducted phylogenetic analyses among these species using a dataset of 19 protein-coding genes and determined if significantly higher evolutionary rates had caused the long branch seen previously in phylogenetic reconstructions within the genus. Bioinformatic comparisons were also performed to evaluate plastome evolution across the genus.Erodium plastomes fell into four types (Type 1-4) that differ in their substitution rates, short dispersed repeat content and degree of genomic rearrangement, gene and intron content and GC content. Type 4 plastomes had significantly higher rates of synonymous substitutions (dS) for all genes and for 14 of the 19 genes non-synonymous substitutions (dN) were significantly accelerated. We evaluated the evidence for a single IR loss in Erodium and in doing so discovered that Type 4 plastomes contain a novel IR.The presence or absence of the IR does not affect plastome stability in Erodium. Rather, the overall repeat content shows a negative correlation with genome stability, a pattern in agreement with other angiosperm groups and recent findings on genome stability in bacterial endosymbionts.© The Author 2016. Published by Oxford University Press on behalf of the Annals of Botany Company. All rights reserved. For Permissions, please email: email@example.com.
Structural variation detection using next-generation sequencing data: A comparative technical review.
Structural variations (SVs) are mutations in the genome of size at least fifty nucleotides. They contribute to the phenotypic differences among healthy individuals, cause severe diseases and even cancers by breaking or linking genes. Thus, it is crucial to systematically profile SVs in the genome. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed due to the significant cost reduction of NGS experiments and their ability to unbiasedly detect SVs to the base-pair resolution. These SV detection methods vary in both sensitivity and specificity, since they use different SV-property-dependent and library-property-dependent features. As a result, predictions from different SV callers are often inconsistent. Besides, the noises in the data (both platform-specific sequencing error and artificial chimeric reads) impede the specificity of SV detection. Poorly characterized regions in the human genome (e.g., repeat regions) greatly impact the reads mapping and in turn affect the SV calling accuracy. Calling of complex SVs requires specialized SV callers. Apart from accuracy, processing speed of SV caller is another factor deciding its usability. Knowing the pros and cons of different SV calling techniques and the objectives of the biological study are essential for biologists and bioinformaticians to make informed decisions. This paper describes different components in the SV calling pipeline and reviews the techniques used by existing SV callers. Through simulation study, we also demonstrate that library properties, especially insert size, greatly impact the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies. Copyright © 2016 Elsevier Inc. All rights reserved.
Complete circular genome sequence of successful ST8/SCCmecIV community-associated methicillin-resistant Staphylococcus aureus (OC8) in Russia: one-megabase genomic inversion, IS256’s spread, and evolution of Russia ST8-IV.
ST8/SCCmecIV community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) has been a common threat, with large USA300 epidemics in the United States. The global geographical structure of ST8/SCCmecIV has not yet been fully elucidated. We herein determined the complete circular genome sequence of ST8/SCCmecIVc strain OC8 from Siberian Russia. We found that 36.0% of the genome was inverted relative to USA300. Two IS256, oppositely oriented, at IS256-enriched hot spots were implicated with the one-megabase genomic inversion (MbIN) and vSaß split. The behavior of IS256 was flexible: its insertion site (att) sequences on the genome and junction sequences of extrachromosomal circular DNA were all divergent, albeit with fixed sizes. A similar multi-IS256 system was detected, even in prevalent ST239 healthcare-associated MRSA in Russia, suggesting IS256’s strong transmission potential and advantage in evolution. Regarding epidemiology, all ST8/SCCmecIVc strains from European, Siberian, and Far Eastern Russia, examined had MbIN, and geographical expansion accompanied divergent spa types and resistance to fluoroquinolones, chloramphenicol, and often rifampicin. Russia ST8/SCCmecIVc has been associated with life-threatening infections such as pneumonia and sepsis in both community and hospital settings. Regarding virulence, the OC8 genome carried a series of toxin and immune evasion genes, a truncated giant surface protein gene, and IS256 insertion adjacent to a pan-regulatory gene. These results suggest that unique single ST8/spa1(t008)/SCCmecIVc CA-MRSA (clade, Russia ST8-IVc) emerged in Russia, and this was followed by large geographical expansion, with MbIN as an epidemiological marker, and fluoroquinolone resistance, multiple virulence factors, and possibly a multi-IS256 system as selective advantages.