In this PacBio User Group Meeting presentation, Mitchell Vollger of the University of Washington used HiFi reads from SMRT Sequencing to study segmental duplications in the human genome. The technique…
In this LabRoots webinar, Jonas Korlach the CSO of PacBio provides an introduction to PacBio HiFi sequence reads, which are both long (up to 25 kb currently) and accurate (>99%)…
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA ) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
Despite modern sequencing efforts, the difficulty in assembly of highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the successful utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.
Obtaining high-quality sequence continuity of complex regions of recent segmental duplication remains one of the major challenges of finishing genome assemblies. In the human and mouse genomes, this was achieved by targeting large-insert clones using costly and laborious capillary-based sequencing approaches. Sanger shotgun sequencing of clone inserts, however, has now been largely abandoned, leaving most of these regions unresolved in newer genome assemblies generated primarily by next-generation sequencing hybrid approaches. Here we show that it is possible to resolve regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology from Pacific Biosciences (PacBio). We sequenced and assembled BAC clones corresponding to a 1.3-Mbp complex region of chromosome 17q21.31, demonstrating 99.994% identity to Sanger assemblies of the same clones. We targeted 44 differences using Illumina sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases. Finally, we targeted a poorly assembled 766-kbp duplicated region of the chimpanzee genome and resolved the structure and organization for a fraction of the cost and time of traditional finishing approaches. Our data suggest a straightforward path for upgrading genomes to a higher quality finished state.
Palindromic GOLGA8 core duplicons promote chromosome 15q13.3 microdeletion and evolutionary instability.
Recurrent deletions of chromosome 15q13.3 associate with intellectual disability, schizophrenia, autism and epilepsy. To gain insight into the instability of this region, we sequenced it in affected individuals, normal individuals and nonhuman primates. We discovered five structural configurations of the human chromosome 15q13.3 region ranging in size from 2 to 3 Mb. These configurations arose recently (~0.5-0.9 million years ago) as a result of human-specific expansions of segmental duplications and two independent inversion events. All inversion breakpoints map near GOLGA8 core duplicons-a ~14-kb primate-specific chromosome 15 repeat that became organized into larger palindromic structures. GOLGA8-flanked palindromes also demarcate the breakpoints of recurrent 15q13.3 microdeletions, the expansion of chromosome 15 segmental duplications in the human lineage and independent structural changes in apes. The significant clustering (P = 0.002) of breakpoints provides mechanistic evidence for the role of this core duplicon and its palindromic architecture in promoting the evolutionary and disease-related instability of chromosome 15.
PacBio-LITS: a large-insert targeted sequencing method for characterization of human disease-associated chromosomal structural variations.
Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high.We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki-Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants.The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and small insertion and deletion validation could also benefit from this technology.
Deciphering the genetic basis of human disease requires a comprehensive knowledge of genetic variants irrespective of their class or frequency. Although an impressive number of human genetic variants have been catalogued, a large fraction of the genetic difference that distinguishes two human genomes is still not understood at the base-pair level. This is because the emphasis has been on single-nucleotide variation as opposed to less tractable and more complex genetic variants, including indels and structural variants. The latter, we propose, will have a large impact on human phenotypes but require a more systematic assessment of genomes at deeper coverage and alternate sequencing and mapping technologies. Copyright © 2016 by the Genetics Society of America.
Genetic differences that specify unique aspects of human evolution have typically been identified by comparative analyses between the genomes of humans and closely related primates, including more recently the genomes of archaic hominins. Not all regions of the genome, however, are equally amenable to such study. Recurrent copy number variation (CNV) at chromosome 16p11.2 accounts for approximately 1% of cases of autism and is mediated by a complex set of segmental duplications, many of which arose recently during human evolution. Here we reconstruct the evolutionary history of the locus and identify bolA family member 2 (BOLA2) as a gene duplicated exclusively in Homo sapiens. We estimate that a 95-kilobase-pair segment containing BOLA2 duplicated across the critical region approximately 282 thousand years ago (ka), one of the latest among a series of genomic changes that dramatically restructured the locus during hominid evolution. All humans examined carried one or more copies of the duplication, which nearly fixed early in the human lineage-a pattern unlikely to have arisen so rapidly in the absence of selection (P?0.0097). We show that the duplication of BOLA2 led to a novel, human-specific in-frame fusion transcript and that BOLA2 copy number correlates with both RNA expression (r?=?0.36) and protein level (r?=?0.65), with the greatest expression difference between human and chimpanzee in experimentally derived stem cells. Analyses of 152 patients carrying a chromosome 16p11.2 rearrangement show that more than 96% of breakpoints occur within the H. sapiens-specific duplication. In summary, the duplicative transposition of BOLA2 at the root of the H. sapiens lineage about 282?ka simultaneously increased copy number of a gene associated with iron homeostasis and predisposed our species to recurrent rearrangements associated with disease.
Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63.
Asian cultivated rice consists of two subspecies: Oryza sativa subsp. indica and O. sativa subsp. japonica Despite the fact that indica rice accounts for over 70% of total rice production worldwide and is genetically much more diverse, a high-quality reference genome for indica rice has yet to be published. We conducted map-based sequencing of two indica rice lines, Zhenshan 97 (ZS97) and Minghui 63 (MH63), which represent the two major varietal groups of the indica subspecies and are the parents of an elite Chinese hybrid. The genome sequences were assembled into 237 (ZS97) and 181 (MH63) contigs, with an accuracy >99.99%, and covered 90.6% and 93.2% of their estimated genome sizes. Comparative analyses of these two indica genomes uncovered surprising structural differences, especially with respect to inversions, translocations, presence/absence variations, and segmental duplications. Approximately 42% of nontransposable element related genes were identical between the two genomes. Transcriptome analysis of three tissues showed that 1,059-2,217 more genes were expressed in the hybrid than in the parents and that the expressed genes in the hybrid were much more diverse due to their divergence between the parental genomes. The public availability of two high-quality reference genomes for the indica subspecies of rice will have large-ranging implications for plant biology and crop genetic improvement.
Short tandem repeats, segmental duplications, gene deletion, and genomic instability in a rapidly diversified immune gene family.
Genomic regions with repetitive sequences are considered unstable and prone to swift DNA diversification processes. A highly diverse immune gene family of the sea urchin (Strongylocentrotus purpuratus), called Sp185/333, is composed of clustered genes with similar sequence as well as several types of repeats ranging in size from short tandem repeats (STRs) to large segmental duplications. This repetitive structure may have been the basis for the incorrect assembly of this gene family in the sea urchin genome sequence. Consequently, we have resolved the structure of the family and profiled the members by sequencing selected BAC clones using Illumina and PacBio approaches.BAC insert assemblies identified 15 predicted genes that are organized into three clusters. Two of the gene clusters have almost identical flanking regions, suggesting that they may be non-matching allelic clusters residing at the same genomic locus. GA STRs surround all genes and appear in large stretches at locations of putatively deleted genes. GAT STRs are positioned at the edges of segmental duplications that include a subset of the genes. The unique locations of the STRs suggest their involvement in gene deletions and segmental duplications. Genomic profiling of the Sp185/333 gene diversity in 10 sea urchins shows that no gene repertoires are shared among individuals indicating a very high gene diversification rate for this family.The repetitive genomic structure of the Sp185/333 family that includes STRs in strategic locations may serve as platform for a controlled mechanism which regulates the processes of gene recombination, gene conversion, duplication and deletion. The outcome is genomic instability and allelic mismatches, which may further drive the swift diversification of the Sp185/333 gene family that may improve the immune fitness of the species.
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite gene clusters.
The ascomycete fungus Colletotrichum higginsianum causes anthracnose disease of brassica crops and the model plant Arabidopsis thaliana. Previous versions of the genome sequence were highly fragmented, causing errors in the prediction of protein-coding genes and preventing the analysis of repetitive sequences and genome architecture. Here, we re-sequenced the genome using single-molecule real-time (SMRT) sequencing technology and, in combination with optical map data, this provided a gapless assembly of all twelve chromosomes except for the ribosomal DNA repeat cluster on chromosome 7. The more accurate gene annotation made possible by this new assembly revealed a large repertoire of secondary metabolism (SM) key genes (89) and putative biosynthetic pathways (77 SM gene clusters). The two mini-chromosomes differed from the ten core chromosomes in being repeat- and AT-rich and gene-poor but were significantly enriched with genes encoding putative secreted effector proteins. Transposable elements (TEs) were found to occupy 7% of the genome by length. Certain TE families showed a statistically significant association with effector genes and SM cluster genes and were transcriptionally active at particular stages of fungal development. All 24 subtelomeres were found to contain one of three highly-conserved repeat elements which, by providing sites for homologous recombination, were probably instrumental in four segmental duplications.The gapless genome of C. higginsianum provides access to repeat-rich regions that were previously poorly assembled, notably the mini-chromosomes and subtelomeres, and allowed prediction of the complete SM gene repertoire. It also provides insights into the potential role of TEs in gene and genome evolution and host adaptation in this asexual pathogen.
NK-lysin is an antimicrobial peptide and effector protein in the host innate immune system. It is coded by a single gene in humans and most other mammalian species. In this study, we provide evidence for the existence of four NK-lysin genes in a repetitive region on cattle chromosome 11. The NK2A, NK2B, and NK2C genes are tandemly arrayed as three copies in ~30-35-kb segments, located 41.8 kb upstream of NK1. All four genes are functional, albeit with differential tissue expression. NK1, NK2A, and NK2B exhibited the highest expression in intestine Peyer’s patch, whereas NK2C was expressed almost exclusively in lung. The four peptide products were synthesized ex vivo, and their antimicrobial effects against both Gram-positive and Gram-negative bacteria were confirmed with a bacteria-killing assay. Transmission electron microcopy indicated that bovine NK-lysins exhibited their antimicrobial activities by lytic action in the cell membranes. In summary, the single NK-lysin gene in other mammals has expanded to a four-member gene family by tandem duplications in cattle; all four genes are transcribed, and the synthetic peptides corresponding to the core regions are biologically active and likely contribute to innate immunity in ruminants.
Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.
Segmental duplications contribute to human evolution, adaptation and genomic instability but are often poorly characterized. We investigate the evolution, genetic variation and coding potential of human-specific segmental duplications (HSDs). We identify 218 HSDs based on analysis of 322 deeply sequenced archaic and contemporary hominid genomes. We sequence 550 human and nonhuman primate genomic clones to reconstruct the evolution of the largest, most complex regions with protein-coding potential (N?=?80 genes from 33 gene families). We show that HSDs are non-randomly organized, associate preferentially with ancestral ape duplications termed ‘core duplicons’ and evolved primarily in an interspersed inverted orientation. In addition to Homo sapiens-specific gene expansions (such as TCAF1/TCAF2), we highlight ten gene families (for example, ARHGAP11B and SRGAP2C) where copy number never returns to the ancestral state, there is evidence of mRNA splicing and no common gene-disruptive mutations are observed in the general population. Such duplicates are candidates for the evolution of human-specific adaptive traits.