Genetic studies of human evolution require high-quality contiguous ape genome assemblies that are not guided by the human reference. We coupled long-read sequence assembly and full-length complementary DNA sequencing with a multiplatform scaffolding approach to produce ab initio chimpanzee and orangutan genome assemblies. By comparing these with two long-read de novo human genome assemblies and a gorilla genome assembly, we characterized lineage-specific and shared great ape genetic variation ranging from single- to mega-base pair-sized variants. We identified ~17,000 fixed human-specific structural variants, pinpointing genic and putative regulatory changes that have emerged in humans since divergence from nonhuman apes. Interestingly, these…
Long-read sequencing technologies enable high-quality, contiguous genome assemblies. Here we used SMRT sequencing to assemble the genome of a Drosophila simulans strain originating from Madagascar, the ancestral range of the species. We generated 8 Gb of raw data (~50x coverage) with a mean read length of 6,410 bp, an NR50 of 9,125 bp and a longest subread of 49 kb. We benchmarked six different assemblers and merged the best two assemblies, from Canu and Falcon. Our final assembly was 127.41 Mb with an N50 of 5.38 Mb and 305 contigs. We anchored more than 4 Mb of novel sequence to…
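The N50 quoted above is the standard contiguity statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch of that definition follows, assuming contig lengths are already available as a list of integers. The function name `n50` and the toy lengths are illustrative only and are not taken from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() requires at least one length")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length

    # Unreachable for non-empty input, kept for type checkers.
    raise RuntimeError("N50 could not be determined")


# Toy example (hypothetical lengths, not data from the assembly above).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

In the toy example the total is 54 bases, and the three longest contigs (10, 9, 8) reach the halfway mark of 27, so the N50 is 8. Applied to a real assembly, the lengths would typically be read from a FASTA index rather than typed in by hand.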
In this study, we used the amplified isoform sequencing technique from Pacific Biosciences to characterize the poly(A)(+) fraction of the lytic transcriptome of the herpes simplex virus type 1 (HSV-1). Our analysis detected 34 formerly unidentified protein-coding genes, 10 non-coding RNAs, as well as 17 polycistronic and complex transcripts. This work also led us to identify many transcript isoforms, including 13 splice and 68 transcript end variants, as well as several transcript overlaps. Additionally, we determined previously unascertained transcriptional start and polyadenylation sites. We analyzed the transcriptional activity from the complementary DNA strand in five convergent HSV gene pairs with…
The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultracontiguous genome assemblies. To compare these genomes, we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate…
Fanconi anemia (FA) is a rare disorder characterized by congenital malformations, progressive bone marrow failure, and predisposition to cancer. Patients harboring X-linked FANCB pathogenic variants usually present with severe congenital malformations resembling VACTERL syndrome with hydrocephalus. We employed the diepoxybutane (DEB) test for FA diagnosis, arrayCGH for detection of duplication, targeted capture and next-gen sequencing for defining the duplication breakpoint, PacBio sequencing of full-length FANCB aberrant transcript, FANCD2 ubiquitination and foci formation assays for the evaluation of FANCB protein function by viral transduction of FANCB-null cells with lentiviral FANCB WT and mutant expression constructs, and droplet digital PCR for quantitation of…
High on the Heidelberg hills, inside the Advanced Training Centre of the European Molecular Biology Laboratory (EMBL) campus with its unique double-helix staircase, scientists gathered for the EMBL conference “Mammalian Genetics and Genomics: From Molecular Mechanisms to Translational Applications,” organized in cooperation with the International Mammalian Genome Society (IMGS) and the Mouse Molecular Genetics (MMG) group. The conference attracted 205 participants from 30 countries, representing 6 of the 7 continents (all except Antarctica). It was a richly diverse group of geneticists, clinicians, and bioinformaticians, with presentations by established and junior investigators, including many trainees. From the 24th to the 27th of October 2017, they…
Despite the importance of duplicate genes for evolutionary adaptation, accurate gene annotation is often incomplete, incorrect, or lacking in regions of segmental duplication. We developed an approach combining long-read sequencing and hybridization capture to yield full-length transcript information and confidently distinguish between nearly identical genes/paralogs. We used biotinylated probes to enrich for full-length cDNA from duplicated regions, which were then amplified, size-fractionated, and sequenced using single-molecule, long-read sequencing technology, permitting us to distinguish between highly identical genes by virtue of multiple paralogous sequence variants. We examined 19 gene families as expressed in developing and adult human brain, selected for their…
A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe – or ‘annotate’ – genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists – from clinicians to evolutionary biologists – need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets.…
Haplotypes are fundamental to fully characterize the diploid genome of an individual, yet methods to directly chart the unique genetic makeup of each parental chromosome are lacking. Here we introduce single-cell DNA template strand sequencing (Strand-seq) as a novel approach to phasing diploid genomes along the entire length of all chromosomes. We demonstrate this by building a complete haplotype for a HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. By use of this approach, we mapped all meiotic recombination events in a family trio with high resolution (median range ~14 kb) and phased…
Single-molecule, real-time sequencing developed by Pacific Biosciences offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well suited for unsolved problems in genome, transcriptome, and epigenetics research. The highly contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its…
Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing…
While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific de novo genes and report a mechanism for the origination of functional de novo proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly originated de novo proteins is also not beyond anticipation under neutral expectation, as they generally have a longer theoretical lifespan than their current age,…
Single-cell sequencing provides information that is not confounded by genotypic or phenotypic heterogeneity of bulk samples. Sequencing of one molecular type (RNA, methylated DNA or open chromatin) in a single cell, furthermore, provides insights into the cell’s phenotype and links to its genotype. Nevertheless, only by taking measurements of these phenotypes and genotypes from the same single cells can such inferences be made unambiguously. In this review, we survey the first experimental approaches that assay, in parallel, multiple molecular types from the same single cell, before considering the challenges and opportunities afforded by these and future technologies.
In recent years long-read technologies have moved from being a niche and specialist field to a point of relative maturity likely to feature frequently in the genomic landscape. Analogous to next generation sequencing, the cost of sequencing using long-read technologies has materially dropped whilst the instrument throughput continues to increase. Together these changes present the prospect of sequencing large numbers of individuals with the aim of fully characterizing genomes at high resolution. In this article, we will endeavour to present an introduction to long-read technologies showing: what long reads are; how they are distinct from short reads; why long reads…
To investigate chromosome evolution in fish species, we newly mapped 181 markers that allowed us to construct a yellowtail (Seriola quinqueradiata) radiation hybrid (RH) physical map with 1,713 DNA markers, which was far denser than a previous map, and we anchored the de novo assembled sequences onto the RH physical map. Finally, we mapped a total of 13,977 expressed sequence tags (ESTs) on a genome sequence assembly aligned with the physical map. Using the high-density physical map and anchored genome sequences, we accurately compared the yellowtail genome structure with the genome structures of five model fishes to identify characteristics of the yellowtail…