July 19, 2019

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes. Published by Cold Spring Harbor Laboratory Press.
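The contig NG50 figure quoted above can be illustrated with a short sketch. It assumes the usual definition: the largest length L such that contigs of length L or longer together cover at least half of an estimated genome size. The function name `ng50` and the toy lengths are hypothetical and are not taken from Canu or from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of a set of contigs, or 0 if it is not reached.

    NG50 is taken here as the length of the contig at which the running
    total of contig lengths (largest first) first reaches half of the
    estimated genome size. Unlike N50, the denominator is the genome
    size rather than the total assembly size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    # The assembly covers less than half of the estimated genome.
    return 0


# Toy example with made-up contig lengths and genome size.
toy_contigs = [40, 30, 20, 10]
toy_ng50 = ng50(toy_contigs, genome_size=100)
```

Because the denominator is the estimated genome size, an assembly that is missing a large part of the genome gets a lower NG50 than its N50 would suggest.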


July 19, 2019

Gorilla MHC class I gene and sequence variation in a comparative context.

Comparisons of MHC gene content and diversity among closely related species can provide insights into the evolutionary mechanisms shaping immune system variation. After chimpanzees and bonobos, gorillas are humans’ closest living relatives; but in contrast, relatively little is known about the structure and variation of gorilla MHC class I genes (Gogo). Here, we combined long-range amplifications and long-read sequencing technology to analyze full-length MHC class I genes in 35 gorillas. We obtained 50 full-length genomic sequences corresponding to 15 Gogo-A alleles, 4 Gogo-Oko alleles, 21 Gogo-B alleles, and 10 Gogo-C alleles, including 19 novel coding region sequences. We identified two previously undetected MHC class I genes related to Gogo-A and Gogo-B, respectively, thereby illustrating the potential of this approach for efficient and highly accurate MHC genotyping. Consistent with their phylogenetic position within the hominid family, individual gorilla MHC haplotypes share characteristics with humans and chimpanzees as well as orangutans, suggesting a complex history of the MHC class I genes in humans and the great apes. However, the overall MHC class I diversity appears to be low, further supporting the hypothesis that gorillas might have experienced a reduction of their MHC repertoire.


July 19, 2019

A golden goat genome

The newly described de novo goat genome sequence is the most contiguous diploid vertebrate assembly generated thus far using whole-genome assembly and scaffolding methods. The contiguity of this assembly is approaching that of the finished human and mouse genomes and suggests an affordable roadmap to high-quality references for thousands of species.


July 19, 2019

DNA target recognition domains in the Type I restriction and modification systems of Staphylococcus aureus.

Staphylococcus aureus displays a clonal population structure in which horizontal gene transfer between different lineages is extremely rare. This is due, in part, to the presence of a Type I DNA restriction–modification (RM) system given the generic name of Sau1, which maintains different patterns of methylation on specific target sequences on the genomes of different lineages. We have determined the target sequences recognized by the Sau1 Type I RM systems present in a wide range of the most prevalent S. aureus lineages and assigned the sequences recognized to particular target recognition domains within the RM enzymes. We used a range of biochemical assays on purified enzymes and single molecule real-time sequencing on genomic DNA to determine these target sequences and their patterns of methylation. Knowledge of the main target sequences for Sau1 will facilitate the synthesis of new vectors for transformation of the most prevalent lineages of this ‘untransformable’ bacterium.


July 19, 2019

Genomic structure of the horse major histocompatibility complex class II region resolved using PacBio long-read sequencing technology.

The mammalian Major Histocompatibility Complex (MHC) region contains several gene families characterized by highly polymorphic loci with extensive nucleotide diversity, copy number variation of paralogous genes, and long repetitive sequences. This structural complexity has made it difficult to construct a reliable reference sequence of the horse MHC region. In this study, we used long-read single molecule, real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) to sequence eight Bacterial Artificial Chromosome (BAC) clones spanning the horse MHC class II region. The final assembly resulted in a 1,165,328 bp continuous, gap-free sequence with 35 manually curated genomic loci of which 23 were considered to be functional and 12 to be pseudogenes. In comparison to the MHC class II region in other mammals, the corresponding region in horse shows extraordinary copy number variation and different relative location and directionality of the Eqca-DRB, -DQA, -DQB and -DOB loci. This is the first long-read sequence assembly of the horse MHC class II region with rigorous manual gene annotation, and it will serve as an important resource for association studies of immune-mediated equine diseases and for evolutionary analysis of genetic diversity in this region.


July 19, 2019

Genomic analyses of primitive, wild and cultivated citrus provide insights into asexual reproduction.

The emergence of apomixis (the transition from sexual to asexual reproduction) is a prominent feature of modern citrus. Here we de novo sequenced and comprehensively studied the genomes of four representative citrus species. Additionally, we sequenced 100 accessions of primitive, wild and cultivated citrus. Comparative population analysis suggested that genomic regions harboring energy- and reproduction-associated genes are probably under selection in cultivated citrus. We also narrowed the genetic locus responsible for citrus polyembryony, a form of apomixis, to an 80-kb region containing 11 candidate genes. One of these, CitRWP, is expressed at higher levels in ovules of polyembryonic cultivars. We found a miniature inverted-repeat transposable element insertion in the promoter region of CitRWP that cosegregated with polyembryony. This study provides new insights into citrus apomixis and constitutes a promising resource for the mining of agriculturally important genes.


July 19, 2019

Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster.

Highly repetitive satellite DNA (satDNA) repeats are found in most eukaryotic genomes. SatDNAs are rapidly evolving and have roles in genome stability and chromosome segregation. Their repetitive nature poses a challenge for genome assembly and makes progress on the detailed study of satDNA structure difficult. Here, we use single-molecule sequencing long reads from Pacific Biosciences (PacBio) to determine the detailed structure of all major autosomal complex satDNA loci in Drosophila melanogaster, with a particular focus on the 260-bp and Responder satellites. We determine the optimal de novo assembly methods and parameter combinations required to produce a high-quality assembly of these previously unassembled satDNA loci and validate this assembly using molecular and computational approaches. We determined that the computationally intensive PBcR-BLASR assembly pipeline yielded better assemblies than the faster and more efficient pipelines based on the MHAP hashing algorithm, and it is essential to validate assemblies of repetitive loci. The assemblies reveal that satDNA repeats are organized into large arrays interrupted by transposable elements. The repeats in the center of the array tend to be homogenized in sequence, suggesting that gene conversion and unequal crossovers lead to repeat homogenization through concerted evolution, although the degree of unequal crossing over may differ among complex satellite loci. We find evidence for higher-order structure within satDNA arrays that suggests recent structural rearrangements. These assemblies provide a platform for the evolutionary and functional genomics of satDNAs in pericentric heterochromatin. © 2017 Khost et al.; Published by Cold Spring Harbor Laboratory Press.


July 19, 2019

New advances in sequence assembly

Extract: It may be hard to believe, but the idea of sequence assembly is around 40 years old. Consider this pair of quotes from Rodger Staden (Staden 1979): “With modern fast sequencing techniques and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps.” “If the 5′ end of the sequence from one gel reading is the same as the 3′ end of the sequence from another the data is said to overlap. If the overlap is of sufficient length to distinguish it from being a repeat in the sequence the two sequences must be contiguous. The data from the two gel readings can then be joined to form one longer continuous sequence.” Replace “gel reading” with “read” and these sentences would go unnoticed in the introduction of any paper today. Here you can also see the birth of jargon that now pervades the field: overlaps between reads form contigs (contiguous sequences). Just a few months later, Gingeras et al. (1979) described “Computer programs for the assembly of DNA sequences.” It all sounds so modern, until the discussion mentions FORTRAN code stored on magnetic tapes. How, then, can we fill an entire special issue of Genome Research with “new advances” so many years later? To me, this reflects the beauty of the problem: simple enough to be stated in a single paragraph, yet complex enough to sustain a field of research for decades. This dichotomy is common to many famous computational problems; indeed, mathematical formulations of sequence assembly fall into a class of problems known as “NP-hard” that do not admit an easy solution (Medvedev et al. 2007). There is another reason for continued advances in sequence assembly: advances in sequencing technology. As evident from the Staden quotes above, the first assembly methods were …
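The overlap-and-join idea in the Staden quotes can be sketched in a few lines. This is a minimal illustration only: it assumes exact, error-free matches between the end of one read and the start of the next, which real assemblers do not require. The function names, the `min_len` threshold, and the toy reads are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Only overlaps of at least `min_len` bases count; shorter matches are
    treated as too likely to be chance or repeat-induced. Returns 0 when
    no sufficiently long overlap exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge_reads(a, b, min_len=3):
    """Join two overlapping reads into one longer sequence, or return None."""
    n = suffix_prefix_overlap(a, b, min_len)
    if n == 0:
        return None
    # Keep all of `a`, then append the part of `b` beyond the overlap.
    return a + b[n:]


# Toy reads sharing the 4-base overlap "TGCA".
contig = merge_reads("ACGTTGCA", "TGCAGGT")
```

The `min_len` threshold plays the role of Staden's "sufficient length": raising it lowers the risk of joining reads through a short repeat, at the cost of missing genuine short overlaps.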


July 19, 2019

A new chicken genome assembly provides insight into avian genome structure.

The importance of the Gallus gallus (chicken) as a model organism and agricultural animal merits a continuation of sequence assembly improvement efforts. We present a new version of the chicken genome assembly (Gallus_gallus-5.0; GCA_000002315.3), built from combined long single molecule sequencing technology, finished BACs, and improved physical maps. In overall assembled bases, we see a gain of 183 Mb, including 16.4 Mb in placed chromosomes with a corresponding gain in the percentage of intact repeat elements characterized. Of the 1.21 Gb genome, we include three previously missing autosomes, GGA30, 31, and 33, and improve sequence contig length 10-fold over the previous Gallus_gallus-4.0. Despite the significant base representation improvements made, 138 Mb of sequence is not yet located to chromosomes. When annotated for gene content, Gallus_gallus-5.0 shows an increase of 4679 annotated genes (2768 noncoding and 1911 protein-coding) over those in Gallus_gallus-4.0. We also revisited the question of what genes are missing in the avian lineage, as assessed by the highest quality avian genome assembly to date, and found that a large fraction of the original set of missing genes are still absent in sequenced bird species. Finally, our new data support a detailed map of MHC-B, encompassing two segments: one with a highly stable gene copy number and another in which the gene copy number is highly variable. The chicken model has been a critical resource for many other fields of study, and this new reference assembly will substantially further these efforts. Copyright © 2017 Warren et al.


July 19, 2019

The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution.

The domesticated sunflower, Helianthus annuus L., is a global oil crop that has promise for climate change adaptation, because it can maintain stable yields across a wide variety of environmental conditions, including drought. Even greater resilience is achievable through the mining of resistance alleles from compatible wild sunflower relatives, including numerous extremophile species. Here we report a high-quality reference for the sunflower genome (3.6 gigabases), together with extensive transcriptomic data from vegetative and floral organs. The genome mostly consists of highly similar, related sequences and required single-molecule real-time sequencing technologies for successful assembly. Genome analyses enabled the reconstruction of the evolutionary history of the Asterids, further establishing the existence of a whole-genome triplication at the base of the Asterids II clade and a sunflower-specific whole-genome duplication around 29 million years ago. An integrative approach combining quantitative genetics, expression and diversity data permitted development of comprehensive gene networks for two major breeding traits, flowering time and oil metabolism, and revealed new candidate genes in these networks. We found that the genomic architecture of flowering time has been shaped by the most recent whole-genome duplication, which suggests that ancient paralogues can remain in the same regulatory networks for dozens of millions of years. This genome represents a cornerstone for future research programs aiming to exploit genetic diversity to improve biotic and abiotic stress resistance and oil production, while also considering agricultural constraints and human nutritional needs.


July 19, 2019

Long-read sequencing uncovers the adaptive topography of a carnivorous plant genome.

Utricularia gibba, the humped bladderwort, is a carnivorous plant that retains a tiny nuclear genome despite at least two rounds of whole genome duplication (WGD) since common ancestry with grapevine and other species. We used a third-generation genome assembly with several complete chromosomes to reconstruct the two most recent lineage-specific ancestral genomes that led to the modern U. gibba genome structure. Patterns of subgenome dominance in the most recent WGD, both architectural and transcriptional, are suggestive of allopolyploidization, which may have generated genomic novelty and led to instantaneous speciation. Syntenic duplicates retained in polyploid blocks are enriched for transcription factor functions, whereas gene copies derived from ongoing tandem duplication events are enriched in metabolic functions potentially important for a carnivorous plant. Among these are tandem arrays of cysteine protease genes with trap-specific expression that evolved within a protein family known to be useful in the digestion of animal prey. Further enriched functions among tandem duplicates (also with trap-enhanced expression) include peptide transport (intercellular movement of broken-down prey proteins), ATPase activities (bladder-trap acidification and transmembrane nutrient transport), hydrolase and chitinase activities (breakdown of prey polysaccharides), and cell-wall dynamic components possibly associated with active bladder movements. Whereas independently polyploid Arabidopsis syntenic gene duplicates are similarly enriched for transcriptional regulatory activities, Arabidopsis tandems are distinct from those of U. gibba, while still metabolic and likely reflecting unique adaptations of that species. Taken together, these findings highlight the special importance of tandem duplications in the adaptive landscapes of a carnivorous plant genome.


July 19, 2019

An integrated strategy combining DNA walking and NGS to detect GMOs.

Recently, we developed a DNA walking system for the detection and characterization of a broad spectrum of GMOs in routine analysis of food/feed matrices. Here, we present a new version with improved throughput and sensitivity by coupling the DNA walking system to Pacific Biosciences® next-generation sequencing technology. The performance of the new strategy was thoroughly assessed through several assays. First, we tested its detection and identification capability on grains with high or low GMO content. Second, the potential impacts of food processing were investigated using rice noodle samples. Finally, GMO mixtures and a real-life sample were analyzed to illustrate the applicability of the proposed strategy in routine GMO analysis. In all tested samples, the presence of multiple GMOs was unambiguously proven by the characterization of transgene flanking regions and the combinations of elements that are typical for transgene constructs. Copyright © 2017 The Authors. Published by Elsevier Ltd. All rights reserved.


July 19, 2019

Widespread adenine N6-methylation of active genes in fungi.

N6-methyldeoxyadenine (6mA) is a noncanonical DNA base modification present at low levels in plant and animal genomes, but its prevalence and association with genome function in other eukaryotic lineages remain poorly understood. Here we report that abundant 6mA is associated with transcriptionally active genes in early-diverging fungal lineages. Using single-molecule long-read sequencing of 16 diverse fungal genomes, we observed that up to 2.8% of all adenines were methylated in early-diverging fungi, far exceeding levels observed in other eukaryotes and more derived fungi. 6mA occurred symmetrically at ApT dinucleotides and was concentrated in dense methylated adenine clusters surrounding the transcriptional start sites of expressed genes; its distribution was inversely correlated with that of 5-methylcytosine. Our results show a striking contrast in the genomic distributions of 6mA and 5-methylcytosine and reinforce a distinct role for 6mA as a gene-expression-associated epigenomic mark in eukaryotes.


July 19, 2019

TAL effector driven induction of a SWEET gene confers susceptibility to bacterial blight of cotton.

Transcription activator-like (TAL) effectors from Xanthomonas citri subsp. malvacearum (Xcm) are essential for bacterial blight of cotton (BBC). Here, by combining transcriptome profiling with TAL effector-binding element (EBE) prediction, we show that GhSWEET10, encoding a functional sucrose transporter, is induced by Avrb6, a TAL effector determining Xcm pathogenicity. Activation of GhSWEET10 by designer TAL effectors (dTALEs) restores virulence of Xcm avrb6 deletion strains, whereas silencing of GhSWEET10 compromises cotton susceptibility to infections. A BBC-resistant line carrying an unknown recessive b6 gene bears the same EBE as the susceptible line, but Avrb6-mediated induction of GhSWEET10 is reduced, suggesting a unique mechanism underlying b6-mediated resistance. We show via an extensive survey of GhSWEET transcriptional responsiveness to different Xcm field isolates that additional GhSWEETs may also be involved in BBC. These findings advance our understanding of the disease and resistance in cotton and may facilitate the development of cotton with improved resistance to BBC.

