September 22, 2019  |  

Approaches for surveying cosmic radiation damage in large populations of Arabidopsis thaliana seeds-Antarctic balloons and particle beams.

The Cosmic Ray Exposure Sequencing Science (CRESS) payload system is a proof of concept experiment to assess the genomic impact of space radiation on seeds. CRESS was designed as a secondary payload for the December 2016 high-altitude, high-latitude, and long-duration balloon flight carrying the Boron And Carbon Cosmic Rays in the Upper Stratosphere (BACCUS) experimental hardware. Investigation of the biological effects of Galactic Cosmic Radiation (GCR), particularly those of ions with High-Z and Energy (HZE), is of interest due to the genomic damage this type of radiation inflicts. The biological effects of upper-stratospheric mixed radiation above Antarctica (ANT) were sampled using Arabidopsis thaliana seeds and were compared to those resulting from a controlled simulation of GCR at Brookhaven National Laboratory (BNL) and to laboratory control seed. The payload developed for Antarctica exposure was broadly designed to 1U CubeSat specifications (10cmx10cmx10cm, =1.33kg), maintained 1 atm internal pressure, and carried an internal cargo of four seed trays (about 580,000 seeds) and twelve CR-39 Solid-State Nuclear Track Detectors (SSNTDs). The irradiated seeds were recovered, sterilized and grown on Petri plates for phenotypic screening. BNL and ANT M0 seeds showed significantly reduced germination rates and elevated somatic mutation rates when compared to non-irradiated controls, with the BNL mutation rate also being significantly higher than that of ANT. Genomic DNA from mutants of interest was evaluated with whole-genome sequencing using PacBio SMRT technology. Sequence data revealed the presence of an array of genome structural variants in the genomes of M0 and M1 mutant plants.


September 21, 2019  |  

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.


September 21, 2019  |  

Phased diploid genome assembly with single-molecule real-time sequencing.

While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.


July 19, 2019  |  

Error correction and assembly complexity of single molecule sequencing reads.

Third generation single molecule sequencing technology is poised to revolutionize genomics by en- abling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.


July 19, 2019  |  

Long-read, whole-genome shotgun sequence data for five model organisms.

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.


July 19, 2019  |  

The power of Single Molecule Real-Time sequencing technology in the de novo assembly of a eukaryotic genome.

Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of the genomes still remains unassembled. We reconstructed azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. The SMRT-based assembly produced 100 times longer contigs with 100 times smaller amount of gaps compared to the SGS-based assemblies. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies where thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on the high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though still needed support of SGS data, achieved a near-complete assembly of a eukaryotic genome.


July 19, 2019  |  

Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms.

Resequencing or reference-based assemblies reveal large parts of the small-scale sequence variation. However, they typically fail to separate such local variation into colinear and rearranged variation, because they usually do not recover the complement of large-scale rearrangements, including transpositions and inversions. Besides the availability of hundreds of genomes of diverse Arabidopsis thaliana accessions, there is so far only one full-length assembled genome: the reference sequence. We have assembled 117 Mb of the A. thaliana Landsberg erecta (Ler) genome into five chromosome-equivalent sequences using a combination of short Illumina reads, long PacBio reads, and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inversions comprising ~3.6 Mb, in addition to 4.1 Mb of nonreference sequence, mostly originating from duplications. Although rearranged regions are not different in local divergence from colinear regions, they are drastically depleted for meiotic recombination in heterozygotes. Using a 1.2-Mb inversion as an example, we show that such rearrangement-mediated reduction of meiotic recombination can lead to genetically isolated haplotypes in the worldwide population of A. thaliana Moreover, we found 105 single-copy genes, which were only present in the reference sequence or the Ler assembly, and 334 single-copy orthologs, which showed an additional copy in only one of the genomes. To our knowledge, this work gives first insights into the degree and type of variation, which will be revealed once complete assemblies will replace resequencing or other reference-dependent methods.


July 19, 2019  |  

The impact of third generation genomic technologies on plant genome assembly.

Since the introduction of next generation sequencing, plant genome assembly projects do not need to rely on dedicated research facilities or community-wide consortia anymore, even individual research groups can sequence and assemble the genomes they are interested in. However, such assemblies are typically not based on the entire breadth of genomic technologies including genetic and physical maps and their contiguities tend to be low compared to the full-length gold standard reference sequences. Recently emerging third generation genomic technologies like long-read sequencing or optical mapping promise to bridge this quality gap and enable simple and cost-effective solutions for chromosomal-level assemblies.


July 19, 2019  |  

Deletion-bias in DNA double-strand break repair differentially contributes to plant genome shrinkage.

In order to prevent genome instability, cells need to be protected by a number of repair mechanisms, including DNA double-strand break (DSB) repair. The extent to which DSB repair, biased towards deletions or insertions, contributes to evolutionary diversification of genome size is still under debate. We analyzed mutation spectra in Arabidopsis thaliana and in barley (Hordeum vulgare) by PacBio sequencing of three DSB-targeted loci each, uncovering repair via gene conversion, single strand annealing (SSA) or nonhomologous end-joining (NHEJ). Furthermore, phylogenomic comparisons between A. thaliana and two related species were used to detect naturally occurring deletions during Arabidopsis evolution. Arabidopsis thaliana revealed significantly more and larger deletions after DSB repair than barley, and barley displayed more and larger insertions. Arabidopsis displayed a clear net loss of DNA after DSB repair, mainly via SSA and NHEJ. Barley revealed a very weak net loss of DNA, apparently due to less active break-end resection and easier copying of template sequences into breaks. Comparative phylogenomics revealed several footprints of SSA in the A. thaliana genome. Quantitative assessment of DNA gain and loss through DSB repair processes suggests deletion-biased DSB repair causing ongoing genome shrinking in A. thaliana, whereas genome size in barley remains nearly constant.© 2017 The Authors. New Phytologist © 2017 New Phytologist Trust.


July 19, 2019  |  

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes. Published by Cold Spring Harbor Laboratory Press.


July 19, 2019  |  

Re-sequencing transgenic plants revealed rearrangements at T-DNA inserts, and integration of a short T-DNA fragment, but no increase of small mutations elsewhere.

Transformation resulted in deletions and translocations at T-DNA inserts, but not in genome-wide small mutations. A tiny T-DNA splinter was detected that probably would remain undetected by conventional techniques. We investigated to which extent Agrobacterium tumefaciens-mediated transformation is mutagenic, on top of inserting T-DNA. To prevent mutations due to in vitro propagation, we applied floral dip transformation of Arabidopsis thaliana. We re-sequenced the genomes of five primary transformants, and compared these to genomic sequences derived from a pool of four wild-type plants. By genome-wide comparisons, we identified ten small mutations in the genomes of the five transgenic plants, not correlated to the positions or number of T-DNA inserts. This mutation frequency is within the range of spontaneous mutations occurring during seed propagation in A. thaliana, as determined earlier. In addition, we detected small as well as large deletions specifically at the T-DNA insert sites. Furthermore, we detected partial T-DNA inserts, one of these a tiny 50-bp fragment originating from a central part of the T-DNA construct used, inserted into the plant genome without flanking other T-DNA. Because of its small size, we named this fragment a T-DNA splinter. As far as we know this is the first report of such a small T-DNA fragment insert in absence of any T-DNA border sequence. Finally, we found evidence for translocations from other chromosomes, flanking T-DNA inserts. In this study, we showed that next-generation sequencing (NGS) is a highly sensitive approach to detect T-DNA inserts in transgenic plants.


July 7, 2019  |  

Jitterbug: somatic and germline transposon insertion detection at single-nucleotide resolution.

Transposable elements are major players in genome evolution. Transposon insertion polymorphisms can translate into phenotypic differences in plants and animals and are linked to different diseases including human cancer, making their characterization highly relevant to the study of genome evolution and genetic diseases. Here we present Jitterbug, a novel tool that identifies transposable element insertion sites at single-nucleotide resolution based on the pairedend mapping and clipped-read signatures produced by NGS alignments. Jitterbug can be easily integrated into existing NGS analysis pipelines, using the standard BAM format produced by frequently applied alignment tools (e.g. bwa, bowtie2), with no need to realign reads to a set of consensus transposon sequences. Jitterbug is highly sensitive and able to recall transposon insertions with a very high specificity, as demonstrated by benchmarks in the human and Arabidopsis genomes, and validation using long PacBio reads. In addition, Jitterbug estimates the zygosity of transposon insertions with high accuracy and can also identify somatic insertions. We demonstrate that Jitterbug can identify mosaic somatic transposon movement using sequenced tumor-normal sample pairs and allows for estimating the cancer cell fraction of clones containing a somatic TE insertion. We suggest that the independent methods we use to evaluate performance are a step towards creating a gold standard dataset for benchmarking structural variant prediction tools.


July 7, 2019  |  

Dissecting a hidden gene duplication: the Arabidopsis thaliana SEC10 locus.

Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.


July 7, 2019  |  

NOVOPlasty: de novo assembly of organelle genomes from whole genome data.

The evolution in next-generation sequencing (NGS) technology has led to the development of many different assembly algorithms, but few of them focus on assembling the organelle genomes. These genomes are used in phylogenetic studies, food identification and are the most deposited eukaryotic genomes in GenBank. Producing organelle genome assembly from whole genome sequencing (WGS) data would be the most accurate and least laborious approach, but a tool specifically designed for this task is lacking. We developed a seed-and-extend algorithm that assembles organelle genomes from whole genome sequencing (WGS) data, starting from a related or distant single seed sequence. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina data sets where it outperforms known assemblers in assembly accuracy and coverage. In our benchmark, NOVOPlasty assembled all tested circular genomes in less than 30 min with a maximum memory requirement of 16 GB and an accuracy over 99.99%. In conclusion, NOVOPlasty is the sole de novo assembler that provides a fast and straightforward extraction of the extranuclear genomes from WGS data in one circular high quality contig. The software is open source and can be downloaded at https://github.com/ndierckx/NOVOPlasty.


July 7, 2019  |  

Cytosine methylation at CpCpG sites triggers accumulation of non-CpG methylation in gene bodies.

Methylation of cytosine is an epigenetic mark involved in the regulation of transcription, usually associated with transcriptional repression. In mammals, methylated cytosines are found predominantly in CpGs but in plants non-CpG methylation (in the CpHpG or CpHpH contexts, where H is A, C or T) is also present and is associated with the transcriptional silencing of transposable elements. In addition, CpG methylation is found in coding regions of active genes. In the absence of the demethylase of lysine 9 of histone 3 (IBM1), a subset of body-methylated genes acquires non-CpG methylation. This was shown to alter their expression and affect plant development. It is not clear why only certain body-methylated genes gain non-CpG methylation in the absence of IBM1 and others do not. Here we describe a link between CpG methylation and the establishment of methylation in the CpHpG context that explains the two classes of body-methylated genes. We provide evidence that external cytosines of CpCpG sites can only be methylated when internal cytosines are methylated. CpCpG sites methylated in both cytosines promote spreading of methylation in the CpHpG context in genes protected by IBM1. In contrast, CpCpG sites remain unmethylated in IBM1-independent genes and do not promote spread of CpHpG methylation.© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.


Talk with an expert

If you have a question, need to check the status of an order, or are interested in purchasing an instrument, we're here to help.