April 21, 2020

Chromosome-length haplotigs for yak and cattle from trio binning assembly of an F1 hybrid

Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods.

Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs.

Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates.

Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
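The read-classification step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy sequences, and the simple majority rule are assumptions made for illustration, and it ignores reverse complements, sequencing errors, and the normalization of hit counts that a production tool would need.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    maternal = set().union(*(kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(kmers(r, k) for r in paternal_reads))
    # k-mers shared by both parents carry no haplotype information, so drop them.
    return maternal - paternal, paternal - maternal


def classify_read(read, maternal_only, paternal_only, k):
    """Assign an offspring long read to a parental bin by counting unique k-mer hits.

    Hypothetical rule for this sketch: the bin with more hits wins;
    ties (including zero hits) stay unassigned.
    """
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"


if __name__ == "__main__":
    # Toy data: the two parental reads differ at a single base.
    mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
    for read in ["CGTACGT", "CGTTCGT", "ACGT"]:
        print(read, classify_read(read, mat_only, pat_only, k=4))
```

The more the parental genomes differ, the more parent-specific k-mers each long read contains, which is why the high heterozygosity of an interspecies hybrid makes binning easier rather than harder.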


April 21, 2020

First near complete haplotype phased genome assembly of River buffalo (Bubalus bubalis)

This study reports the first haplotype-phased, reference-quality genome assembly of ‘Murrah’, an Indian breed of river buffalo. A mother-father-progeny trio was used for sequencing so that the individual haplotypes could be assembled in the progeny. Parental DNA samples were sequenced on the Illumina platform to generate a total of 274 Gb of paired-end data. The progeny DNA sample was sequenced using PacBio long reads and 10x Genomics linked reads at 166x coverage, along with 802 Gb of optical mapping data. The trio-binning-based FALCON assembly of each haplotype was scaffolded with 10x Genomics reads and superscaffolded with BioNano maps to build reference-quality assemblies of the sire and dam haplotypes of 2.63 Gb and 2.64 Gb, with just 59 and 64 scaffolds and N50 of 81.98 Mb and 83.23 Mb, respectively. BUSCO single-copy core gene set coverage was >91.25%, and gVolante-CEGMA completeness was >96.14% for both haplotypes. Finally, RaGOO was used to order and build the chromosome-level assembly with 25 scaffolds and N50 of 117.48 Mb (sire haplotype) and 118.51 Mb (dam haplotype). The improved haplotype-phased genome assembly of river buffalo may provide valuable resources to discover molecular mechanisms related to milk production and reproduction traits.
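For readers unfamiliar with the N50 statistic quoted here, the following Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative assumptions; the study's own values come from its assembly tools, not from this code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths (arbitrary units), not data from the study.
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 with fewer scaffolds, as reported after the RaGOO step, indicates that more of the genome sits in long, uninterrupted sequences.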


April 21, 2020

Variant Phasing and Haplotypic Expression from Single-molecule Long-read Sequencing in Maize

Haplotype phasing of genetic variants is important for interpretation of the maize genome, population genetic analysis, and functional genomic analysis of allelic activity. Accordingly, accurate methods for phasing full-length isoforms are essential for functional genomics study. In this study, we performed an isoform-level phasing study in maize, using two inbred lines and their reciprocal crosses, based on single-molecule full-length cDNA sequencing. To phase and analyze full-length transcripts between hybrids and parents, we developed a tool called IsoPhase. Using this tool, we validated the majority of SNPs called against matching short read data and identified cases of allele-specific, gene-level, and isoform-level expression. Our results revealed that maize parental and hybrid lines exhibit different splicing activities. After phasing 6,847 genes in two reciprocal hybrids using embryo, endosperm and root tissues, we annotated the SNPs and identified large-effect genes. In addition, based on single-molecule sequencing, we identified parent-of-origin isoforms in maize hybrids, different novel isoforms between maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase power and accuracy in studies of allelic expression.


April 21, 2020

Extended haplotype phasing of de novo genome assemblies with FALCON-Phase

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as use of tissues that contain single-haplotype (haploid) genomes, or by co-sequencing of parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.


April 21, 2020

Haplotype-phased genome assembly of virulent Phytophthora ramorum isolate ND886 facilitated by long-read sequencing reveals effector polymorphisms and copy number variation.

Phytophthora ramorum is a destructive pathogen that causes Sudden Oak Death. The genome sequence of P. ramorum isolate Pr102 was previously produced using Sanger reads, and contained 12 Mb of gaps. However, isolate Pr102 had shown reduced aggressiveness and genome abnormalities. In order to produce an improved genome assembly for P. ramorum, we performed long read sequencing of highly aggressive P. ramorum isolate CDFA1418886 (abbreviated as ND886). We generated a 60.5 Mb assembly of the ND886 genome using the Pacific Biosciences sequencing platform. The assembly includes 302 primary contigs (60.2 Mb) and 9 unplaced contigs (265 kb). Additionally, we found a “Highly repetitive” component from the PacBio unassembled unmapped reads containing tandem repeats that are not part of the 60.5 Mb genome. The overall repeat content in the primary assembly was much higher than the Pr102 Sanger version (48% vs. 29%), indicating that the long reads have captured repetitive regions effectively. The 302 primary contigs were phased into 345 haplotype blocks and 222,892 phased variants, of which the longest phased block was 1,513,201 bp with 7,265 phased variants. The improved phased assembly facilitated identification of 21 and 25 Crinkler effectors and 393 and 394 RXLR effector genes from two haplotypes. Of these, 24 and 25 RXLR effectors were newly predicted from Haplotype A and Haplotype B, respectively. In addition, 7 new paralogs of effector Avh207 were found in contig 54, not reported earlier. Comparison of the ND886 assembly with Pr102 V1 assembly suggests that several repeat-rich smaller scaffolds within the Pr102 V1 assembly were possibly misassembled; these regions are fully encompassed now in ND886 contigs. Our analysis further reveals that Pr102 is a heterokaryon with multiple nuclear types in the sequences corresponding to contig 10 of ND886 assembly.


April 21, 2020

The population genetics of structural variants in grapevine domestication.

Structural variants (SVs) are a largely unexplored feature of plant genomes. Little is known about the type and size of SVs, their distribution among individuals and, especially, their population dynamics. Understanding these dynamics is critical for understanding both the contributions of SVs to phenotypes and the likelihood of identifying them as causal genetic variants in genome-wide associations. Here, we identify SVs and study their evolutionary genomics in clonally propagated grapevine cultivars and their outcrossing wild progenitors. To catalogue SVs, we assembled the highly heterozygous Chardonnay genome, for which one in seven genes is hemizygous based on SVs. Using an integrative comparison between Chardonnay and Cabernet Sauvignon genomes by whole-genome, long-read and short-read alignment, we extended SV detection to population samples. We found that strong purifying selection acts against SVs but particularly against inversion and translocation events. SVs nonetheless accrue as recessive heterozygotes in clonally propagated lineages. They also define outlier regions of genomic divergence between wild and cultivated grapevines, suggesting roles in domestication. Outlier regions include the sex-determination region and the berry colour locus, where independent large, complex inversions have driven convergent phenotypic evolution.


April 21, 2020

Next generation sequencing characterizes HLA diversity in a registry population from the Netherlands.

Next generation DNA sequencing is used to determine the HLA-A, -B, -C, -DRB1, -DRB3/4/5, and -DQB1 assignments of 1009 unrelated volunteers for the unrelated donor registry in The Netherlands. The analysis characterizes all HLA exons and introns for class I alleles; at least exons 2 to 3 for HLA-DRB1; and exons 2 to 6 for HLA-DQB1. Of the distinct alleles present, there are 229 class I and 71 class II; 36 of these alleles are novel. The majority (approximately 98%) of the cumulative allele frequency at each locus is contributed by alleles that appear three or more times. Alleles encoding protein variation outside of the antigen recognition domains are 0.6% of the class I assignments and 5.3% of the class II assignments. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.


April 21, 2020

Patterns of non-ARD variation in more than 300 full-length HLA-DPB1 alleles.

Our understanding of sequence variation in the HLA-DPB1 gene is largely restricted to the hypervariable antigen recognition domain (ARD) encoded by exon 2. Here, we employed a redundant sequencing strategy combining long-read and short-read data to accurately phase and characterise in full length the majority of common and well-documented (CWD) DPB1 alleles as well as alleles with an observed frequency of at least 0.0006% in our predominantly European sample set. We generated 664 DPB1 sequences, comprising 279 distinct allelic variants. This allows us to present the, to date, most comprehensive analysis of the nature and extent of DPB1 sequence variation. The full-length sequence analysis revealed the existence of two highly diverged allele clades. These clades correlate with the rs9277534 A→G variant, a known expression marker located in the 3′-UTR. The two clades are fully differentiated by 174 fixed polymorphisms throughout a 3.6 kb stretch at the 3′-end of DPB1. The region upstream of this differentiation zone is characterised by increasingly shared variation between the clades. The low-expression A clade comprises 59% of the distinct allelic sequences including the three by far most frequent DPB1 alleles, DPB1*04:01, DPB1*02:01 and DPB1*04:02. Alleles in the A clade show reduced nucleotide diversity with an excess of rare variants when compared to the high-expression G clade. This pattern is consistent with a scenario of recent proliferation of A-clade alleles. The full-length characterisation of all but the most rare DPB1 alleles will benefit the application of NGS for DPB1 genotyping and provides a helpful framework for a deeper understanding of high- and low-expression alleles and their implications in the context of unrelated haematopoietic stem-cell transplantation.

Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.


April 21, 2020

The CF Canada-Sick Kids Program in individual CF therapy: A resource for the advancement of personalized medicine in CF.

Therapies targeting certain CFTR mutants have been approved, yet variations in clinical response highlight the need for in-vitro and genetic tools that predict patient-specific clinical outcomes. Toward this goal, the CF Canada-Sick Kids Program in Individual CF Therapy (CFIT) is generating a “first of its kind”, comprehensive resource containing patient-specific cell cultures and data from 100 CF individuals that will enable modeling of therapeutic responses.

The CFIT program is generating: 1) nasal cells from drug naïve patients suitable for culture and the study of drug responses in vitro, 2) matched gene expression data obtained by sequencing the RNA from the primary nasal tissue, 3) whole genome sequencing of blood derived DNA from each of the 100 participants, 4) induced pluripotent stem cells (iPSCs) generated from each participant’s blood sample, 5) CRISPR-edited isogenic control iPSC lines and 6) prospective clinical data from patients treated with CF modulators.

To date, we have recruited 57 of 100 individuals to CFIT, most of whom are homozygous for F508del (to assess in-vitro: in-vivo correlations with respect to ORKAMBI response) or heterozygous for F508del and a minimal function mutation. In addition, several donors are homozygous for rare nonsense and missense mutations. Nasal epithelial cell cultures and matched iPSC lines are available for many of these donors.

This accessible resource will enable development of tools that predict individual outcomes to current and emerging modulators targeting F508del-CFTR and facilitate therapy discovery for rare CF causing mutations.

Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.


April 21, 2020

Non-coding variability at the APOE locus contributes to the Alzheimer’s risk.

Alzheimer’s disease (AD) is a leading cause of mortality in the elderly. While the coding change of APOE-e4 is a key risk factor for late-onset AD and has been believed to be the only risk factor in the APOE locus, it does not fully explain the risk effect conferred by the locus. Here, we report the identification of AD causal variants in PVRL2 and APOC1 regions in proximity to APOE and define common risk haplotypes independent of APOE-e4 coding change. These risk haplotypes are associated with changes of AD-related endophenotypes including cognitive performance, and altered expression of APOE and its nearby genes in the human brain and blood. High-throughput genome-wide chromosome conformation capture analysis further supports the roles of these risk haplotypes in modulating chromatin states and gene expression in the brain. Our findings provide compelling evidence for additional risk factors in the APOE locus that contribute to AD pathogenesis.


April 21, 2020

Long-Read Sequencing Emerging in Medical Genetics

The wide implementation of next-generation sequencing (NGS) technologies has revolutionized the field of medical genetics. However, the short read lengths of currently used sequencing approaches pose a limitation for identification of structural variants, sequencing repetitive regions, phasing alleles and distinguishing highly homologous genomic regions. These limitations may significantly contribute to the diagnostic gap in patients with genetic disorders who have undergone standard NGS, like whole exome or even genome sequencing. Now, the emerging long-read sequencing (LRS) technologies may offer improvements in the characterization of genetic variation and regions that are difficult to assess with the currently prevailing NGS approaches. LRS has so far mainly been used to investigate genetic disorders with previously known or strongly suspected disease loci. While these targeted approaches already show the potential of LRS, it remains to be seen whether LRS technologies can soon enable true whole genome sequencing routinely. Ultimately, this could allow the de novo assembly of individual whole genomes used as a generic test for genetic disorders. In this article, we summarize the current LRS-based research on human genetic disorders and discuss the potential of these technologies to facilitate the next major advancements in medical genetics.


April 21, 2020

Chromosome-level assembly of the water buffalo genome surpasses human and goat genomes in sequence contiguity.

Rapid innovation in sequencing technologies and improvement in assembly algorithms have enabled the creation of highly contiguous mammalian genomes. Here we report a chromosome-level assembly of the water buffalo (Bubalus bubalis) genome using single-molecule sequencing and chromatin conformation capture data. PacBio Sequel reads, with a mean length of 11.5 kb, helped to resolve repetitive elements and generate sequence contiguity. All five B. bubalis sub-metacentric chromosomes were correctly scaffolded with centromeres spanned. Although the index animal was partly inbred, 58% of the genome was haplotype-phased by FALCON-Unzip. This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps. It surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard to assemble gene clusters such as the major histocompatibility complex (MHC).


September 22, 2019

Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics.

Short read massive parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio’s single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we capsulize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing.

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.

