We present high-quality, phased genome assemblies representative of taurine and indicine cattle, subspecies that differ markedly in productivity-related traits and environmental adaptation. We report a new haplotype-aware scaffolding and polishing pipeline using contigs generated by the trio binning method to produce haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle breeds. These assemblies were used to identify structural and copy number variants that differentiate the subspecies, and we found that variant detection was sensitive to the specific reference genome chosen. Six gene families with immune-related functions are expanded in the indicine lineage. Assembly of the genomes of both subspecies from a single individual enabled transcripts to be phased to detect allele-specific expression, and to study genome-wide selective sweeps. An indicus-specific extra copy of fatty acid desaturase is under positive selection and may contribute to indicine adaptation to heat and drought.
Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods. Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs. Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates. Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
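The read-classification step of trio binning can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy sequences, and the simple majority rule are assumptions for illustration. Production tools also handle reverse complements, filter k-mers by sequencing-error frequency, and normalize counts by the size of each parental k-mer set, all of which are omitted here.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    # k-mers shared by both parents carry no haplotype information.
    return mat - pat, pat - mat


def classify_read(read, mat_only, pat_only, k):
    """Assign an offspring long read to a parental bin by counting
    parent-specific k-mers (hypothetical simple majority rule)."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"


# Toy example: the parental reads differ at a single base.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
labels = [classify_read(r, mat_only, pat_only, k=4)
          for r in ["GTACGT", "CGTTCG", "ACGT"]]
```

The more the parental haplotypes differ, the more parent-specific k-mers each long read contains, which is why high heterozygosity helps this approach rather than hindering it.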
Haplotype-phased genome assembly of virulent Phytophthora ramorum isolate ND886 facilitated by long-read sequencing reveals effector polymorphisms and copy number variation.
Phytophthora ramorum is a destructive pathogen that causes Sudden Oak Death. The genome sequence of P. ramorum isolate Pr102 was previously produced using Sanger reads, and contained 12 Mb of gaps. However, isolate Pr102 had shown reduced aggressiveness and genome abnormalities. To produce an improved genome assembly for P. ramorum, we performed long-read sequencing of the highly aggressive P. ramorum isolate CDFA1418886 (abbreviated as ND886). We generated a 60.5 Mb assembly of the ND886 genome using the Pacific Biosciences (PacBio) sequencing platform. The assembly includes 302 primary contigs (60.2 Mb) and 9 unplaced contigs (265 kb). Additionally, we found a "highly repetitive" component among the unassembled, unmapped PacBio reads, containing tandem repeats that are not part of the 60.5 Mb genome. The overall repeat content in the primary assembly was much higher than in the Pr102 Sanger version (48% vs. 29%), indicating that the long reads have captured repetitive regions effectively. The 302 primary contigs were phased into 345 haplotype blocks containing 222,892 phased variants; the longest phased block was 1,513,201 bp with 7,265 phased variants. The improved phased assembly facilitated identification of 21 and 25 Crinkler effectors and 393 and 394 RXLR effector genes from the two haplotypes. Of these, 24 and 25 RXLR effectors were newly predicted from Haplotype A and Haplotype B, respectively. In addition, 7 previously unreported paralogs of effector Avh207 were found in contig 54. Comparison of the ND886 assembly with the Pr102 V1 assembly suggests that several repeat-rich smaller scaffolds within the Pr102 V1 assembly were possibly misassembled; these regions are now fully encompassed in ND886 contigs. Our analysis further reveals that Pr102 is a heterokaryon with multiple nuclear types in the sequences corresponding to contig 10 of the ND886 assembly.
Platanus-allee is a de novo haplotype assembler enabling a comprehensive access to divergent heterozygous regions.
The ultimate goal for diploid genome determination is to completely decode homologous chromosomes independently, and several programs that phase consensus sequences have been developed. These methods work well for genomes with low heterozygosity, but many species have high heterozygosity. Additionally, there are highly divergent regions (HDRs), where the haplotype sequences differ considerably. Because HDRs are likely to direct various interesting biological phenomena, many genomic analysis targets fall within these regions. However, they cannot be accessed by existing phasing methods, and costly traditional methods have to be adopted instead. Here, we develop a de novo haplotype assembler, Platanus-allee (http://platanus.bio.titech.ac.jp/platanus2), which initially constructs each haplotype sequence and then untangles the assembly graphs utilizing sequence links and synteny information. A comprehensive benchmark analysis reveals that Platanus-allee exhibits high recall and precision, particularly for HDRs. Using this approach, previously unknown HDRs are detected in the human genome, which may uncover novel aspects of genome variability.
Plant genome size varies by four orders of magnitude, and most of this variation stems from dynamic changes in repetitive DNA content. Here we report the small 109 Mb genome of Selaginella lepidophylla, a clubmoss with extreme desiccation tolerance. Single-molecule sequencing enables accurate haplotype assembly of a single heterozygous S. lepidophylla plant, revealing extensive structural variation. We observe numerous haplotype-specific deletions consisting of largely repetitive and heavily methylated sequences, with enrichment in young Gypsy LTR retrotransposons. Such elements are active but rapidly deleted, suggesting “bloat and purge” to maintain a small genome size. Unlike all other land plant lineages, Selaginella has no evidence of a whole-genome duplication event in its evolutionary history, but instead shows unique tandem gene duplication patterns reflecting adaptation to extreme drying. Gene expression changes during desiccation in S. lepidophylla mirror patterns observed across angiosperm resurrection plants.
Haplotype information is essential to the complete description and interpretation of genomes, genetic diversity and genetic ancestry. New technologies can provide Single Molecular Sequencing (SMS) data that cover about 90% of positions over chromosomes. However, SMS data have a higher error rate than the 1% error rate of short reads, which makes SNP calling and haplotype assembly from SMS reads very difficult. Most existing technologies do not work properly for SMS data. In this paper, we develop a progressive approach for SNP calling and haplotype assembly that works very well for SMS data. Our method can handle more than 200 million non-N bases on Chromosome 1 with millions of reads and more than 100 blocks, each of which contains more than 2 million bases and more than 3K SNP sites on average. Experimental results show that the false discovery rate and false negative rate for our method are 15.7 and 11.0% on NA12878, and 16.5 and 11.0% on NA24385. Moreover, the overall switch errors for our method are 7.26 and 5.21, with an average of 3378 and 5736 SNP sites per block, on NA12878 and NA24385, respectively. Here, we demonstrate that SMS reads alone can generate a high-quality solution for both SNP calling and haplotype assembly. Source code and results are available at https://github.com/guofeieileen/SMRT/wiki/Software.
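For readers unfamiliar with the evaluation metrics named above, the following minimal sketch shows one common way to compute them. The definitions and function names are assumptions for illustration; the abstract does not state the exact formulas or normalization the authors used, so the numbers it reports may follow a different convention.

```python
def count_switch_errors(truth, pred):
    """Count switch errors within one phased block.

    truth and pred are equal-length strings of '0'/'1' giving the allele
    placed on the first haplotype at each heterozygous SNP site. A switch
    is counted wherever agreement with the truth flips between adjacent
    sites, so a fully complemented block has zero switches.
    """
    if len(truth) != len(pred):
        raise ValueError("haplotypes must cover the same SNP sites")
    agree = [t == p for t, p in zip(truth, pred)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def snp_calling_rates(true_sites, called_sites):
    """Return (false discovery rate, false negative rate) for SNP calls,
    given sets of true and called SNP positions."""
    false_pos = len(called_sites - true_sites)
    false_neg = len(true_sites - called_sites)
    fdr = false_pos / len(called_sites) if called_sites else 0.0
    fnr = false_neg / len(true_sites) if true_sites else 0.0
    return fdr, fnr


# Toy example: one switch after the second site.
one_switch = count_switch_errors("00000", "00111")
# A single wrong site in an otherwise correct block counts as two switches.
isolated_flip = count_switch_errors("00000", "01000")
fdr, fnr = snp_calling_rates({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Under these definitions, the false discovery rate is the fraction of called SNPs that are not real, and the false negative rate is the fraction of real SNPs that were missed.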
The number of human genomes being genotyped or sequenced is increasing exponentially, and efficient haplotype estimation methods able to handle this amount of data are now required. Here, we present a new method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear scaling with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format on https://odelaneau.github.io/shapeit4/ and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.