Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome using long-read sequencing
Sequence-based estimation of the genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging because a complete genome assembly has been lacking. The extreme AT-richness of its genome (~80.6% (A+T)) and the lack of technology to assemble highly polymorphic sub-telomeric regions, which contain clonally variant multigene virulence families (i.e. var and rifin), have confounded attempts using short-read NGS technologies. Using single-molecule, real-time (SMRT) sequencing, we assembled all 14 nuclear chromosomes of the P. falciparum genome from telomere to telomere, each in a single contig. Specifically, amplification-free sequencing generated reads with an average length of 12 kb, with ≥50% of the reads between 15.5 and 50 kb in length. The hierarchical genome assembly process (HGAP) was used to assemble the P. falciparum genome de novo. This assembly accurately resolved centromeres (~90-99% (A+T)) and sub-telomeric regions, and identified large insertions and duplications that added extra genes to the var and rifin virulence families, along with smaller structural variants such as homopolymer tract expansions. These regions can be used as markers of genetic diversity in comparative genome analyses. Moreover, characterizing the polymorphic and repetitive sub-telomeric sequences of parasite populations from endemic areas might clarify the link between structural variation and phenotypes such as virulence, drug resistance and disease transmission.
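
The two summary statistics quoted above, the (A+T) fraction of a sequence and the share of reads whose length falls within a given window, can be illustrated with a minimal Python sketch. This is not the authors' pipeline; the function names `at_fraction` and `fraction_in_range` and the toy inputs are assumptions introduced purely for illustration.

```python
from typing import Iterable


def at_fraction(seq: str) -> float:
    """Return the (A+T) fraction among unambiguous bases (A, C, G, T).

    Hypothetical helper: ambiguous symbols such as N are ignored.
    """
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    total = at + gc
    return at / total if total else 0.0


def fraction_in_range(lengths: Iterable[int], lo: int, hi: int) -> float:
    """Return the fraction of read lengths n with lo <= n <= hi (in bases)."""
    values = list(lengths)
    if not values:
        return 0.0
    return sum(lo <= n <= hi for n in values) / len(values)


if __name__ == "__main__":
    # Toy inputs only; these are not data from the study.
    print(at_fraction("ATATGC"))
    print(fraction_in_range([10_000, 20_000, 30_000, 60_000], 15_500, 50_000))
```

Applied per chromosome or per sliding window, the first function would give the kind of (A+T) profile used to contrast centromeres with the genome-wide average; the second corresponds to the read-length summary given for the SMRT data.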