New P. simium sequence data and reference genome
Twelve P. simium isolates from the São Paulo region (2013–2020) underwent Illumina sequencing. The average coverage of P. simium isolates (mapped to PvP01) in the core nuclear genome was 11.0-fold (range: 5.9- to 23.7-fold) and in the mitochondria was 24.2-fold (range: 1.5- to 42.8-fold) (Table S1). Three isolates underwent ONT sequencing (mean read length of 7322 bp), yielding an average of 116,840 high-quality reads, with mean coverage in the core nuclear genome of 20.5-fold (min: 17.1-fold; max: 24.3-fold) and in the mitochondria of 34.1-fold (Table S1). Highly variable regions (e.g. sub telomeric regions) had low coverage (Short reads: 6.5-fold; Long reads: 6.1-fold) (Figure S1).
Using a robust pipeline (Figure S2), combined filtered short (6,936,751; 71.5% of all merged) and long (312,352; 89.11% of all merged) reads for three high-quality isolates were assembled using a hybrid approach. This generated 3932 contigs, which were polished and gap-filled before undergoing scaffolding. Scaffolding links and orders contigs into larger scaffolds using additional information, such as paired-end or long-range data, to establish their relative orientation. While scaffolds may still contain gaps where the sequence is unknown, they better approximate the chromosome structure. This process, guided by the P. vivax PvP01_v1 reference genome (28.9 Mb; 24.2 Mb excluding unplaced contigs), resulted in improved contiguity and overall assembly quality. Scaffolds shorter than 500 bp were discarded before predicting annotations. The final reference consists of 105 scaffolds: 14 nuclear chromosomes, one mitochondrial sequence and 90 supporting contigs (size: 21.8 Mb; 21.5 Mb excluding unplaced contigs) (Table 1). Since the new P. simium assembly was generated using a chimeric approach combining three samples, we evaluated each sample’s contribution by identifying genomic regions with elevated coverage (> 2.5-fold) unique to a single sample, as well as reference alleles observed exclusively in one sample. In total, 140 regions showed greater coverage from one sample (ERR10301330: 127 regions; ERR10301323: seven regions; ERR10301333: six regions), indicating that ERR10301330 contributed the most to the assembly despite the use of three samples (Table S2). This conclusion was further supported by identifying ‘single-sample’ reference alleles, with 1473 such alleles observed overall (ERR10301330: 670; ERR10301323: 351; ERR10301333: 452 unique reference alleles) (Figure S3).
Comparison to the P. vivax genome and P. simium assembly
We first compared the new P. simium “Ps_SãoPaulo”, draft genome to an existing P. simium assembly that was previously created using short-read sequence data (GCA_913736665)8. The new draft genome has high synteny with the short-read P. simium assembly but has greater completeness (Figure S4). Furthermore, we also compared the P. simium draft genome to P. vivax PvP01. The chromosome sizes are on average 11% (standard deviation: 7%) shorter compared to P. vivax PvP01, due to a lack of coverage of sub-telomeric regions. P. simium chromosomes 1, 2, and 4 show notable differences in length and have high sub-telomeric content, with fractions of 23%, 32%, and 29%, respectively. The Ps_SãoPaulo draft genome has higher GC content (45.1%) compared to P. vivax PvP01 (39.6%), but after removing AT-rich sub-telomeric regions from PvP01, its GC content is similar (45.7%) (Figure S5). Pairwise whole genome alignment revealed conservation and differences between Ps_SãoPaulo and PvP01. On average, 98.1% of nucleotide positions were identical between the two aligned DNA sequences (sequence homology) across 14 chromosomes and 99.8% for the mitochondrion (Table S3). The synteny between P. vivax PvP01 and Ps_SãoPaulo scaffolds is high, with minor translocations and inversions present in the new assembly (Fig. 1A, Table S4). To assess the accuracy of these structural variants, we mapped long reads from all three contributing samples and evaluated support for each rearrangement. Specifically, we calculated the average read depth across the region, the average number of bases covered per read, and the average fraction of the region covered by reads (Table S4).
Genomic landscape of P. simum Ps_SãoPaulo (Ps SP) assembly. (A) Circos plot of P. vivax reference (left side; V01-V14) and P. simium São Paulo (Ps SP) (right side; S01-S14) assemblies with annotation. From the outer to the inner, the circular tracks represent pseudochromosomes with Mb scale, GC content (1–58 per 5 kb), gene density (0–5 per 5 kb), identified gaps, and synteny links. Each center line represents a pair of homologous sequences between two genomes. The size of 21.8 Mb of the genome is represented by a total of 3,247 homologous fragments of an average identity of 98.2%. Small translocations visible in the background of the circos plot; (B) A maximum-likelihood tree showing the relationship of human (N = 51) and non-human Plasmodium (N = 24) isolates in relation to P. simium São Paulo (Ps SP) (bold) using mitochondrial sequences. Clusters of species are as expected. Tree is rooted using the midpoint.
Regions with very low average coverage or minimal read support were considered poorly supported and potentially misassembled. For example, regions on Ps_SP_9 and the first region on Ps_SP_14 showed zero coverage across all three samples, while the inversion on Ps_SP_7 had minimal coverage (0.33-fold average depth; 3.7 bp average overlap; 3.2% covered). Therefore, three structural variants (Ps_SP_9, Ps_SP_14 [11145–11379], and Ps_SP_7) were classified as poorly supported and may represent mis-assemblies or unresolved rearrangements in the current scaffold set.
Furthermore, a comparison of the Ps_SãoPaulo mitochondrial genome (MtDNA; 5989 bp) to its PvP01 counterpart confirmed nucleotide differences at positions 4133 and 4467 in the mitochondrial genome (assembly PvP01) and revealed five previously unreported SNPs at positions 181, 2036, 2474, 4510, and 4640 bp. These SNPs are also present in the Peruvian Amazon (PvPAM) strain, confirming they are specific to South America and not exclusive to P. simium14. The Ps_SãoPaulo genome had 94.0% (N = 3424/3642) complete gene orthologues, with 3.3% being fragmented and 2.7% missing. The Companion software suite annotated 4910 genes, comprising 4852 protein-coding genes (including three mitochondrial genes) and 58 non-protein-coding RNA genes, such as tRNAs and rRNAs. In total, 5165 polypeptides were predicted (Figure S6). Each gene may encode multiple polypeptides due to alternative splicing, gene duplication or the presence of paralogous gene copies. Therefore, the total number of predicted polypeptides can exceed the number of unique “coding” genes, as one gene can give rise to several distinct polypeptide products. Orthologous mappings were extracted from P. simium polypeptides, and statistics were compared against protein-coding genes of P. vivax PvP01 (N = 6523). We identified 141 novel polypeptides not annotated with any P. vivax PvP01 orthologue protein-coding gene, although there were BLAST (nucleotide) hits with P. vivax PvP01 for fragments of sequences. Unresolved gaps or mis-assemblies in the scaffolds prevented the assignment of gene annotations in certain regions of the genome. Annotation of the remaining polypeptides (N = 5024) overlaps with 75.7% of the protein-coding P. vivax gene content. Unidentified protein-coding P. vivax genes (24.3%) were predominantly located in PvP01 contigs that are not considered as core genome.
Comparative analysis across Plasmodium species
We also investigated differences of P. simium to other Plasmodium species. Firstly, a maximum-likelihood tree, based on an alignment of mitochondrial nucleotide sequences, confirmed the new Ps_SãoPaulo reference (MtDNA) clusters as expected with other P. simium mitochondria sequences and P. vivax (Fig. 1B)11. Separately, using a multiple sequence alignment of 4998 P. vivax PvP01 proteins and P. simium Ps_SãoPaulo orthologue polypeptides, we compared their conservation and homology across species. Proteins with any uncertain amino acid predictions were excluded from further analysis. The Jaccard similarity index was used to quantify the proportion of shared elements (e.g., genes or features) between datasets, providing a measure of overlap or similarity. A total of 62% (3067/4970) had equal sequence size and an average Jaccard similarity of 0.99 (standard deviation (SD): 0.01). The remaining proteins (N = 1903) varied in size, but the average Jaccard similarity for pairwise comparison was 0.91 (SD: 0.16), which reflects high similarity and conservation between the gene content of P. simium and P. vivax.
The comparative analysis was extended to Plasmodium genes of broad interest, including those related to vertebrate host and vector invasion, and drug resistance (N = 134). The majority (85%) of these genes had an orthologous P. simium protein that allowed for comparative analysis (Table S5). No large indels were found in genes linked to drug resistance. We observed small indels in the merozoite surface protein MSP3.1 and MSP3.2, as well as the circumsporozoite protein (CSP), which are candidates for malaria vaccines. A few indels were detected in the Tryptophan Rich Antigens (TRAgs) multi-gene family, specifically TRAG12 and TRAG13, as well as TSR3. Lastly, Merozoite Adhesive Erythrocytic Binding-Like Protein (MAEBL), responsible for invading red blood cells, and CTRP, which is responsible for ookinete motility and mosquito midgut invasion, had a few small deletions. As expected, deletions were identified in RBP2a (1003 amino acids) and DBP (9 and 91 amino acids) (Figure S7). The larger 91-amino acid deletion in DBP is found only in the new P. simium reference (Ps_SãoPaulo). Meanwhile, the shorter 9-amino acid deletion in DBP appears to be fixed in P. vivax Sal I and P. cynomolgi, occurring between residues 684A and 694E (Figure S8).
New P. simium isolates in the context of South American P. vivax
To gain insight into the geographical context of P. simium in South America, we performed a population structure analysis of 38 isolates (12 new, 26 public8,10) with 354 high-quality P. vivax isolates (Table S6; Fig. 2A)15,16. The analysis used 118,573 bi-allelic chromosomal SNPs and revealed that Brazilian P. simium isolates form a unique cluster (Fig. 2B). A comparison of P. simium and P. vivax mitochondrial sequences confirmed the known P. simium-specific haplotype with a cytosine at 4133 and guanine at 4467 positions (relative to P. vivax PvP01), in all isolates with sufficient coverage (> tenfold) (Figure S9). A further mutation at position 2474 was observed only in six new P. simium isolates, as well as in P. vivax isolates across South America (Table S7).
Population structure of 392 P. simium and P. vivax isolates across South America. (A) Map with the color coordinated geographic locations and populations sizes (P. vivax: Mexico = 20, Colombia = 58, Peru = 97; Panama = 46, Guyana = 3, Brazil = 130; P. simium: Brazil = 38) (B) Multidimensional scaling of first two components with annotated cluster of P. simium isolates.
Furthermore, we compared the nucleotide diversity of P. simium to P. vivax from Brazil. P. simium samples had overall low nucleotide diversity (π) when compared to P. vivax isolates from Brazilian districts (Wilcoxon rank-sum test, P = 2.2 × 10–16) (Figure S10). P. simium samples obtained from human sources demonstrated the lowest overall nucleotide diversity (π range: 2.00 × 10–6 to 6.94 × 10–4), followed by similar nucleotide diversity in P. simium isolates from NHPs (π range: 1.40 × 10–5–6.86 × 10–4) (Wilcoxon rank-sum test, P > 0.01) (Figure S11). In contrast, P. vivax samples from Brazilian districts had greater nucleotide diversity (Figure S11). Although, the sample size is small, this analysis supports the previous postulated hypothesis that P. simium infections of NHPs occurred due to a reverse-zoonotic spillover of P. vivax from humans8.
Identification of mutations in P. vivax drug-resistance candidate orthologues
Given the strong genetic diversity observed in P. vivax populations from South America, we examined non-synonymous mutations in candidate drug-resistance genes within P. simium to assess whether similar variation is present. Non-synonymous mutations were identified in P. simium and P. vivax isolates (n = 392) across five genes of interest (Table S8). The DHFR N117S mutation was identified in P. simium (81.6%) and P. vivax from Mexico (95%) and elsewhere (15%), with N117T being present in Brazil (n = 1). Other mutations connected with drug resistance such as DHFR N50K, R58K, S116G, and I173L were not present in P. simium isolates. Only one mutation at DHFR N410S was unique to P. simium. Among five missense mutations in DHPS, only two were present in P. simium isolates: G383A (P. simium: 89.5%, P. vivax Brazil: 2.3%, P. vivax Mexico: 100%, other P. vivax 55.5%) and G439E present only in isolates from São Paulo at low frequency. Previously observed MDR1 L1076F and M958T mutations were not present in P. simium isolates but were seen with the expected high frequency in other South American P. vivax infections. A previously unreported MDR1 S698G mutation, found outside of ABC transporter domain, was identified in both P. simium (73.7%) and at high frequency in other P. vivax isolates. Additionally, three MDR1 mutations (T723S (13.2%), H681R (2.6%), L18M (2.6%)) were only present in P. simium genomes. There were two additional mutations unique to P. simium not observed in P. vivax from South America, specifically in Kelch13 (K424N, 23.7%) and CRT (E56Q, 2.6%).
Known and novel structural variations in population isolates
Protein alterations revealed from the comparative P. simium Ps_SãoPaulo and P. vivax PvP01 genome analysis were assessed across the isolate data based on structural variant discovery by integrated paired-end and split-read analysis (see Methods). We report on 242 structural variants (195 deletions; 28 inversions; 15 insertions and four duplications) of which 114 were seen uniquely in P. simium and 26% of them were identified in genes of interest (Table S9). A deletion in DBP was present in both P. simium and P. vivax but at varying genomic breakpoints. The deletion is approximately 99 amino acids longer in P. simium (frequency 34%) than in P. vivax (frequency 21%). The deletion in RBP2 was solely unique to P. simium and observed in 63% of isolates (Fig. 3). Modifications in CSP (P. simium 31%, P. vivax ~ 10%) and CTRP (P. simium 34%, P. vivax 13%) had almost identical breakpoints in P. simium and P. vivax isolates. Remaining indels identified on protein level within genes of interest were confirmed by genomic analysis with identified inversions in MAEBL and TSR3 both at 10% frequency, a duplication overlapping MSP3.1 and MSP3.2 seen in 13% of samples, a deletion in TRAG13 (42%), and a small deletion and inversion in TRAG12 at 47% and 10%, respectively (Figure S12). Genomic analysis revealed additional deletions in PRP24 (45%), ARV1 (37%), DHX57 (37%), GCbeta (37%), and LISP2 (37%) solely unique to P. simium (Figure S13). Several indels, identified outside of genes of interest, had high frequencies in P. simium, but similar breakpoints at lower frequencies in P. vivax species (Figure S14).
Deletion profiling with window-based (10 bp) coverage analysis for DBP (chromosome 6) and RBP2a (chromosome 14) applied to 392 isolates. Heatmap columns represent 10 bp genomic segments across each gene, rows represent individual samples, and the colour scale represents log10 coverage, with gray indicating 0 coverage. Samples are grouped by species. Additionally, only isolates with an average coverage greater than 5X across the inspected gene were included in the visualisation. (A) Duffy binding protein (DBP) and; (B) Reticulocyte-Binding Protein 2a (RBP2a).
Genomic regions under positive selection
To investigate recent positive selection in P. simium from Brazil compared to other South American P. vivax populations, we applied two complementary haplotype-based methods: integrated haplotype score (iHS) and cross-population extended haplotype homozygosity (XP-EHH). The iHS statistic detects evidence of recent selection within a population by measuring the decay of haplotype homozygosity around a given SNP, comparing the extended haplotypes carrying the ancestral and derived alleles. A high absolute iHS value indicates that one allele has risen rapidly in frequency, leaving a long-range haplotype, consistent with a recent selective sweep. In contrast, XP-EHH compares haplotype homozygosity between two populations to identify alleles that have become fixed or nearly fixed in one population but not the other. This makes it particularly useful for detecting selection events that have occurred in P. simium but not in P. vivax, or vice versa. Together, these methods provide complementary insights into recent or ongoing adaptation, allowing us to identify candidate loci under selection that may reflect host-switching events, environmental adaptation, or immune evasion specific to P. simium in Brazil.
For the P. simium samples obtained from Brazil, a combined total of 38 sites were potentially undergoing positive selection (iHS) (Table S10, Figure S15). Some sites were observed in known genes, including ATP4, LISP1, and MTIP. We also searched for gene products in genomic regions undergoing positive selection, which included proteins of unknown function, a signal recognition protein (SRP68), and genes involved in transcription and translation (Table S11). Due to the small sample size, a − log10[1–2 | ΦiHS–0.5 |] threshold of 2.5 was also used to identify 19 unique sites in P. simium considered to be undergoing selection (XP-EHH) (compared to South American P. vivax) (Table S12, Figures S16-19). Such sites were found to overlap between geographical comparators and were observed in similar genes, including I2, PVP01_1142200 and PVP01_0708700. A single hit was also detected in MND1 (Colombia). Gene products in genomic regions considered to be undergoing positive selection were also compared across populations (Table S13). Two regions on PvP01_v1 chromosome 5 (1,010,000–1,040,000) and chromosome 10 (180,000–200,000) had a mean XP-EHH > 2.5 and were observed in the comparison between P. simium and P. vivax populations from Brazil. The majority of genes encoded products with unknown functions, but also included a Plasmodium exported protein (PHIST) with unknown function; putative glutamyl-tRNA(Gln) amidotransferase subunit A; putative DNA helicase MCM9; and phosphoenolpyruvate.
These findings suggest that P. simium is undergoing recent and potentially adaptive changes in response to unique selective pressures distinct from those acting on South American P. vivax populations. The identification of shared and population-specific signals may reflect evolutionary adaptation to different hosts or environments following the zoonotic transition. The presence of selection signals in both known and uncharacterised genes highlights the importance of further functional characterisation to better understand the molecular basis of P. simium’s divergence and adaptation.
Comparison between P. simium isolates obtained from human and non-human primate sources
To investigate the genetic basis of host adaptation in P. simium, we performed a population differentiation analysis (FST) between parasites isolated from humans and NHPs. This approach aims to identify genomic loci showing high divergence between the two host-associated groups, potentially reflecting selection pressures associated with adaptation to different host environments. Within the P. simium dataset, comprised of novel and previously published WGS data, 32 were obtained from Homo sapiens and six from Alouatta guariba (Brown Howler Monkey) (Table S6). Given the small sample size, analysis of selection was not possible. Therefore, we opted to compare these populations using FST to identify any SNPs that may be specific to P. simium isolates coming from each source. In total, 24 variants had high FST (> 0.85) (Table S14). Eighteen variants were observed exclusively in P. simium isolated from brown howler monkeys, including five intergenic variants, three intronic variants (PVP01_1109400, BPLP, PVP01_0813100 and CEP76), three synonymous mutations (PVP01_0813100 N629, CRMP1 C1544 and PVP01_1105700 S3105), and six nonsynonymous mutations (PVP01_1247800 S315C, PVP01_0824300 A148S, PVP01_1022500 A7418T, PVP01_1257400 F99I, PVP01_1270700 G10V and PVP01_1313300 G996D) (P < 0.01). The remaining mutations with high FST were shared between P. simium isolated from humans and brown howler monkeys but found at low frequencies in human-derived samples. An intronic variant in SMC2 with FST of 0.86 (P < 0.01) was found mostly in human samples (N = 18) rather than brown howler monkeys (N = 1). Such variants in candidate genes may be involved in host switching.


