The overlapping global distribution of dengue, chikungunya, Zika and yellow fever

Categories: Disease & Virus

April 10, 2025

To map environmental suitability for Aedes-borne arboviral disease, we applied an ENM framework which required: (i) occurrence data; (ii) a set of climatic, environmental and socioeconomic covariates that are known to influence the transmission of these viruses or population dynamics of their vectors; and (iii) a statistical model that parameterises a function which describes the predicted probability of disease being present in each spatial location. The resulting model produced 5 × 5 km spatial-resolution global maps of probability of occurrence of dengue, chikungunya, Zika and yellow fever given long-term average environmental conditions.

Arbovirus occurrence data

The primary epidemiological data included in our analysis was occurrence data, defined as a unique geographic location where one or more cases of a particular disease been reported at any point in time²³. We extracted existing occurrence data from previous publications on dengue¹⁷, chikungunya^18,38, Zika¹⁹, and yellow fever²⁰. Additional searches were conducted using ProMED mail (http://www.promedmail.org) reports between the period 2015–2022 for chikungunya and Zika to fill important temporal gaps following previously established protocol^23,39. Despite potential variations in coverage and lack of precise geopositioning, ProMED remains valuable for its comprehensive and timely^40,41 tracking of global outbreaks of emerging infectious diseases, supporting traditional surveillance efforts effectively. We also searched through epidemiological bulletins of the ECDC^8,42,43,44, WHO regional outbreak updates^45,46 and included occurrence data for countries where diseases had not been reported in the past, but had occurred between 2016 and March 2024 (e.g., dengue in Ibiza, Spain and chikungunya in Uruguay), or where occurrence data were available at the sub-national level (e.g., yellow fever in Bolivia, Brazil, and Peru). Finally, occurrence data for dengue, chikungunya, Zika, and yellow fever from HealthMap platform (www.healthmap.org) was extracted and inspected by cross-referencing with the peer-reviewed literature^{47,48,49,50,51} and the epidemiological bulletins. Additional data uncovered through cross-referencing with these sources was incorporated only if they related to subnational levels. Detailed information about the data sources and the number of occurrence data extracted from each source are presented in Supplementary Table 1.

We included occurrence data in two forms: point and polygon. Point data represents infection in a specific location with a minimum resolution of 0.05 degrees which represents the highest available resolution of all covariate datasets. Polygon data represents occurrence somewhere within an administrative unit area, but with the exact location unknown. With the exception of small island nations, all polygon data is at the first or second administrative unit. Both point and polygon data underwent a spatial standardisation (“thinning^52,53”) where repeated reports from the same location, irrespective of their reporting time, were removed. This reduced the influence of arbitrary variations in reporting frequency between locations (e.g. monthly vs annually) that would have increased bias in areas that choose to report more frequently. Like other species and disease mapping studies, our approach assumes “niche conservatism”, i.e. the environmental limits that define where transmission is possible, and probable do not change substantially over time. This means that the best estimate of this conserved niche comes from aggregating data over as many years as possible, as opposed to just using contemporary data. The final standardised dataset included 13,127 records, comprising 5337 point locations and 7790 polygon locations. The total number of occurrence data before and after thinning and for each disease are shown in Supplementary Table 2.

Occurrence data for other viral diseases

To attempt to account for spatial differences in surveillance¹⁷ (see section “Ecological niche model”) we assembled a dataset that included occurrence data for other viral diseases from the HealthMap platform (www.healthmap.org). HealthMap is a tool for tracking infectious disease events in 15 functional languages by aggregating data from diverse informal online sources, including news media and social media posts, and employs machine learning algorithms to classify and filter these reports, ensuring relevance and reducing noise⁵⁴. For this analysis, HealthMap data was extracted from a MySQL database using structured query to obtain viral disease occurrences between 2006 and 2019. We carefully reviewed the list of viral infections to identify those under dedicated surveillance programmes and excluded occurrence points for Ebola, HIV/AIDS, measles, and polio, as these may lead to an overestimation of surveillance capability in certain regions, particularly in West African countries. This resulted in a total of 338,005 records, representing occurrence reports for viral diseases that cause acute febrile illness, primarily diagnosed by serology and polymerase chain reaction (PCR). The final standardised dataset after thinning included 21,700 records. The full list of these diseases and their counts before and after thinning is shown in Supplementary Table 3.

Background points

No globally representative survey data are available for these diseases, making classical geostatistical models and their assumptions of unbiased sampling from a clear denominator population, unsuitable¹⁶. We instead used a “presence-background” or “presence-only” modelling approach⁵⁵ where observed ‘presence’ points are supplemented with randomly generated background (absence) across all land surfaces. These background points represent areas where we assume the disease is absent, providing a contrast to the observed presence points in the model. To reduce label-imbalance which can lead to bias, the number of background points selected was proportionate to the number of dengue, chikungunya, Zika and yellow fever occurrence points separately in order to maintain a 1:1 ratio between presence and background points with each disease (sometimes referred to as “down-sampling⁵⁶”).

Covariates

We included global raster layers associated with a range of factors hypothesised to be associated either with transmission of Aedes-borne arboviruses or with capacity to detect, diagnose and report cases. Our choice of covariates was based on a previous systematic review of arbovirus risk mapping studies⁵⁷. Covariate data sources were prioritised by those that gave the highest spatial resolution and covered the 2010-2020 time period where most of our occurrence point data are concentrated. Further details of coverage and resolution of each covariate are provided in Supplementary Table 4. Where multiple observations were available across multiple years, mean values for each pixels were calculated to produce a synoptic raster layer representing the average over time period covered. Most of the covariates were available at 0.05 degree (~5 × 5 km at the equator) resolution or higher. For covariates available only at a national scale (e.g., treatment seeking, child mortality, government effectiveness and physicians density), we computed the average value for each World Bank income group and assigned these values to countries with missing data, thus ensuring a globally complete raster layer.

Covariates were resampled to a consistent 0.05 degree grid with a common extent and land/sea mask with lakes and major water bodies removed. The log transformation was optionally applied based on the distributions of each covariate, and all covariates were scaled and centred to have a zero mean and variance of 1. Covariate values for each occurrence point were extracted with mean values across administrative unit areas used for polygon data. Maps of each covariate layer are provided in Supplementary Figs. 16 and 17.

Covariates for surveillance model

To explicitly model spatial variation in the probability of reporting of an arboviral disease (if present in a given location), we considered nine covariates that are known to be related to the likelihood of detection, diagnosis, and reporting of acute viral infections: (i) gross domestic product (GDP) (5 ×5 km resolution and aggregated national level)⁵⁸; (ii) the fraction of urban land⁵⁹; (iii) travel time to healthcare facilities by walk⁶⁰; (iv) travel time to cities (>50,000 people, any travel mode)⁶⁰; (v) treatment-seeking for fever in children under five years old⁶¹; (vi) child mortality under five years old^62,63; (vii) physicians density^62,64; and (viii) government effectiveness^62,65.

Covariates for arbovirus model

We included nine covariates in our arbovirus model that reflect known drivers of transmission and vector dynamics: (i) temperature suitability for dengue virus transmission⁶⁶; (ii) mean temperature of the coldest month⁶⁷ (iii) annual cumulative precipitation⁶⁷; (iv) Normalised Difference Vegetation Index (NDVI)⁶⁸; (v) Dynamic Habitat Indices (DHI)⁶⁹; (vi) predicted suitability for Ae. albopictus⁶⁹; (vii) predicted suitability for Ae. aegypti⁶⁹; (viii) GDP (aggregated at national level)⁵⁸ and (ix) human population density⁷⁰. Three additional covariates were included in yellow fever model: (i) predicted suitability for Haemagogus janthinomys in South America⁷¹; (ii) distribution of non-human primates (NHPs)²⁰ and (iii) yellow fever vaccination coverage⁷². Predicted suitability for Ae. albopictus was not included in the yellow fever model.

Ecological niche model

Approach

We used a machine learning model which has previously proven useful for global environmental suitability mapping applications⁵⁷, including dengue^17,39, chikungunya¹⁸, Zika¹⁹, and yellow fever²⁰, as well as the global distribution map of Aedes vectors²¹. Specifically we used a down sampled random forest (RF) approach that balances the presence and background points at a 1:1 ratio, as RF model has shown to outperform many other machine-learning approaches for the modelling of presence-only data across a range of examples^56,73.

Our approach involves the development of two separate models: a surveillance model and an arbovirus model. The surveillance model aims to capture the between and within country differences in the long-term average probability that a person infected with an acute viral infection seeks treatment, is correctly diagnosed and is reported in the public domain. It should be noted that such estimates may not correlate with ability to detect unfamiliar newly emerging diseases but will represent longer-term differences in surveillance capacity once locally circulating diseases have been appropriately characterised. The predictions from this surveillance model are then used as an offset in the arbovirus model, allowing the arbovirus model to attribute spatial variations in arbovirus occurrence data solely to drivers of transmission risk.

The surveillance model uses occurrence data on all emerging acute febrile diseases including arboviral diseases (Supplementary Table 3) with an equal ratio of randomly sampled background points, following a uniform distribution across the global landscape. The model formula is as follows, with covariates listed in the same order as in Supplementary Table 4.

$${oc}{{c}_{{viral}}}_{i} \, \sim \, {{\rm{Bernoulli}}}\left(P\left({oc}{{c}_{{viral}}}_{i}\right)\right)\\ {{\rm{logit}}}(P({oc}{{c}_{{viral}}}_{i})) \,=\, f\left({GD}{P}_{i},{GD}{{P}_{{National}}}_{i},\right.\\ {Urba}{n}_{i},{{{travel}}_{{health}}}_{i},{tra}{{{vel}}_{{cities}}}_{i},\\ \left.{treatmentseekin}{g}_{i},{childmortalit}{y}_{i},{goveffectivenes}{s}_{i},{physicia}{n}_{i}\right)$$

(1)

$P({{occ}}_{{{viral}}_{i}})$ is the location-specific probability that a record $i$ is an occurrence record of an emerging acute febrile disease, rather than a background point, and $f$ denotes the non-linear and interacting function of all covariates at the same spatial location as record $i$.

The arbovirus model includes occurrence data for all arboviral diseases (dengue, chikungunya, Zika and yellow fever) combined and includes arboviral disease as a categorical variable (${arboviru}{s}_{i}$). This allows the model to share information between arboviral diseases, but also generate distinct relationships between risk of each disease and the environmental covariates if they are justified⁷⁴, e.g. different vector competencies for different viruses. The overall arbovirus model equation is as follows:

$${{\rm{logit}}}(P({oc}{{c}_{{arbovirus}}}_{i}))= f({arboviru}{s}_{i},\\ {Tem}{p}_{i},{Tcol}{d}_{i},{Preci}{p}_{i},{NDV}{I}_{i},{DH}{I}_{i},\\ {aegypt}{i}_{i},{alb}{o}_{i},{Haemagogu}{s}_{i},{{NHP}}_{i},\\ {GD}{{P}_{{National}}}_{i}{Urba}{n}_{i},{Po}{p}_{i})+\\ \log (P({oc}{{c}_{{viral}}}_{i}) \, * \, (1-{Vaccin}{e}_{i}))$$

(2)

where $\log (P({oc}{{c}_{{viral}}}_{i})*(1-{Vaccin}{e}_{i}))$ is an offset term. Two different versions of the arbovirus model are fit using the same data but with different covariates: i) an arbovirus model for dengue, chikungunya and Zika which omits covariates from ${{NHP}}_{i}$ and ${{Haemagogus}}_{i}$ and sets ${{Vaccine}}_{i}$ to 0 and ii) a yellow fever arbovirus model that omits the Ae. albopictus covariate$({{albo}}_{i})$.

The use of a modelled offset for spatial sampling bias based on detections of a wide range of other related diseases is analogous to the commonly used Target Group Background approach⁷⁵, but further enables us to explicitly map the estimated spatial variation in reporting rates. Unlike other approaches that have been proposed in ecology to infer spatial bias in a single model⁷⁶, the two-stage approach we employ separates the modelling of surveillance and disease processes. This distinction minimises the potential for overlap in the effects of covariates (e.g. urbanness and GDP) that influence both surveillance and disease dynamics, which is particularly important for the arboviruses we consider.

Spatial block cross-validation

Randomly repeated cross validation may lead to an over-optimistic model performance metrics if the independence assumption between training and test data is violated. This is especially common when dealing with spatially structured (clustered) data where training observations are often in close proximity to test observations, introducing spatial autocorrelation⁷⁷. To maintain independence between training and test data, a spatial block 100-fold cross-validation was employed to assess the overall and spatially stratified predictive performance of the model. This approach considers the spatial dependence structure of the data when estimating predictive performance, resulting in a final metric that is typically lower than would be returned by conventional cross-validation procedures. Initially, a grid of ~1000 equal-sized square blocks, each with a size of 500 x 500 km, was generated using the cv_spatial function from the “blockCV” package⁷⁸. Each block was then randomly assigned to 100 roughly equal-sized folds, maintaining the balanced number of presences versus background records in each fold. In each iteration, we reserved one-fold (1%) for validation, which was kept out of the model fitting process (99%) and only used to validate predictions (‘out-of-sample’ validation). By cycling through the data, each fold was used as a validation fold exactly once and then we averaged model performance statistics across the 100 repeated runs, using the area under the curve (AUC) of the receiver-operating characteristic plot as a metric of goodness-of-fit. AUC is a confusion matrix-based model performance metrics that measures how well the model can discriminate presence from background points⁷⁹.

Spatially stratified AUCs

To understand and visualise the model’s performance in different parts of the world, the spatial map of AUC values was produced using a geospatial approach. We calculated AUC values for a set of validation polygons that are defined by a 250 km radius around each presence or background (PB) point. ROC curves and AUC statistics were then calculated within each of these polygons which was summarised using the coordinates of the centroid for visualisation purposes. The resulting values were then averaged to derive a comprehensive AUC value for each specific PB point. Additionally, regionally stratified AUCs are calculated by averaging the AUC values of PB points within regional boundaries. This iterative approach was chosen to balance spatial continuity of the map while mitigating the impact of small spatial units on AUC values.

Prediction and uncertainty levels

To increase the robustness of model predictions and quantify model uncertainty, we obtained 100 predictions by using RF models that were iteratively calibrated during spatial cross-validation as described above. Each of the 100 fitted sub-models predicted environmental suitability on a scale from 0 to 1. Predictions were made separately for each disease, by updating the “arboviral disease” variable (see section “Ecological niche model”) to represent different diseases. To generate the final prediction raster, a weighted average prediction was calculated for each 5 × 5 km pixel based on the following equation:

$${P}_{k}=\frac{{\sum }_{i=1}^{n}({p}_{k,i}\times {w}_{i})}{{\sum }_{i=1}^{n}{w}_{i}}$$

(3)

where ${P}_{k}$ represents the weighted average of predicted risk for pixel $k$, ${p}_{k,i}$ represents the predicted risk for pixel $k$ derived from the $i$-th sub-model, ${w}_{i}$ represents the AUC of the $i$-th sub-model, and $n$ is total number of sub-models (here $n=$100). We calculated the interquartile range (IQR) for 100 model predictions at each location to quantify uncertainty. We also generated 1000 bootstrap samples for weighted average prediction for each pixel and determined the confidence intervals by computing 2.5th and 97.5th percentiles from their distributions.

Variable importance and partial dependence plots

The metrics employed for variable importance included mean decrease in accuracy and mean decrease in node impurity (measured by the Gini index). The variable importance values were extracted and normalised by dividing each value by the sum of all feature importance values and averaged across 100 folds. Partial dependence values for each model were also extracted using the “pdp” package and aggregated across 100 models by calculating the mean and the 95% confidence interval.

Masking

Following previous global mapping studies^17,21, to limit extrapolation of risk predictions far beyond their original fitting datasets we post-hoc mask out (set risk to 0) model-based risk predictions in areas where alternative forms of evidence suggest that transmission is extremely unlikely. This included where temperature profiles were estimated as too cold to allow mosquitoes to survive long enough to complete the extrinsic incubation period of the dengue virus⁶⁶. Temperature suitability values were calculated using the mean temperature of 2010–2020⁶⁷ and converted to binary range maps using threshold of 14/365 = 0.04 as they were unlikely to support transmission over the two week serial interval of autochthonous arboviral disease transmission. Environmental suitability predictions for dengue, chikungunya and Zika in areas with temperature suitability values of less than 0.04 were set to 0. Risk predictions for yellow fever were set to 0 in areas outside the countries at risk, endemic, or potentially at risk for yellow fever as defined by the WHO yellow fever risk assessment working group which excluded countries outside of South America and Africa³¹. Masking layers and unmasked versions of environmental suitability maps are provided in Supplementary Figs. 18 and 19 to show the regions of the world affected by the masking process.

Estimating distribution and population at risk

The continuous suitability maps were converted into binary distribution maps using a threshold value above which an area was classified as at-risk. We defined the threshold as the suitability value that maximised the global sum of sensitivity and specificity when compared the original presence and background points. Separate thresholds were generated for each disease, with the following suitability values: dengue (0.37), chikungunya (0.21), Zika (0.14), and yellow fever (0.32), to produce individual binary maps. An aggregate binary map for dengue, chikungunya and Zika was then created by combining these individual binary maps from the three diseases, such that any area predicted to be at risk for any of these diseases was considered at risk. Using these binary maps and the global population grid⁷⁰, we calculated population at a global, continent and country (UN member state; 193 in total) level. For the purpose of estimating total countries at risk, a country was only included if more than 10% of its total population was identified as being at risk of disease.

Sensitivity analyses

To test if the maps for Zika were improved with a disease-specific temperature suitability covariate, the dengue temperature suitability covariate was replaced with a Zika temperature suitability covariate based on temperature relationships from Ryan et al.⁸⁰ The improvement of the maps was assessed using a 50-fold block cross validation approach comparing overall and regionally-stratified model performance metrics (AUC, sensitivity and specificity) of the base model (as presented in the section “Ecological niche model”) against models with alternative specifications for the relationship between temperature and transmission risk^80,81,82,83.

Given that Ae. aegypti play a minimal role in modern transmission of yellow fever virus in South America, where transmission primarily occurs through sylvatic spillover rather than urban cycle^28,29, an alternative version of the yellow fever arbovirus model was fit without the Ae. aegypti covariate. The improvement of this model was assessed by using a 50-fold block cross validation approach and comparing overall and South America-specific model performance metrics with those of the base model.

To assess the benefits of joint modelling over disease-specific models, we re-ran individual models for dengue, chikungunya, Zika, and yellow fever separately and compared the model performance metrics (AUC, sensitivity, and specificity) with those of the joint model. Model performance was evaluated using 10-fold cross-validation with metrics calculated based on a cutoff value where the sum of sensitivity and specificity is maximised.

Comparison between diseases and with previously published maps

We obtained previously published suitability maps from Messina et al.¹⁷ for dengue, Nsoesie et al.¹⁸ for chikungunya, Messina et al.¹⁹ for Zika, and Shearer et al.²⁰ for yellow fever. These maps were chosen because they represent recent global predictions of disease risk and also utilise occurrence data and species distribution modelling approaches⁵⁷, making a more useful assessment of the advances made in this work. The same threshold calculation above was performed for the previous suitability maps using the same set of presence and background points to convert them into binary maps. This was a necessary calibration step to standardise the scales of predicted suitability. Once the binary maps of at-risk areas were constructed, our map and previously published maps were compared and every 5 ×5 km pixel was categorized into four distinct groups: those at-risk in both our map and previous maps, those at-risk only in previous maps, those at-risk only in our map, and areas deemed not at-risk in either.

Independent validation

As part of the validation process, the initial versions of the suitability maps were presented in June 2023⁸⁴ to the Technical Advisory Group on Arbovirus (TAG-Arbovirus⁸⁵), a multidisciplinary group of experts supported by WHO Arbovirus Secretariat. Following the presentation, we invited feedback by providing the TAG with an interactive map of occurrence records for each disease (https://ahyounglim.shinyapps.io/multi_arbo_mapping/) and a concise questionnaire (Supplementary Table 5). This questionnaire addressed specific questions, particularly regarding any discrepancies between our model estimates and their context-specific knowledge, as well as additional drivers of arbovirus risk that should be taken into consideration. This resulted in two main changes to our models: i) identified additional data sources for chikungunya occurrences in Brazil³⁸; ii) included additional covariates, such as Haemagogus janthinomys⁷¹ for yellow fever model and under-five mortality, government effectiveness, and physicians density for the surveillance capability model (see section “Covariates”).