Geographical shifting of cholera burden in Africa and its implications for disease control

Categories: Disease & Virus

August 8, 2025

Cholera incidence data

Data sources

We curated a global cholera incidence database of national and subnational surveillance data and other reports from ministries of health (MOHs), GTFCC partners and other public data sources for countries in Africa from 2011 through 2020, which included a comprehensive online search for national and subnational cholera outbreak reports for every modeled country and year. Shapefiles were obtained from MOHs, WHO country or regional offices, unified or curated sources such as GADM, geoBoundaries and GRID3 and other online sources (https://data.humdata.org/ and ref. ⁴⁹) and were linked to observation locations. Suspected cholera case definitions varied by data source and were oftentimes not stated but were commonly variations of the recommended WHO suspected case definition, such as ‘any patient presenting with or dying from acute watery diarrhea’ and ‘a patient aged 2 years or more develops acute watery diarrhea with or without vomiting’ (see Supplementary Table 3 for a complete list).

All countries in Africa that had at least one national-level report of suspected cholera (including zero) in both periods of analysis were modeled (Supplementary Table 1). Following this criterion, 11 of 54 countries in Africa were excluded (Algeria, Cape Verde, Comoros, Egypt, Gambia, Libya, Mauritius, Morocco, Sao Tome and Principe, Seychelles and Tunisia).

Data collection protocol and data template

All cholera surveillance and alert documents were systematically scraped for all reported counts of suspected cholera (‘observations’) that were explicitly linked to date ranges and geographical areas (‘locations’) and were thought to represent all cases reported in a specific space-time unit (for example, not representing just a subset of cases, such as age-statified or sex-stratified counts).

Documents were extracted only when contextual information suggested that the document creator thought the data represented a real cholera outbreak. For regularly updating data sources (for example, situation reports), older data were updated to the most recent back-corrected case counts to limit issues stemming from incorrect initial ascertainment and reporting delays and irregularities. Prior to running our final set of models, we also conducted a data audit on highly discrepant outlier observations (for example, sum of subnational case counts greatly exceeded national case counts and high variation in case counts during roughly the same period) to correct individual observations and prune likely reporting errors on a case-by-case basis.

Location names were systematically verified, and associated geographic shapefiles were identified with a standard location audit protocol, which consisted of searches on reputed websites and resources and comparison to locations that already existed in the Cholera Taxonomy database^4,50. Metadata, source documents, shapefiles and observations were then added to the global cholera surveillance database. Each observation contained the following information: location shapefile, date range, number of suspected cases and time fraction (tfrac) within a calendar year, which is calculated from the date range.

Cholera data processing

Cholera data were extracted from the database and passed through a processing pipeline to format and harmonize raw data inputs (which covered a wide range of temporal and spatial scales) for our statistical mapping modeling framework (Supplementary Fig. 5). The main data processing steps consist of temporal aggregation to the yearly timescale, identifying temporally censored observations (those spanning fewer than 8 months of a year), filtering observations that do not contribute to the likelihood, imputation of limited national-level observations and assigning observation-linked geographic areas (‘locations’) to the spatial modeling grid. After modeling, the resulting gridded estimates undergo postprocessing to produce estimates for unified, non-overlapping administrative units.

Temporal aggregation

Our statistical mapping model aimed to infer mean annual incidence rates, so we sought to aggregate observations to the annual time resolution. As observations may exist for arbitrary locations and date ranges, non-overlapping observations that were consecutive in time were aggregated if they were in the same location, calendar year and source document. If a location had multiple observations of suspected cholera for the same time bounds within the same data source, we included the observation with the largest case count in the aggregated observation; the implicit assumption here is that cases are more likely to be underreported than overreported, so we give preference to higher case count reports. This resulted in a set of aggregated observations per location, year and data source and the corresponding fraction of calendar year that they covered, which were used as model inputs.

Identifying time-censored observations

Our modeling framework did not assume that cholera incidence was homogenous throughout the year (see Methods, ‘Statistical framework for modeling mean annual incidence’), and we, therefore, differentiated full-year observations from partial-year observations in the model likelihood. Partial-year observations were considered to be right-censored if they spanned fewer than 8 months (0.65 years). We dropped all right-censored observations with zero suspected cases, as these have a likelihood of 1 and, therefore, do not contribute to the model likelihood.

Observation filtering

Observations were dropped from inclusion in the model if they have a likelihood of 1 (and, therefore, would not contribute to the likelihood) or were otherwise not informative to the model. Although some of these decisions are discussed elsewhere in Methods, broadly speaking, observations that were removed include those that (1) were not associated with a geographic shapefile, (2) were ADM0 (country-level) observations that spanned multiple years, (3) exactly duplicated observations prior to temporal aggregation, (4) had the exact same location and time bounds but reported fewer cases than another temporally aggregated observation from the same data source, (5) were time-censored observations that reported zero cases, (6) were time-censored observations at the ADM0 level that had less than half the reported cases as full-year ADM0 observations in that timeslice or (7) had zero population according to the WorldPop raster.

National-level data imputation

At least one country-level annual observation was sought for every country–year combination modeled to improve model stability and performance. This imposed a critical constraint to bound modeled incidence rate estimates, particularly when fitting a model to data with only censored observations or only subnational observations and incomplete spatial coverage across the country.

When a country-level observation was not found in a given year, imputation of a country-level annual observation was performed. If no observations at any spatial scale were available for that year, a zero-case observation was imputed, thus assuming that absence of data in a year corresponded to a report of zero cases for that country. If subnational observations were available and they covered a non-overlapping spatial area that represents at least 10% of the country population, a mean tfrac-adjusted incidence rate was computed across all subnational observations and multiplied by the country population to impute a country annual observation. If subnational observations were available and they covered a spatial area representing less than 10% of the country population, an observation was imputed as the maximum of the sum of cases across all unique data source and administrative unit level combinations. If only censored national observations and no subnational observations were available, the maximum value censored observation was imputed.

In the end, a limited number of observations were imputed (109 imputed relative to 30,102 non-imputed observations) in order to ensure that all modeled countries had at least one country-level observation per year. Of these, zero-case observations were added when no annual country-level report was found (96 imputed observations in 21 countries). When subnational or censored national annual reports were available, non-zero-case observations were aggregated to impute an annual country-level report (13 imputed observations in 10 countries).

Geographic linkage of observations to modeling grid

Our statistical modeling framework was applied to a space-time modeling grid, where the space dimension was composed of 20-km × 20-km grid cells that overlapped with a given country’s geographic shapefile and had a population greater than zero according to the associated WorldPop gridded population estimates in that year; the time dimension was represented in annual time slices. In our cholera surveillance database, some observations corresponded to an area roughly the size of a 20-km × 20-km grid cell, but very few were smaller than that. We, therefore, selected the 20-km × 20-km grid scale as a compromise between the limits of inference in the available data, computational tractability and spatial granularity of the burden estimates, following the example of previously published cholera burden maps⁴.

Observations of suspected cholera were associated with space-time cells in the modeling grid according to their geographic shapefiles and date ranges. In the space dimension, observations were geographically linked to all 20-km × 20-km grid cells that intersected the observation’s geographic shapefile. When grid cells were only partially covered by the observation’s geographic shapefile (for example, grid cells at country borders), we computed and assigned a spatial fraction (sfrac) to the grid cell–shapefile pairs. The sfrac value was calculated as the sum of the 1-km × 1-km grid cell population (after aligning the 1-km × 1-km WorldPop gridded population estimates to the 20-km × 20-km spatial modeling grid) that intersected the observation’s geographic shapefile divided by the total 20-km × 20-km grid cell population. In the time dimension, observations were mapped to all annual time slices that overlapped with the observation date range.

We removed cell–shapefile linkages with small overlaps in order to improve the smoothness of model estimates at shapefile borders. Spatial grid cells that overlapped with an observation’s geographic shapefile with a population-weighted spatial fraction below 0.05 were removed from being associated with the shapefile. To improve the smoothness of model estimates at country borders, we removed spatial grid cells from the space-time modeling grid if the grid cell sfrac was less than 0.3.

Other spatial data sources

Gridded annual 1-km × 1-km population estimates were taken from the unconstrained global mosaic WorldPop population counts dataset and matched by cholera observation year (https://www.worldpop.org/). The gridded estimates were then linearly scaled such that the total country population matched the respective annual estimate from the 2022 revision of the United Nations World Population Prospects⁵¹. For map visualizations, we accessed major water bodies in Africa published by the Regional Centre for Mapping of Resources for Development and the AQUASTAT program of the Food and Agriculture Organization of the United Nations^52,53.

Statistical framework for modeling mean annual incidence

We developed a hierarchical Bayesian modeling framework that accounts for spatiotemporal heterogeneity in underlying suspected cholera incidence and variability and overlap in the spatial and temporal scales of case reports within and across data sources. In particular, the framework accounted for misalignment in spatial and temporal resolutions both across data sources and between observed case counts and intended outputs. This model expanded on a previously published approach⁴.

Model overview and inference

We modeled each country and time period (2011–2015 and 2016–2020) separately. The process model for most country-periods consisted of a log-linear model of annual cholera incidence rates over a 20-km × 20-km grid accounting for spatial autocorrelation and interannual variability. Spatial autocorrelation was implemented through a directed acyclic graph autoregressive (DAGAR) prior for the spatial random effects, which has improved performance relative to traditional spatial priors in disease mapping⁵⁴. Temporal variability was modeled through annual temporal random effects. Yearly observations and temporally censored observations contributed to separate parts of the likelihood. The process models for country-periods with no or minimal subnational data were modified to improve interpretability of results and model performance (see Supplementary Table 4 for model settings by country and Methods, ‘Mean annual incidence model equations’ for model details). The modeled mean incidence rate for an area corresponding to a reported case count was derived as the weighted average of the incidence rates of grid cells covering that area. Observations overlapping in space and time were treated as independent measurements of the same underlying incidence rate. To account for reporting variability across data sources, we used a negative-binomial observation model for the case counts with inferred administrative level-specific overdispersion parameters (Methods, ‘Mean annual incidence model equations’).

Posterior samples were drawn with Hamiltonian Monte Carlo (HMC) as implemented in the Stan programming language⁵⁵. Sampler convergence was assessed visually through the inspection of trace plots and observation-level Rhat statistics⁵⁶. Model fit was evaluated through scatterplots of true and fitted observations and posterior retrodictive checks of the posterior coverage of observations by administrative level⁵⁷ (Supplementary Figs. 6–15). Gridded outputs were postprocessed for mean annual incidence, mean annual incidence rate, IRR, population in ADM2 units in 5-year and 10-year incidence categories, assignment of ADM2 units in 5-year and 10-year incidence categories at ADM0 and ADM2 (sometimes called country-level and district-level, respectively) and region and continent scales for analysis (Methods, ‘Incidence modeling postprocessing’).

Mean annual incidence model equations

We first describe the base statistical model and add complexity that improves the base model’s ability to handle challenges presented by the real-world observation data. The core of the modeling framework consists in partitioning the variability of suspected cholera case reports among interannual variability, captured by yearly random effects, spatial patterns captured by a spatial autocorrelation prior and assumed to be constant for each 5-year modeling period and observational variability captured through an overdispersed observation model (negative binomial). Sharing spatial information across years helped improve the quality of the spatially resolved estimates given the limited availability of subnational data in many countries. We think that this partitioning of variability between temporal and spatial effects and observational overdispersion into two modeling periods represents an appropriate compromise of reliable volumes of input data, model complexity, spatial smoothing and data-driven inference to meet our analysis aim.

At the end, we present the full final ‘standard’ model structure and deviations from this standard model structure, which were deployed in country-periods described in Supplementary Table 4.

Base model

To estimate mean annual incidence across a period of T years (T annual time slices), we first must model annual cholera incidence estimates corresponding to a ‘modeling time resolution’ of 1 year. In a simple scenario, suppose that all observations have a duration of 1 year, which means that the ‘observation time resolution’ always equals the modeling time resolution. To model space-time incidence rates over a spatial domain that covers the area of interest across the T years, we defined a modeling space-time grid with a time resolution of 1 year for a given gridded spatial resolution—that is, each space-time grid-cell (s,t) spans 1 year t and a spatial grid cell s.

Observation-level cases can then be modeled as:

$${c}_{i}={\sum}_{{S}_{i,s,t}}\quad{\lambda}_{s,t}{\phi }_{i,s}{\rm{pop}}_{s,t},$$

$$\log \left({\lambda }_{s,t}\right)=\gamma +{\omega }_{s}+{\eta }_{t},$$

where ${c}_{i}$ represents the modeled mean number of cases for observation i; ${S}_{i,s,t}$ is the set of space-time grid cells intersecting observation i; ${\lambda }_{s,t}$ is the annual incidence rate in space-time grid cell s,t; ${\phi }_{i,s}$ is the population-weighted spatial fraction of grid cell s,t that is covered by the observation location; and ${\rm{pop}}_{s,t}$ is the total population in grid cell s,t. Grid cell incidence rates were modeled with a log link as the sum of the offset $\gamma$, which is the expected incidence rate across the space-time modeling grid, spatial random effect ${\omega }_{s}$ and yearly random effect ${\eta }_{t}$.

The expected incidence rate $\gamma$ was calculated as the population-weighted mean of the implied incidence rate (time-adjusted reported cases of full-year observations divided by location population) across all full-year observations contributing to the model (see Methods, ‘Identifying time-censored observations’). If the expected incidence rate was less than 0.01 per 100,000 population, it was changed to be 1 × 10⁻⁷.

Observation ${y}_{i}$ is then linked to modeled cases through an observation model. For instance, in the simplest setting, one can assume that observations follow a Poisson distribution:

$${y}_{i}\sim {\rm{Poisson}}({c}_{i}).$$

However, a Poisson model does not reflect the heterogeneity in case counts observed in the data, thereby necessitating a more elaborate observation process model. We expand on the observation process below.

Prior on the spatial random effect (ω
_s)

To capture spatial variability in the incidence rates and produce spatially smooth maps, we introduced spatial random effects into the model at the grid cell level. We assumed that the spatial random effect ${\omega }_{s}$ was constant across the T time slices to reduce the number of parameters that must be estimated from a model that may have limited observations in any given time slice. We acknowledge that this model may not adequately capture situations where the spatial autocorrelation in cholera cases changes across time slices. In such scenarios, the estimates of ${\omega }_{s}$ will represent the average spatial variability across all the time slices.

In our model, the joint distribution of ${\omega }_{s}$ for all spatial grid cells s is specified as a DAGAR prior. The DAGAR model was demonstrated to have improved model performance, interpretability of parameters and computational efficiency over other spatially smooth priors that are traditionally used in disease mapping (for example, conditional autoregressive prior)⁵⁴. The DAGAR prior can be specified via a sequence of simple conditional normal distributions. Specifically, the conditional distribution of ${\omega }_{s}$, conditional on its directed neighbors on the grid, follows a normal distribution with mean ${\mu }_{{\omega }_{s}}$ and s.d. ${\sigma }_{{\omega }_{s}}$:

$${\omega }_{s}\sim {\rm{Normal}}\left({\mu }_{{\omega }_{s}},{\sigma }_{{\omega }_{s}}\right),$$

$${\mu }_{{\omega }_{s}}=\frac{\rho }{(1+(n{n}_{s}-1){\rho }^{2})}\sum _{u\in {\varOmega }_{s}}{\omega }_{u},$$

$${\sigma }_{{\omega }_{s}}={\xi }_{{\sigma }_{w}}\sqrt{\frac{(1-{\rho }^{2})}{(1+(n{n}_{s}-1){\rho }^{2})}},$$

where $\rho$ is the strength of the spatial autocorrelation between grid cells; ${{nn}}_{s}$ is the number of neighbors of cell s; and ${\varOmega }_{s}$ is the set of neighbors to cell s. We denote the DAGAR prior as ${\mathbf{\upomega }} \sim {\rm{DAGAR}}(\rho ,{\xi }_{{\sigma }_{w}})$ where ${\mathbf{\upomega}}$ is a vector containing ${\omega }_{s}$ for all the spatial grid cells s.

Prior on the temporal random effects (${\eta }_{t}$)

Although the yearly temporal random effects were initially assumed to be independent, we imposed a zero-sum constraint to improve identifiability of these parameters and enforced a marginal standard normal prior on the set of these terms⁵⁸. In brief, the approach applies a QR decomposition on the covariance matrix of the yearly random effect to obtain a set of random variables with a zero-sum constraint and marginal s.d. values of 1. In practice, priors are set on T − 1 independent random effects, and the random effect of the T-th time slice is computed from them.

Expansion for partial-year observations

Partial-year observations (that is, those with tfrac < 0.65 within a given annual modeling time slice) were treated as right-censored in the likelihood (see Methods, ‘Identifying time-censored observations’). Because we assumed that incidence rates were non-homogeneous within a given annual time slice, we chose to treat partial-year observations as right-censored observations of the annual counts as opposed to performing an extrapolation to represent a full year. In other words, we make no assumptions beyond that the number of cases in the full-year modeling time slice would be at least as large as the number of observed cases in the partial-year observation. The observation model likelihood for partial-year observations was:

$$L(\,{y}_{i})=\Pr (Y\ge {y}_{i}|{c}_{i}),$$

So, in the case of the Poisson observation model, the likelihood is:

$$L\left(\,{y}_{i}\right)=1-{\rm{CDF}}_{\rm{Poisson}}\left(\,{y}_{i}|{c}_{i}\right).$$

Expansion for overdispersed observation data

Examination of the observation data determined that a Poisson observation model would not be sufficient to account for the overdispersion observed in many country-period models and across different administrative unit levels. Consequently, we accounted for overdispersion with a negative binomial observation likelihood:

$${y}_{i}\sim {\rm{NegBinom}}\left({c}_{i}, {\tau }_{A\left[i\right]}\right),$$

where τ is the overdispersion parameter that defines the relationship of the mean $c$ to the variance:

$${\rm{variance}}=c+\frac{{c}^{2}}{\tau }.$$

To account for expected differences in overdispersion across administrative level reporting, the model allowed for different overdispersion parameters by observation administrative unit level A[i]. The overdispersion parameter (${\tau }_{A0}$) was fixed at the country level (A0) but inferred for all other administrative unit levels.

Complete standard model formulation

The final standard model followed a hierarchical structure, such that the process model was defined:

$${c}_{i}=\sum _{{S}_{i,s,t}}\quad{\lambda }_{s,t}{\phi }_{i,s}{po}{p}_{s,t},$$

$$\log \left({\lambda }_{s,t}\right)=\gamma +{\omega }_{s}+{\eta }_{t},$$

and the observation model was defined for full-year and partial-year observations:

$$\Pr(\,{y}_{i}|{c}_{i})={\rm{NegBinom}}(y_i|{c}_{i}, {\tau }_{A[i]}) \qquad \text{if} \quad {\varPhi}_{i,t}\ge a,$$

$$\Pr \left(\,{y}_{i}|{c}_{i}\right)=1-{\rm{CDF}}_{\rm{NegBinom}}\left({y}_{i}|{c}_{i},{\tau }_{A\left[i\right]}\right)\qquad \text{if} \quad {\varPhi }_{i,t} < a,$$

where $a=0.65$, the threshold above which the time fraction ${\varPhi }_{i,t}$ for observation i is considered to represent the full year t.

We used the following hyperpriors for the spatial random effects:

$${\mathbf{\upomega }} \sim {\rm{DAGAR}}\left(\rho ,{\xi }_{{\sigma }_{w}}\right),$$

$$\rho \sim {\rm{Beta}}\left(5,1.5\right),$$

$$\Pr \left({\xi }_{{\sigma }_{w}}\right)=\theta {f}_{\rm{Normal}}\left({\xi }_{{\sigma }_{w}}|10,{\sigma }_{\omega ,1}^{{\prime} }\right)+\left(1-\theta \right){f}_{\rm{Normal}}\left({\xi }_{{\sigma }_{w}}|0,{\sigma }_{\omega ,2}^{{\prime} }\right),$$

$$\theta \sim {\rm{Beta}}\left(1,3\right),$$

$${\sigma }_{\omega ,1}^{{\prime} }\sim {\rm{Half}\; \rm{normal}}\left(0,2\right),$$

$${\sigma }_{\omega ,2}^{{\prime} }\sim {\rm{Half}\; \rm{normal}}\left(0,0.5\right),$$

where $\Pr ({\xi }_{{\sigma }_{w}})$ represents a mixture prior on the s.d. scaling constant (${\xi }_{{\sigma }_{w}}$), which is the sum of two normal distribution densities (${f}_{\rm{Normal}}$), one centered at 0 and the other centered at 10, weighted by the mixture parameter $(\theta )$. The hyperpriors on the s.d. of the normal distribution densities contributing to the mixture prior are represented by ${\sigma }_{\omega ,1}^{{\prime} }$ and ${\sigma }_{\omega ,2}^{{\prime} }$. This mixture prior reflects the possibility that different models may have high or low magnitude of spatial variability.

We used the following prior for temporal random effects:

$${\eta {\prime} }_{[1:T-1]}\sim {\rm{Normal}}\left(0,\frac{1}{\sqrt{1-\frac{1}{T}}}\right),$$

where ${\eta {\prime} }_{[1:T-1]}$ is multiplied by the QR matrix (see Methods, ‘Prior on the temporal random effects’) to yield the ${\eta }_{[1:T]}$ yearly random effects for all T time slices whose sum is enforced to be 1.

For the observation model, the overdispersion term for administrative unit level 0 (A0) observations was fixed at 100 or 1,000, which corresponded to a moderate amount of overdispersion for case counts of that magnitude. We used the following priors for the overdispersion term in the observation model:

$${\tau }_{A0}=100\quad{{\rm{if}}\; {\rm{max}}}(\,{y}_{i,A0})\le 5{,}000,$$

$${\tau}_{A0}=1{,}000\quad{{\rm{if}}\;{\rm{max}}}(\,{y}_{i,A0})>5{,}000,$$

$$\frac{1}{{\tau }_{A > 0}}\sim {{\rm{Half}}\; {\rm{normal}}}\left(0,1\right),$$

where A > 0 refers to administrative level units below the national level.

Model formulation without spatial autoregressive term

For country-periods with no subnational observations or only zero-case observations, we removed the spatial random effect from the process model. For this model deviation, spatial random effect priors were removed, and the process model was as follows:

$${c}_{i}={\sum}_{{S}_{i,s,t}}\quad{\lambda}_{s,t}{\phi}_{i,s}{\rm{pop}}_{s,t},$$

$$\log \left({\lambda }_{s,t}\right)=\gamma +{\eta }_{t}.$$

Model selection and model formulation with non-mixture prior

All country-periods with at least one subnational non-zero-case observation were first attempted with the standard model formulation. We found that models with limited subnational data had poor model fit and convergence due to identifiability issues in the spatial autocorrelation strength parameter $\rho$ determining the spatial random effect $\omega$. The standard model employed a mixture prior on the scaling factor (${\xi }_{{\sigma }_{w}}$) of the s.d. of $\omega$ (${\sigma }_{{\omega }_{s}}$) to account for possible bimodality in the strength of spatial autocorrelation (that is, country-periods may have high or low $\rho$ depending on the data).

For country-periods with limited subnational data, model convergence improved when the mixture prior on ${\xi }_{{\sigma }_{w}}$ was replaced with a unimodal prior that favors a higher ${\sigma }_{{\omega }_{s}}$ and, therefore, lower $\rho$. The scaling constant priors for this model deviation reduced to:

$${\xi }_{{\sigma }_{w}}\sim {{\rm{Half}}\; {\rm{normal}}}(5,0.5).$$

Incidence modeling postprocessing

Our modeling framework produced 20-km × 20-km gridded mean annual incidence estimates for each country and time period (4,000 posterior samples). Grid cell estimates were then aggregated to ADM0 (country) and ADM2 (district) scales according to standardized sets of non-overlapping country-unified shapefiles. These were obtained for each modeled country from GADM (https://gadm.org/) or geoBoundaries⁴⁹ after an additional quality assessment (Supplementary Table 5). Lesotho did not have ADM2 shapefiles from a standard source, so these outputs were processed to the ADM1 scale instead.

A set of continent-wide and region-wide posterior samples was assembled by summing across a single random posterior predictive sample of each country model; mean and 95% CrIs were then calculated from the summed results, thus yielding 4,000 continent-wide and region-wide posterior predictive samples that are internally consistent.

Multiple outputs were calculated at continent, regional, ADM0 and ADM2 scales for each model in postprocessing: mean annual incidence (cases per year); mean annual incidence rate (cases per population per year); IRR (calculated as mean annual incidence rate in 2016–2020 divided by that in 2011–2015); assignment of ADM2 units to 5-year and 10-year incidence categories; and population in ADM2 units in 5-year and 10-year incidence categories. Mean annual incidence and mean annual incidence rate posterior mean and 95% CrI estimates were calculated across 4,000 posterior predictive samples at the relevant spatial scale. To estimate IRRs, the mean and 95% CrIs were calculated across all pairwise ratios of 4,000 posterior predictive samples in each period (that is, evaluated across the distribution of 4,000² samples—all 2016–2020 samples were pairwise-divided by all 2011–2015 samples). IRRs were deemed statistically significant if the 95% CrIs did not cross 1.

The number of people living in ADM2 units by 5-year incidence categories at continent and region scales was estimated across the summed 4,000 continent-wide and region-wide posterior samples described above (corresponds to results in Fig. 3a and Extended Data Fig. 5). The mean and 95% CrIs for each 5-year incidence category were calculated across the 4,000 posterior samples. Consequently, the variability in these estimates reflects variation in ADM2 incidence category assignment across samples.

Assignment of ADM2 units to specific 5-year incidence categories was performed in a two-step procedure (corresponds to results in Fig. 3b and Extended Data Fig. 6). First, we determined incidence categories for each 20-km × 20-km modeling grid cell and posterior sample and retained the highest incidence category encompassing at least 10% of the unit’s population or 100,000 people in 2020. Then, ADM2 units were assigned to the lowest incidence category with at least 50% posterior cumulative probability of assignment at or above that level (for example, ADM2 unit was assigned to the 50–100 cases per 100,000 category if ≥50% of posterior samples categorized the ADM2 unit in the 50–100 or ≥100 cases per 100,000 people categories). Thus, the assignment of an ADM2 unit to an incidence category already factors in the variability in assignment across samples. We note that summing the number of people in all ADM2 units by 5-year incidence category would not yield the same number as the continent-wide mean estimate of population living in the ADM2 category.

Assignment of ADM2 units to 10-year incidence categories was based directly on their 2011–2015 and 2016–2020 incidence category assignments. We defined four 10-year incidence categories: ‘sustained high’ for ADM2 units classified as high incidence (≥10 cases per 100,000 people per year) in both periods; ‘history of high’ for ADM2 units classified as high incidence in at least one period; ‘sustained low’ for ADM2 units classified as <1 case per 100,000 per year in both periods; and ‘history of moderate’ for all other combinations.

Consideration of border effects between countries

We ran our statistical model on each country separately, which could lead to border effects in our mean annual incidence estimates between countries. Independent country models sometimes generated estimates for the same border grid cell, in which case we merged the estimates taking the mean across posterior samples from the model runs of each neighboring country. During model validation, we performed a close examination of border estimates out of concern that country-level modeling may introduce artificial edge effects and did not observe outlier or unusual estimates at country borders (Fig. 1b).

Although we did not explicitly model cross-border transmission, cross-border regions of higher cholera incidence were able to be captured directly from subnational model input data (Extended Data Figs. 1 and 2). For example, the modeled estimates appear to identify contiguous high-incidence areas across several country borders, including Nigeria–Cameroon–Chad, South Sudan–Kenya and DRC–Zambia. We, however, note that some cross-country borders had notable discontinuities in mean annual incidence estimates (for example, the Southern Malawi–Northern Mozambique border). These could be due to real differences in underlying cholera incidence or due to differences in cholera reporting between administrative areas. As such, these discontinuities are valuable opportunities for further investigations of the spatial patterns of cholera incidence and reporting in Africa, which we intend to exploit in future work.

Definition of 5-year and 10-year incidence categories

We identified six categories pertaining to the 5-year mean annual incidence: <1, [1, 10), [10, 20), [20, 50), [50, 100) and ≥100 cholera cases per 100,000 people per year. For a given posterior sample, ADM2 units were assigned to the most severe 5-year incidence category where at least 10% of the unit’s population or 100,000 people (in 2020-adjusted population sizes) were living, according to the modeled 20-km × 20-km grid cell estimates (see Methods, ‘Incidence modeling postprocessing’ for details). This incidence category assignment procedure was designed to identify locations that may make high-impact targets for public health intervention. We used the labels ‘very high incidence’ to refer to the category of ≥100 cases per 100,000 population per year, ‘high incidence’ for ≥10 per 100,000 and ‘low incidence’ for <1 per 100,000.

We also used a 10-year incidence categorization at the ADM2 level that combines the incidence categories in each 5-year period (2011–2015 and 2016–2020). We defined four 10-year incidence categories: ‘sustained high’ for ADM2 units classified as high incidence (≥10 cases per 100,000 people per year) in both periods; ‘history of high’ for ADM2 units classified as high incidence in at least one period; ‘sustained low’ for ADM2 units classified as <1 case per 100,000 per year in both periods; and ‘history of moderate’ for all other combinations.

People living in 5-year incidence categories relied on the mean population estimate across the relevant 5-year period. We used 2020 population estimates for people living in 10-year incidence categories and the analysis of potential intervention reach in order to facilitate comparison of epidemiologic changes over time.

Statistical analysis of 2022–2023 cholera occurrence

We evaluated the association between the 10-year incidence categories (from 2011 to 2020) and cholera occurrence as reported in 2022–2023 WHO external cholera situation reports 1 through 10 (refs. ^{1,59,60,61,62,63,64,65,66,67}), complemented with country-specific situation reports for areas not displayed in WHO reports^{68,69,70,71,72} (Supplementary Table 6). The extracted data corresponded primarily to cholera reported between January 2022 and December 2023, with limited additional data in surrounding months due to the temporal reporting resolution. The 2022–2023 period was selected for analysis because it corresponded to the WHO emergency declaration period, and centralized and official WHO datasets were available for data extraction. For this analysis, locations were determined to have cholera if one or more suspected cholera cases were reported in any of the above-described situation reports.

Fourteen situation report documents (Supplementary Table 6) with map images of cholera occurrence in the post-2020 period were loaded into QGIS geographic information system software (version 3.28.12) and overlaid with the standardized set of country-unified shapefiles (Supplementary Table 5). Each image was georeferenced to the country-unified shapefiles as a basemap with country borders as control points. After aligning the administrative unit boundaries, we manually added centroids to extract point locations for each administrative unit and added attributes to identify the administrative unit level, confidence about the certainty of the administrative unit level, presence of reported cholera cases and the time range represented by the map.

We then spatially joined extracted locations with cholera occurrence to the set of unique ADM2 units used to summarize the cholera incidence mapping results. Cholera occurrence was extrapolated to ADM2 units if cholera was reported in ADM3 scale units or below.

Occurrence model equations

As the situation reports provided limited subnational case data, we evaluated the association between 10-year incidence categories and recent cholera occurrence (binary outcome) in a Bayesian modeling framework. The model consisted of a hierarchical logistic regression that accounted for the probability of failing to detect cholera (false negatives), reporting of cases at multiple administrative levels as well as partial pooling of country-specific parameter values at the regional and continental levels. Inference was performed with HMC as implemented in the Stan programming language⁵⁵.

We first describe a base statistical model and add hierarchical spatial complexity to complete the full model description.

Base statistical model

This analysis aimed to estimate the association between 10-year incidence categories and the probability of reporting suspected cholera occurrence in the 2022–2023 period. For all locations that were modeled, those that reported cholera were indexed with j, and those that did not report cholera were indexed with k.

For ADM2 locations that reported cholera occurrence, the likelihood is:

$$L(\,{y}_{\!j,A2}=1)={p}_{\!j,A2}{\phi }_{j,A2},$$

and

$${\rm{logit}}\left({p}_{\!j,A2}\right)=\alpha +{\beta }_{j,A2},$$

where ${y}_{\!j,A2}$ is the reported cholera occurrence status extracted from the situation report documents in location j, which is at the ADM2 level ($A2$); ${p}_{\!j,A2}$ is the probability of true cholera occurrence; ${\phi }_{j,A2}$ is the probability of reporting cholera if it is present (sensitivity of cholera detection); $\alpha$ is the model intercept; and ${\beta }_{j,A2}$ is the effect of the 10-year incidence category in the ADM2 unit. Notably, the model assumes that all reported cholera is a true instance of cholera occurrence (that is, no false positives).

As the absence of reported occurrence may be due to lack of cholera occurrence or lack of reporting, we treated the absence of reported occurrence as missing data and marginalized out all possible reporting statuses to estimate the underlying true cholera occurrence status. For ADM2 locations that did not report cholera, the likelihood reads:

$$L(\,{y}_{k,A2}=0)={p}_{k,A2}(1-{\phi}_{k,A2})+(1-{p}_{k,A2})=1-{p}_{k,A2}{\phi}_{k,A2},$$

and

$${\rm{logit}}\left({p}_{k,A2}\right)=\alpha +{\beta }_{k,A2},$$

where ${p}_{k,A2}$ is the probability of true cholera occurrence in ADM2 unit $k$; $(1-{\phi }_{k,A2})$ is the probability of not reporting cholera if it is indeed present; and ${\beta }_{k,A2}$ is the effect of the 10-year incidence category in the ADM2 unit.

Reports of cholera occurrence in ADM2 locations could, therefore, be modeled with a Bernoulli distribution:

$${y}_{i,A2}\sim {\rm{Bernoulli}}\left({p}_{i,A2}{\phi }_{i,A2}\right),$$

where i represents any location regardless of cholera reporting status.

Adding higher administrative unit level observations

As some occurrence data were available only at the ADM1 or ADM0 (country) level, these observations were integrated into the model:

$${y}_{i,A < 2}\sim {\rm{Bernoulli}}({\eta }_{i,A < 2}),$$

and

$${\eta }_{i,A < 2}=1-\prod _{i,A2\in S,A2}1-{p}_{i,A2}{\phi }_{i,A2},$$

where S,A2 represents the set of i,A2 ADM2 units contained within the location i,A < 2, which is above the ADM2 level, and ${\eta }_{i,A < 2}$ is the probability of reported cholera occurrence in the higher administrative unit level location i,A < 2.

Hierarchical country-level and region-level priors

We assumed that the association between 10-year incidence categories and the probability of cholera occurrence may vary across countries and regions (for example, eastern Africa). We accounted for these geographic differences by setting hierarchical priors, such that priors for the association of the 10-year incidence category and probability of true cholera occurrence in location i, which is contained within country $c$ and region $r$, were defined as:

$${\beta}_{c}^{m}{\sim}{\rm{Normal}}\left(\,{\mu}_{\beta ,r}^{m},{\sigma}_{\beta,r}^{m}\right),$$

and

$${\mu }_{\beta ,r}^{m}\sim {\rm{Normal}}\left({\mu }_{\beta }^{m},{\sigma }_{\beta }^{m}\right),$$

where $m$ denotes the 10-year incidence category associated with location i; ${\mu }_{\beta ,r}^{m}$ and ${\sigma }_{\beta ,r}^{m}$ are regional-level mean and s.d. of the 10-year incidence category effect $\beta$; and ${\mu }_{\beta }^{m}$ and ${\sigma }_{\beta }^{m}$ are hyperpriors for the mean and s.d. of the 10-year incidence category effect.

Hierarchical priors were also assumed for the cholera detection sensitivity parameters $\phi$, which had analogous relationships on a logit scale:

$${\rm{logit}}({\phi}_{c}){\sim}{\rm{Normal}}(\,{\mu}_{{\rm{logit}}(\phi),r},{\sigma}_{{\rm{logit}}(\phi),r}),$$

and

$${\mu }_{{\rm{logit}}(\phi ),r}\sim {\rm{Normal}}\left({\mu }_{{\rm{logit}}\left(\phi \right)},{\sigma }_{{\rm{logit}}\left(\phi \right)}\right).$$

Model priors and hyperpriors

We used the following priors:

$${\sigma }_{\beta ,r}^{m}\sim {{\rm{Half}}\; {\rm{normal}}}\left(0,2.5\right),$$

$${\mu }_{\beta }^{m}\sim {\rm{Normal}}\left(0,2\right),$$

$${\sigma }_{\beta }^{m}\sim {{\rm{Half}}\; {\rm{normal}}}\left(0,2.5\right),$$

$${\sigma }_{{\rm{logit}}(\phi ),r}\sim {{\rm{Half}}\; {\rm{normal}}}\left(0,1\right),$$

$${\mu }_{{\rm{logit}}(\phi )}\sim {\rm{Normal}}\left(1.5,5\right),$$

$${\sigma }_{{\rm{logit}}(\phi )}\sim {{\rm{Half}}\; {\rm{normal}}}\left(0,1\right).$$

Assessing potential intervention reach when prioritizing targets by cholera incidence

We assessed the potential reach (best-case scenario) of non-specific interventions when targeted based on cholera incidence through two analyses. Both analyses prioritized ADM2 units by decreasing incidence category and decreasing population size within incidence categories as a simplification of how intervention targets might be prioritized using cholera incidence data. Here, the population targeted by interventions was calculated as the sum of the targeted ADM2 unit populations, adjusted for 2020 population size. The analyses then examined what proportion of (1) mean annual cholera cases or (2) population living in ADM2 units with 2022–2023 cholera occurrence would have been reached under different targeting strategies. ‘Prospective’ targeting used past incidence categories to target future interventions, whereas ‘oracle’ targeting prioritized interventions based on incidence categories from the same period.

In the first analysis, we assessed the proportion of mean annual 2016–2020 cholera cases that would have been reached by interventions had 2011–2015 (‘prospective’) or 2016–2020 (‘oracle’) incidence categories been used for targeting. We also assessed the proportion of mean annual 2011–2015 cholera cases that would have been reached by interventions had 2011–2015 incidence categories been used for targeting (‘oracle’ only).

In the second analysis, we examined the proportion of population living in ADM2 units with modeled 2022–2023 cholera occurrence (modeled according to the above-described statistical analysis) that would have been reached by interventions had 2011–2015, 2016–2020 and 2011–2020 incidence categories (‘prospective’) been used for targeting. These three strategies were compared to an ‘oracle’ targeting strategy where ADM2 units with 2022–2023 cholera occurrence were ranked in decreasing order of population size.

Inclusion and ethics statement

The institutional review board (IRB) at Johns Hopkins Bloomberg School of Public Health (BSPH) determined that secondary analysis of data from the global cholera incidence database was exempt (BSPH IRB no. 27682), and no other institutional approvals were sought.

Twelve authors (J.P.M.L., R.C., P.W.O., G.B., L.E., A.V.N., N.F.M., E.W.O., S.Y., F.K., S.O.O. and A.J.S.) contribute to public health activities in cholera-affected LMICs. They provided feedback on the interpretation and application of this work based on their expertise in cholera surveillance and control in LMIC contexts. We fully endorse the Nature Portfolio journals’ guidance on LMIC authorship and inclusion and are strongly committed to the inclusion of more researchers and decisionmakers from LMICs in future related work.

Policymakers in the Africa region may use the data from this study to identify areas with historically sustained, sporadic and limited cholera activity, which may inform future cholera control planning and serve as a benchmark for measuring progress in cholera control efforts. These burden maps may also be used to identify cross-border areas that would benefit from enhanced regional coordination and surveillance.