Probabilistic forecasting of hand, foot and mouth disease in Mainland China using Bayesian additive regression tree model

HFMD surveillance data

Previous studies have divided mainland China into seven geographic regions: Northeast, North, Northwest, East, Middle, Southwest, and South1. In this study, we selected one study area from each of the seven regions: Heilongjiang (Northeast), Shanxi (North), Gansu (Northwest), Shandong (East), Henan (Middle), Sichuan (Southwest), and Guangdong (South). Monthly HFMD case data for these areas from June 2008 to December 2018 were obtained from the Chinese Center for Disease Control and Prevention through the Public Health Sciences Data Center (https://www.phsciencedata.cn/).

We fitted the model using the training set (June 2008 to December 2017, 115 months) and evaluated the forecasting accuracy of the model using the test set (January 2018 to December 2018, 12 months).

Methods

BART

Bayesian Additive Regression Trees (BART)20 is a 'sum-of-trees' model in which each tree is constrained by a regularization prior. The model is expressed as follows:

$$Y = \sum_{i = 1}^{m} g\left( X, T_{i}, M_{i} \right) + e, \quad e \sim N\left( 0, \sigma_{e}^{2} \right)$$

(1)

where \(T_{i}\) represents a binary regression tree, comprising splitting rules and a set of terminal nodes, and \(M_{i}\) denotes the parameter values associated with the terminal nodes of \(T_{i}\). Each tree fits the residuals left unexplained by the other trees, so the final prediction is the sum of the predictions of all the trees.
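The 'sum-of-trees' prediction can be illustrated with a minimal sketch. The two hand-made step functions below stand in for fitted trees \(g(X, T_i, M_i)\); they are illustrative placeholders, not actual BART trees.

```python
# Toy illustration of the 'sum-of-trees' idea: each tree g(X, T_i, M_i) maps an
# input to a terminal-node value, and the prediction is the sum over all trees.
# The two trees here are hand-made stand-ins, not fitted BART trees.

def tree_1(x):
    # One split at x < 0.5; terminal-node values 1.0 and 2.0
    return 1.0 if x < 0.5 else 2.0

def tree_2(x):
    # One split at x < 0.2; terminal-node values -0.5 and 0.5
    return -0.5 if x < 0.2 else 0.5

def sum_of_trees(x, trees):
    # The final predicted value is the sum of all trees' contributions
    return sum(g(x) for g in trees)

print(sum_of_trees(0.3, [tree_1, tree_2]))  # 1.0 + 0.5 = 1.5
```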

By assuming that the trees, the terminal-node parameters of each tree, and \(\sigma_{e}^{2}\) are independent, priors can be set for \(p\left( T_{i} \right)\), \(p\left( M_{i} | T_{i} \right)\), and \(p\left( \sigma_{e} \right)\). The prior \(p\left( T_{i} \right)\) covers the depth of the tree, the splitting rules, and the splitting values. So that a node becomes more likely to be terminal as tree depth grows, the authors assume that the probability of a node at depth \(d\) splitting further is \(\alpha \left( 1 + d \right)^{-\beta}\). For splitting rules, a uniform prior over the available variables is used, and for splitting values, a uniform prior over the discrete set of available splitting values. The prior for the terminal-node parameters \(p\left( M_{ij} | T_{i} \right)\) is a normal distribution, and \(p\left( \sigma_{e} \right)\) is given a conjugate inverse chi-square distribution.
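The depth-penalizing split probability \(\alpha(1+d)^{-\beta}\) is easy to evaluate directly; the sketch below uses the default hyperparameters reported later in this section (\(\alpha = 0.95\), \(\beta = 2\)).

```python
def split_prob(d, alpha=0.95, beta=2.0):
    # Probability that a node at depth d splits further: alpha * (1 + d)^(-beta).
    # It decreases with depth, so deeper nodes are more likely to be terminal.
    return alpha * (1.0 + d) ** (-beta)

for d in range(4):
    print(d, round(split_prob(d), 4))
# depth 0: 0.95, depth 1: 0.2375, depth 2: ~0.1056, depth 3: ~0.0594
```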

The posterior distribution is then generated conditional on the observed data Y. Point estimates are obtained as the mean of the samples drawn by the back-fitting MCMC algorithm, and interval estimates are obtained from the percentiles of these samples.
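Summarizing posterior draws into a point estimate and a percentile interval can be sketched as follows. The simulated draws below are a stand-in for the sampler's actual output.

```python
import random
import statistics

random.seed(0)
# Stand-in for posterior draws of one predicted value collected from the
# back-fitting MCMC sampler (in practice, one draw per kept iteration).
draws = sorted(random.gauss(100.0, 10.0) for _ in range(1000))

point_estimate = statistics.mean(draws)    # point estimate: posterior mean
lower = draws[int(0.025 * len(draws))]     # 2.5th percentile -> lower bound
upper = draws[int(0.975 * len(draws))]     # 97.5th percentile -> upper bound

print(round(point_estimate, 1), round(lower, 1), round(upper, 1))
```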

ARIMA

The Auto-regressive Integrated Moving Average (ARIMA) model is the classic model for time series forecasting24. It combines autoregressive and moving average terms. When the time series contains seasonal features, the Seasonal Auto-regressive Integrated Moving Average model is required for forecasting. The models without and with seasonal features can be expressed as ARIMA(p, d, q) and ARIMA(p, d, q) × (P, D, Q)s, respectively, where p and q denote the orders of the non-seasonal auto-regressive and moving-average terms, P and Q denote the orders of the seasonal auto-regressive and moving-average terms, d and D denote the numbers of non-seasonal and seasonal differences, and s denotes the length of the seasonal period.

Before constructing the ARIMA model, we assessed stationarity using the augmented Dickey-Fuller (ADF) test and applied differencing when non-stationarity was detected. For series exhibiting significant heteroskedasticity (determined by residual diagnostics), we stabilized the variance with a Box-Cox transformation. After model fitting, we applied the Ljung-Box test for residual autocorrelation to verify that the model met the necessary assumptions. To maintain comparability with the BART model, we employed a rolling prediction strategy: at each step, a fixed window of historical data was used to generate forecasts for the test set on a rolling basis.
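The fixed-window rolling strategy can be sketched as below. `fit_and_forecast` is a hypothetical stand-in for refitting the model (e.g. ARIMA) on the current window and forecasting one step ahead; here it simply returns the window mean so the loop is runnable.

```python
def fit_and_forecast(window):
    # Hypothetical placeholder: in practice, refit the model on `window`
    # and return its one-step-ahead forecast.
    return sum(window) / len(window)

def rolling_forecast(series, train_len, horizon):
    preds = []
    for h in range(horizon):
        # Fixed-length window slides forward one step per forecast, so each
        # forecast uses the most recent train_len observations.
        window = series[h:h + train_len]
        preds.append(fit_and_forecast(window))
    return preds

series = list(range(1, 16))  # 15 dummy monthly case counts
print(rolling_forecast(series, train_len=12, horizon=3))  # [6.5, 7.5, 8.5]
```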

We constructed ARIMA and BART models for each region. For the ARIMA model, the optimal specification was selected using the Bayesian Information Criterion (BIC); a smaller BIC indicates a better fit to the data. For the BART model, we first applied a Box-Cox transformation to normalize the count data and incorporated 12-month lagged terms as model inputs to explicitly account for seasonal patterns. The model used the recommended default hyperparameters (\(\alpha\) = 0.95, \(\beta\) = 2), followed by comprehensive tuning via grid search over: (i) number of trees (ntree: 50–500); (ii) branching factor (k: 1–4); (iii) burn-in iterations (nskip: 50–300); (iv) posterior draws (ndpost: 300–1200). All statistical analyses were completed in R (v4.1.2), using the forecast package for ARIMA and the BART package for the BART implementation.
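The grid search over the four hyperparameters can be sketched as below (the paper's analyses were in R; this is a language-agnostic illustration). `score` is a hypothetical placeholder for fitting BART with the given settings and returning a validation error; the candidate grids are assumed subsets of the ranges stated above.

```python
import itertools

# Candidate values drawn from the ranges described in the text (assumed grids).
grid = {
    "ntree":  [50, 100, 200, 500],   # number of trees
    "k":      [1, 2, 3, 4],          # branching factor
    "nskip":  [50, 100, 300],        # burn-in iterations
    "ndpost": [300, 600, 1200],      # posterior draws
}

def score(params):
    # Hypothetical placeholder objective; in practice, replace with the
    # validation RMSE of a BART model fitted with these settings.
    return abs(params["ntree"] - 200) + abs(params["k"] - 2)

# Exhaustively evaluate every combination and keep the best-scoring one.
best = min(
    (dict(zip(grid, vals)) for vals in itertools.product(*grid.values())),
    key=score,
)
print(best)
```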

Evaluation metrics

For the accuracy of point forecasts, we used the root mean square error (RMSE), mean absolute percentage error (MAPE), and Pearson correlation coefficient (PCC) to evaluate model performance. A model with smaller RMSE and MAPE and larger PCC performs better.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} \left( y_{i} - y_{i}^{*} \right)^{2}}$$

(2)

$$MAPE = \frac{100\%}{n}\sum_{i = 1}^{n} \left| \frac{y_{i}^{*} - y_{i}}{y_{i}} \right|$$

(3)

$$PCC = \frac{\sum_{i = 1}^{n} \left( y_{i} - \overline{y} \right)\left( y_{i}^{*} - \overline{y}^{*} \right)}{\sqrt{\sum_{i = 1}^{n} \left( y_{i} - \overline{y} \right)^{2}} \sqrt{\sum_{i = 1}^{n} \left( y_{i}^{*} - \overline{y}^{*} \right)^{2}}}$$

(4)

where \(y_{i}\) is the actual value, \(y_{i}^{*}\) is the forecast value, \(\overline{y}\) is the mean of the actual values, and \(\overline{y}^{*}\) is the mean of the forecast values.
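The three point-forecast metrics translate directly into code; the sketch below implements Eqs. (2)–(4) on a small made-up example (the values are illustrative, not study data).

```python
import math

def rmse(y, y_hat):
    # Root mean square error, Eq. (2)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    # Mean absolute percentage error, Eq. (3); assumes no zero actual values
    return 100.0 / len(y) * sum(abs((f - a) / a) for a, f in zip(y, y_hat))

def pcc(y, y_hat):
    # Pearson correlation coefficient, Eq. (4)
    n = len(y)
    my, mf = sum(y) / n, sum(y_hat) / n
    num = sum((a - my) * (f - mf) for a, f in zip(y, y_hat))
    den = (math.sqrt(sum((a - my) ** 2 for a in y))
           * math.sqrt(sum((f - mf) ** 2 for f in y_hat)))
    return num / den

y     = [100.0, 120.0, 90.0, 110.0]   # made-up actual monthly counts
y_hat = [ 98.0, 125.0, 95.0, 105.0]   # made-up forecasts
print(round(rmse(y, y_hat), 3), round(mape(y, y_hat), 3), round(pcc(y, y_hat), 3))
```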

For interval forecasting, the Prediction Interval Coverage Probability (PICP) reflects the probability that the actual value falls between the upper and lower limits of the prediction interval at a given confidence level \(\alpha\). The closer the PICP is to 1, the more reliable the prediction25. The Prediction Interval Normalized Average Width (PINAW) reflects the average width between the upper and lower limits of the prediction interval. When PICP values are equal, a smaller PINAW indicates better predictive performance. Since PICP and PINAW each capture only one aspect of interval quality, neither alone reflects the overall performance of interval prediction26. Therefore, we used the coverage width-based criterion (CWC) to compare the interval-prediction performance of the models. The smaller the CWC value, the better the model performs.

$$PICP = \frac{1}{\left| \varepsilon \right|}\mathop \sum \limits_{t \in \varepsilon } I\left( {\zeta_{t} \le y_{t} \le \mu_{t} } \right)$$

(5)

$$PINAW = \frac{1}{\left| \varepsilon \right| R}\sum_{t \in \varepsilon} \left( \mu_{t} - \zeta_{t} \right)$$

(6)

$$CWC = PINAW\left\{ 1 + I\left( PICP < \alpha \right)e^{-\lambda \left( PICP - \alpha \right)} \right\}$$

(7)

where \(\left\{ \left[ \zeta_{t}, \mu_{t} \right] \right\}_{t \in \varepsilon}\) denotes the set of forecast intervals, with \(\zeta_{t}\) and \(\mu_{t}\) the lower and upper bounds of the prediction interval, respectively. \(\varepsilon\) is the set of time points in the forecast horizon, \(R\) is the difference between the maximum and minimum observed values over \(\varepsilon\), and \(\lambda\) is the penalty parameter, set to 1 in this study.
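The three interval metrics can be sketched as below, implementing Eqs. (5)–(7) with the exponential penalty applied only when coverage falls below the nominal level; the intervals are made-up illustrative values.

```python
import math

def picp(y, lower, upper):
    # Eq. (5): fraction of observations covered by their prediction intervals
    return sum(l <= a <= u for a, l, u in zip(y, lower, upper)) / len(y)

def pinaw(y, lower, upper):
    # Eq. (6): average interval width normalized by the observed range R
    r = max(y) - min(y)
    return sum(u - l for l, u in zip(lower, upper)) / (len(y) * r)

def cwc(y, lower, upper, alpha=0.95, lam=1.0):
    # Eq. (7): penalize PINAW exponentially when PICP falls below alpha
    p, w = picp(y, lower, upper), pinaw(y, lower, upper)
    penalty = math.exp(-lam * (p - alpha)) if p < alpha else 0.0
    return w * (1.0 + penalty)

y     = [100.0, 120.0, 90.0, 110.0]   # made-up actual values
lower = [ 95.0, 110.0, 85.0, 100.0]   # made-up interval lower bounds
upper = [105.0, 130.0, 95.0, 115.0]   # made-up interval upper bounds
print(picp(y, lower, upper), round(pinaw(y, lower, upper), 4))
```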
