Socio-demographic and behavioral characteristics of the study subjects
Overall, 907 presumptive TB cases aged ≥15 years were prospectively enrolled in the current study. The mean age of the study participants was 43.6 ± 17.7 years. Among the participants, 479 (52.8%) were men, 494 (54.5%) lived in rural areas, and 371 (40.9%) were unable to read and write. One hundred and fourteen (12.6%) patients had a history of TB contact, and 130 (14.3%) had a history of imprisonment (Table 1).
Clinical characteristics of the study subjects
Among the 907 respondents, 787 (86.8%) reported cough for at least 14 days, 559 (61.6%) reported fever, 603 (66.5%) reported weight loss, and 658 (72.5%) reported night sweats (Table 2). Additionally, 725 (79.9%) patients experienced fatigue, 160 (17.6%) had pallor, and 251 (27.7%) had crepitation. Among all presumptive TB cases, 155 (17.1%) were confirmed to have PTB.
Five hundred and sixty-five (62.3%) patients received antibiotics before the sputum smear test, of whom 117 (20.7%) had PTB. In total, 138 (15.2%) study participants were positive for human immunodeficiency virus (HIV), and 17 (1.9%) had diabetes mellitus. However, 57.3% and 59.8% of the participants were unaware of their HIV and diabetes mellitus statuses, respectively (Table 3).
Variable selection
We used LASSO regression to select candidate variables. Initially, we fitted the LASSO logistic regression with 27 variables (sociodemographic, clinical, and risk factors). Next, we used a minimum lambda to shrink the coefficients (Fig. 1). Finally, we reduced the number of features to ten using one standard error value of lambda (lambda.1se) and produced a parsimonious and interpretable model. Our cross-validation yielded an accuracy of 83.4%. The minimum lambda (λ) value was 0.0046, with a log (λ) value of -5.37.
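The choice between the minimum lambda and lambda.1se can be illustrated with a minimal sketch of the usual one-standard-error rule: take the largest λ whose cross-validated error is within one standard error of the minimum error. The function name `select_lambdas` is hypothetical, and apart from the reported minimum λ of 0.0046, every number in the example is invented for illustration and is not taken from the study.

```python
from typing import Sequence, Tuple


def select_lambdas(
    lambdas: Sequence[float],
    cv_mean: Sequence[float],
    cv_se: Sequence[float],
) -> Tuple[float, float]:
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min minimizes the mean cross-validated error.
    lambda_1se is the largest lambda whose mean error does not exceed
    the minimum error plus one standard error of that minimum.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean) if err <= limit
    )
    return lambdas[i_min], lambda_1se


# Hypothetical cross-validation results (illustration only).
# Only 0.0046 corresponds to a value reported in the text.
lambdas = [0.0010, 0.0046, 0.0100, 0.0300, 0.0800]
cv_mean = [0.720, 0.700, 0.705, 0.712, 0.760]
cv_se = [0.015, 0.015, 0.015, 0.016, 0.018]

lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Because a larger λ applies a stronger penalty, lambda.1se typically retains fewer predictors than the minimum lambda, which is the trade-off behind the more parsimonious model.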
Model development
Ten variables were fitted in the multivariable binary logistic regression model. These were age, cough severity, loss of appetite, number of classical TB symptoms, antibiotic trial, history of TB contact, history of imprisonment, chronically sick-looking appearance, pallor, and presence of dull sounds (Table 4). In the adjusted analysis, all variables, except for history of imprisonment, were independently associated with PTB (p < 0.05).
Risk score derivation
The predicted risk of PTB in presumptive TB cases was calculated with the following formula: predicted risk = \(1/(1+e^{-\text{Z}})\), where ‘e’ is the base of natural logarithms and Z is the sum of scores, including the intercept:
$$\begin{aligned} \text{Z} ={} & (-5.62) + (3 \times \text{age 15–24 years}) + (2 \times \text{age 25–34 years}) + (2 \times \text{severity of cough}) + (2 \times \text{appetite loss}) \\ & + (3 \times \text{number of symptoms}) + (1 \times \text{antibiotic trial}) + (2 \times \text{history of TB contact}) \\ & + (1 \times \text{chronically sick-looking}) + (1 \times \text{pallor}) + (1 \times \text{dull sound}) \end{aligned}$$
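As a minimal sketch of how this formula could be applied, the Python snippet below sums the published weights and passes Z through the logistic function. It assumes that every predictor is coded as a 0/1 indicator, which the text does not state explicitly (the coding of "number of symptoms" and "severity of cough" is defined in the tables, not here). The dictionary keys and function names are hypothetical labels chosen for illustration.

```python
import math

# Intercept and weights copied from the equation for Z.
INTERCEPT = -5.62
WEIGHTS = {
    "age_15_24": 3,
    "age_25_34": 2,
    "severity_of_cough": 2,
    "appetite_loss": 2,
    "number_of_symptoms": 3,
    "antibiotic_trial": 1,
    "history_of_tb_contact": 2,
    "chronically_sick_looking": 1,
    "pallor": 1,
    "dull_sound": 1,
}


def linear_predictor(patient: dict) -> float:
    """Compute Z: the intercept plus the weights of the predictors present.

    Assumption: each predictor is a 0/1 indicator; missing keys count as 0.
    """
    return INTERCEPT + sum(
        weight * patient.get(name, 0) for name, weight in WEIGHTS.items()
    )


def predicted_risk(patient: dict) -> float:
    """Predicted risk = 1 / (1 + e^(-Z))."""
    z = linear_predictor(patient)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical patient: aged 15-24 years, TB contact, and pallor.
example = {"age_15_24": 1, "history_of_tb_contact": 1, "pallor": 1}
print(round(linear_predictor(example), 2), round(predicted_risk(example), 3))
```

A patient with none of the predictors has Z equal to the intercept alone, which corresponds to a predicted risk well below 1%.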
To convert the raw score into a user-friendly scale, we used a simple linear scaling:
$$\text{Transformed score} = \text{K} \times ({\upbeta}_{0} + {\upbeta}_{1}{\text{X}}_{1} + \dots + {\upbeta}_{\text{n}}{\text{X}}_{\text{n}}) + \text{C}$$
Where \(({\upbeta}_{0} + {\upbeta}_{1}{\text{X}}_{1} + \dots + {\upbeta}_{\text{n}}{\text{X}}_{\text{n}})\) is the raw log-odds score, which is equal to Z in the preceding equation; ‘K’ is a scaling factor, and ‘C’ is a constant to shift the score. In our risk score, we assigned 1 for K and 5.62 for C, so that the constant cancels the intercept and the transformed score equals the sum of the points. The possible sum of the scores for an individual in the dataset ranged from zero to 15. The best cutoff score (threshold) for our model was 8.5. This was obtained at a Youden index of 51.5%.
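The scaling step can be sketched in a few lines of Python, reading K as a multiplicative scaling factor (as the description of K suggests) and C as an additive shift. With K = 1 and C = 5.62, the shift cancels the intercept of -5.62, so a patient with no predictors scores zero. The function names are hypothetical.

```python
K = 1.0        # scaling factor reported in the text
C = 5.62       # shift constant; cancels the intercept of -5.62
CUTOFF = 8.5   # best cutoff score reported in the text


def transformed_score(z: float, k: float = K, c: float = C) -> float:
    """Linear rescaling of the raw log-odds score Z: k * z + c."""
    return k * z + c


def risk_group(score: float, cutoff: float = CUTOFF) -> str:
    """Classify a transformed score as high risk (>= cutoff) or low risk."""
    return "high" if score >= cutoff else "low"


# Z for a patient with no predictors is the intercept, so the score is 0.
print(round(transformed_score(-5.62), 2))

# Hypothetical patient whose points sum to 9 (Z = 9 - 5.62 = 3.38).
score = transformed_score(3.38)
print(round(score, 2), risk_group(score))
```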
Nomogram for pulmonary TB
This nomogram calculates the risk of PTB in presumptive TB cases based on significant predictors. In this nomogram, once a care provider determines the patient-specific parameters, the patient’s total points are determined by adding the points earned for the values of each variable. Then, the probability of PTB in the patient can be ascertained from the total points (Fig. 2).
Model performance (discrimination and calibration)
The discrimination power (AUC) of the model for predicting PTB was 0.835 (95% CI: 0.80–0.87) (Fig. 3A). In the calibration plot, the predicted risks overlapped with the observed proportion of PTB (Fig. 3B), with a calibration slope of 0.98 (95% CI: 0.83, 1.17) and a calibration intercept of 0.001 (95% CI: -0.02, 0.02). At the cutoff score of ≥ 8.5, the simplified risk score showed moderate discrimination, with an AUC of 0.82 (95% CI: 0.78–0.85).
ROC curve and model calibration plot. In Fig. 3A, the ROC curve for the model is above the curve for the simplified risk score. In Fig. 3B, the red line represents perfect calibration, the blue line corresponds to the calibration of the model, the black line corresponds to a smoothed (Loess) calibration, and the gray region corresponds to the 95% confidence interval of the Loess calibration.
In addition, at the threshold score of ≥ 8.5 (Youden’s index of 51.5%), our risk score had a sensitivity of 82.6% (95% CI, 75.7–88.2%) and a specificity of 68.9% (95% CI, 65.4–72.2%) (Table 5). According to the cutoff score, 362 (39.9%) of the study participants were at high risk (score ≥ 8.5), and 545 (60.1%) were at low risk (score < 8.5). The proportions of patients with PTB were 35.4% and 5.0% among high-risk and low-risk patients, respectively.
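The reported accuracy measures are mutually consistent, which can be checked with a short sketch. The 2×2 counts below are not taken from Table 5; they are back-calculated here from the reported totals (907 participants, 155 PTB cases, 362 high-risk and 545 low-risk patients) and the reported sensitivity, and should be read as an assumption. The function name is hypothetical.

```python
def accuracy_measures(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive standard diagnostic accuracy measures from a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Youden's index J = sensitivity + specificity - 1
        "youden": sensitivity + specificity - 1,
        # Proportion with PTB among high-risk patients (score >= 8.5)
        "ptb_high_risk": tp / (tp + fp),
        # Proportion with PTB among low-risk patients (score < 8.5)
        "ptb_low_risk": fn / (fn + tn),
    }


# Back-calculated counts (assumption, not quoted from the paper):
# 128 + 27 = 155 PTB cases; 128 + 234 = 362 high risk; 27 + 518 = 545 low risk.
measures = accuracy_measures(tp=128, fn=27, fp=234, tn=518)
for name, value in measures.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under these counts, the sketch reproduces the sensitivity, specificity, Youden's index, and the PTB proportions in the two risk groups to one decimal place.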
Model validation (Internal)
We performed bootstrap validation and determined the optimism-corrected performance of the model. The AUC for our model was 0.835 (95% CI, 0.80–0.871) after resampling. The internal validation had a mean absolute error of 0.014 and a mean squared error of 0.00027. The bias-corrected calibration had an intercept of -0.1025 and a slope of 0.905 (Supplementary Fig. S1). The Somers’ delta (Somers’ D) value for the resampled data was 0.636, which was approximately equal to that of the original data (Somers’ D = 0.67). In addition, the model had an optimism coefficient of 0.0339, indicating an internally valid prediction.
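For a binary outcome, Somers' D and the AUC are linked by D = 2 × (AUC − 0.5), so the values above can be cross-checked with a small sketch. Interpreting the optimism as the gap between the apparent and the resampled Somers' D is an assumption made here; the text does not state how the optimism coefficient was computed. The function name is hypothetical.

```python
def somers_d_from_auc(auc: float) -> float:
    """Somers' D (Dxy) for a binary outcome: 2 * (AUC - 0.5)."""
    return 2.0 * (auc - 0.5)


APPARENT_AUC = 0.835   # reported AUC of the model
RESAMPLED_D = 0.636    # reported Somers' D after bootstrap resampling

apparent_d = somers_d_from_auc(APPARENT_AUC)

# Assumed definition: optimism = apparent index - bias-corrected index.
optimism = apparent_d - RESAMPLED_D

print(round(apparent_d, 3), round(optimism, 3))
```

The apparent value agrees with the reported Somers' D of 0.67, and the difference is close to the reported optimism of 0.0339.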
Clinical utility
According to the decision curve analysis (DCA), the curve for the model was higher and further to the right, indicating greater benefits of the model across a wide range of threshold values. The optimal threshold that maximized the net benefit was between 0.1 and 0.2. In this range, the model curve lies above both the treat-all and treat-none lines. Thus, treating high-risk patients identified by the model can lead to favorable outcomes (Fig. 4).
Decision curve plotting net benefit against threshold probability. The net benefit of the model for predicting PTB among presumptive cases at different thresholds is presented in bold rose color, along with its 95% confidence interval. The other two lines represent intervention for all (thin gray line) and intervention for none (black line). Cost-benefit ratio: The cost-benefit ratio is extremely low at a low threshold probability. At large values, there is no cost for treatment; all patients choose to be treated regardless of the TB risk.
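The quantity plotted in a decision curve is the net benefit, conventionally defined as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below applies this standard definition to the two reference strategies using the reported prevalence (155 of 907). It does not reproduce the model curve in Fig. 4, which would require the individual predicted probabilities; the function names are hypothetical.

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt)."""
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence: float, threshold: float) -> float:
    """Treat-all strategy: every case is a TP, every non-case is an FP."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


N_TOTAL = 907   # presumptive TB cases enrolled
N_PTB = 155     # confirmed PTB cases
prevalence = N_PTB / N_TOTAL

# Treat-none always has a net benefit of zero.
for pt in (0.05, 0.10, 0.15, 0.20):
    nb_all = net_benefit_treat_all(prevalence, pt)
    print(f"threshold={pt:.2f}  treat-all={nb_all:+.4f}  treat-none=+0.0000")
```

The treat-all line crosses zero when the threshold equals the prevalence (about 0.17 here), so a model adds value in a given range only if its net benefit exceeds both reference strategies there.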



