Comparing the accuracy of computer-aided detection (CAD) software and radiologists from multiple countries for tuberculosis detection in chest X-Rays

July 2, 2025

This analysis leveraged the CAD readings previously published in a complementary publication using the same evaluation dataset¹⁰,¹⁷. For the complementary publication, we contacted 13 CAD manufacturers with commercially available products for TB (according to ai4hlth.org) between January 2021 and December 2023, and 10 of the 13 contacted CAD manufacturers consented to participate¹⁷. These readings were utilized in this complementary analysis. Evaluated CAD products in this paper therefore include: CAD4TB [version 7] (Delft Imaging Systems, the Netherlands), ChestEye [version 2.4] (Oxipit.ai, Lithuania), Genki [version 20.12] (DeepTek, India), InferRead [version 1] (Infervision, China), JF CXR-2 [version 2] (JF Healthcare, China), Lunit INSIGHT CXR (Lunit) [version 4.9] (Lunit, South Korea), Nexus CXR [version 1.0] (Nexus, South Africa), qXR [version 3] (Qure.ai, India), RADIFY [version 3.5.0c] (Envisionit, South Africa), TiSepX-TB [version 1.0.0.0] (MedicalIP, South Korea), XrayAME [version 1] (Epcon, Belgium) and Xvision [version 2.2.211] (Mindfully Technologies SRL, Romania). This resulted in a total of 12 products in this evaluation.

Of 774 participants, 396 (51%) were male, and participants had a median age of 48.3 (± 18) years. Over half (n = 405, 52%) had TB symptoms, mostly cough (n = 258, 33%). Of 258 cases, most (n = 189, 73%) were positive on both liquid culture and Xpert (Table 1). Cases were significantly younger than controls, and were more likely to have had prior TB infection and to be receiving or have previously received TB treatment; however, cases were less likely than controls to report symptoms or to smoke. Over a quarter of cases were people living with HIV (n = 65), compared to only 15% of controls (n = 77). All CAD products assigned significantly lower abnormality scores to controls than to cases (p < 0.01).

Table 1 Characteristics of 774 participants.

Of the 774 CXR, 323 (42%), 324 (42%), 258 (33%), and 259 (33%) were classified as TB-suggestive by British, Nigerian, American, and Indian radiologists, respectively. Of these, 203, 199, 177, and 173 were bacteriologically positive, corresponding to 79%, 77%, 69%, and 67% of the 258 cases, respectively. Radiologists from all countries classified significantly more CXR as suggestive of TB in cases compared to controls.

Radiologists’ sensitivity and specificity against the MRS

In the restricted reading, radiologists from the UK had the highest overall sensitivity at 78.7% (95% CI: 73.2–83.5%), with specificity of 76.7% (95% CI: 72.9–80.3%), followed by Nigerian, American, and Indian radiologists (Table 2). Although the latter had significantly lower overall sensitivity than the British radiologists, there was some overlap in individual-level performance, as UK Radiologist 4 demonstrated a notably lower sensitivity (56.5% [34.5–76.8%]) than all other study radiologists. In contrast, American Radiologist 2 had the highest sensitivity across all study radiologists (94.1% [71.3–99.9%]); however, this was significantly higher only than that of Nigerian Radiologist 2.
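
The country-level point estimates can be reproduced from the counts reported earlier (for UK readers, 323 CXR flagged, 203 of them among the 258 cases, leaving 516 controls). The sketch below does this arithmetic. The Wilson score interval is an assumption, chosen because it needs only the standard library; this section does not state which interval method was used, so the bounds may differ slightly from the published ones. Function names are illustrative.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, not the paper's)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def sens_spec(flagged_total, flagged_cases, n_cases, n_controls):
    """Return (sensitivity, specificity) from reading counts."""
    false_pos = flagged_total - flagged_cases   # flagged CXR among controls
    true_neg = n_controls - false_pos
    return flagged_cases / n_cases, true_neg / n_controls

N_TOTAL, N_CASES = 774, 258
N_CONTROLS = N_TOTAL - N_CASES  # 516

# UK readers: 323 CXR flagged, 203 of them bacteriologically positive
uk_sens, uk_spec = sens_spec(323, 203, N_CASES, N_CONTROLS)
uk_sens_ci = wilson_ci(203, N_CASES)

print(f"UK sensitivity {uk_sens:.1%}, specificity {uk_spec:.1%}")
# prints "UK sensitivity 78.7%, specificity 76.7%"
```

The same calculation with the Nigerian (324, 199) and American (258, 177) counts gives the other restricted-reading country estimates.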

While British radiologists had the highest overall sensitivity, their specificity (76.7% [72.9–80.3%]) was lower than that of American and Indian radiologists, but higher than that of Nigerian radiologists. The individuals with the lowest specificity were UK Radiologist 3 (67.2% [54.6–78.2%]) and Nigerian Radiologist 1 (67.9% [62.0–73.4%]). Both had specificity significantly lower than Nigerian Radiologist 2, American Radiologist 1, and Indian Radiologist 3, with the latter also performing significantly worse than the other Indian radiologists. However, both UK Radiologist 3 and Nigerian Radiologist 1 demonstrated comparably high sensitivity relative to their peers. Overall, American radiologists had the highest specificity at 84.3% (95% CI: 80.9–87.3%), with sensitivity of 68.6% (95% CI: 62.6–74.2%), followed by Indian, British and Nigerian radiologists. American Radiologist 2 demonstrated the highest specificity at 84.8% (68.1–94.9%), although this was not significantly better than others. The same radiologist also demonstrated the highest sensitivity. While Nigerian radiologists were the least specific (75.8% [71.8–79.4%]), they had the second highest sensitivity after radiologists from the UK.

In the inclusive reading, American radiologists had the highest sensitivity (92.3% [95% CI: 88.3–95.2%]), followed by British, Nigerian, and Indian radiologists, with significant differences between the highest and lowest (Table 2). However, American radiologists also had significantly lower specificity than any other group. American Radiologist 2 had the highest sensitivity point estimate of all radiologists, achieving 100%, but with moderate specificity (66.7% [48.2–82.0%]). Indian radiologists had the highest specificity (73.6% [95% CI: 69.6–77.4%]) overall, significantly outperforming radiologists from all other countries. Individual specificity ranged from 73.0% (67.4–78.1%) to 77.3% (66.2–86.2%) across Indian readers, with the top-performing significantly more specific than Radiologist 1 from the US, UK, and Nigeria. While the sensitivity of Indian radiologists was lowest, this was only significant when compared to American radiologists. Meanwhile, British radiologists simultaneously achieved the second highest sensitivity and specificity, with 87.2% (82.5–91.0%) and 63.4% (59.1–67.5%), respectively.

Table 2 Radiologist sensitivity, specificity, and accuracy by country.

Only American Radiologist 2 surpassed the WHO’s Target Product Profile (TPP) of 90% sensitivity and 70% specificity on the restricted reading. None of the country-level estimates met the TPP. Sensitivity and specificity varied widely across countries even though all readers received the same reading instructions. In both inclusive and restricted readings, British and Nigerian readers were especially sensitive but less specific, whereas the reverse was observed for Indian radiologists, who generally had high specificity. Although American radiologists were highly sensitive in the inclusive reading, in the restricted reading their performance diverged: Radiologist 1 demonstrated higher specificity but missed TB cases, while Radiologist 2 delivered high sensitivity and specificity simultaneously.
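
The TPP comparison amounts to checking two thresholds at once. A minimal sketch follows, using restricted-reading point estimates quoted in the text. Treating the targets as inclusive (`>=`) and ignoring confidence intervals are simplifying assumptions, and the names are illustrative.

```python
# WHO TPP targets as stated in the text
TPP_SENSITIVITY = 0.90
TPP_SPECIFICITY = 0.70

def meets_tpp(sensitivity, specificity):
    """True when both point estimates reach the TPP targets (assumed inclusive)."""
    return sensitivity >= TPP_SENSITIVITY and specificity >= TPP_SPECIFICITY

# Restricted-reading point estimates taken from the text
readers = {
    "American Radiologist 2": (0.941, 0.848),
    "UK radiologists (group)": (0.787, 0.767),
    "American radiologists (group)": (0.686, 0.843),
}

results = {name: meets_tpp(*estimate) for name, estimate in readers.items()}
for name, ok in results.items():
    print(f"{name}: {'meets' if ok else 'does not meet'} the TPP")
```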

Level of agreement between country groups

On the restricted reading, agreement between radiologists’ groups ranged from weak to moderate. Agreement was strongest between American and Indian radiologists, with a kappa of 0.74 (95% CI: 0.68–0.79) (Table 3). Meanwhile, British radiologists had a low to moderate level of agreement with radiologists from all other countries²⁰. Agreement was weak between Nigerian and Indian radiologists (kappa = 0.60 [95% CI: 0.55–0.66]), while Nigerian and American radiologists had the weakest level of agreement overall (kappa = 0.59 [95% CI: 0.53–0.65]). Agreement was even lower on the inclusive reading. The highest alignment was between Indian and British radiologists (0.69 [95% CI: 0.64–0.74]), who had moderate agreement, while the lowest was between American radiologists and their Nigerian and Indian counterparts, with weak agreement of 0.55 (95% CI: 0.50–0.61) and 0.55 (95% CI: 0.50–0.60), respectively.
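
For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement with the agreement expected by chance from each reader's marginal rates. The sketch below computes it for two lists of binary readings. The ten readings are toy values invented for illustration, not study data, and the function name is hypothetical.

```python
def cohens_kappa(readings_a, readings_b):
    """Cohen's kappa for two equally long lists of categorical readings."""
    if len(readings_a) != len(readings_b) or not readings_a:
        raise ValueError("readings must be non-empty and of equal length")
    n = len(readings_a)
    # Observed agreement: share of CXR on which both readers give the same label
    observed = sum(a == b for a, b in zip(readings_a, readings_b)) / n
    # Chance agreement: product of the two readers' marginal rates, per label
    labels = set(readings_a) | set(readings_b)
    expected = sum(
        (readings_a.count(label) / n) * (readings_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)

# Toy data (1 = TB-suggestive, 0 = not); purely illustrative
group_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
group_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(group_a, group_b)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.60"
```

Here the readers agree on 8 of 10 images, and chance agreement is 0.5, giving (0.8 − 0.5) / (1 − 0.5) = 0.6.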

Table 3 Cohen’s kappa coefficient between countries on restricted and inclusive readings.

Impact of radiologists’ characteristics on performance

We analyzed the impact of various individual-level factors on how well radiologists interpreted X-rays to detect TB. We found that the country where a radiologist works made very little difference in their accuracy (variance of 0.019), meaning that being from different countries didn’t significantly affect how well radiologists read the X-rays. Although the intercept was statistically significant, indicating the overall chance of correctly identifying TB on an X-ray was better than guessing, this result might be influenced by the small number of radiologists we studied (only 11 people from 4 countries). Furthermore, the negative interaction term suggested that more experience may not necessarily result in higher reader accuracy, but this was not statistically significant. Including country of practice as a fixed effect also identified no significant impact of reader characteristics on interpretation accuracy (Annex 1).

Radiologists sub-group analysis

Radiologist performance varied across subgroups defined by HIV status, prior TB history, and age (Fig. 1). Results are presented here for the restricted analysis; however, similar trends are also observed for the inclusive analysis (Annex 2).

Sensitivity and specificity were generally higher in HIV-negative individuals than in those who were HIV-positive, with significant differences in sensitivity observed for US radiologists (a reduction of 23.5%) and in specificity for Indian and Nigerian radiologists, with reductions of 17.2% and 17.3%, respectively. UK radiologists demonstrated the highest sensitivity (83.3% [95% CI: 76.6–88.4%]) in HIV-negative people, with moderate specificity of 79.9% (75.3–83.8%). UK radiologists also had the second highest sensitivity in people living with HIV (PLHIV) (73.8% [62.0–83.0%]) and were only outperformed by Nigerian radiologists, who demonstrated sensitivity of 76.9% (65.4–85.5%) but the lowest specificity of any country in PLHIV (68.8% [57.8–78.1%]) (Annex 2). In the HIV-negative population, Indian radiologists demonstrated the highest specificity (87.3% [83.3–90.4%]) and lowest sensitivity (74.0% [66.4–80.4%]), but it was American radiologists who demonstrated the highest specificity in PLHIV (80.5% [70.3–87.8%]), at the expense of sensitivity (53.8% [41.9–65.4%]).

All groups demonstrated higher sensitivity but significantly lower specificity in individuals with prior TB compared to those without. The greatest difference was observed in British radiologists, who were 44.4% less specific in those with a history of TB. However, despite having the lowest specificity, UK radiologists demonstrated the highest sensitivity in those with prior TB at 84.6% (75.8–90.6%). In contrast, in people with prior TB, US radiologists had the lowest sensitivity (74.7% [64.9–82.5%]), but the highest specificity (60.4% [50.9–69.2%]). Indian and Nigerian radiologists demonstrated a better balance of sensitivity and specificity. In those who had never had TB, UK radiologists again had the highest sensitivity at 75.4% (68.4–81.4%), simultaneously obtaining high specificity (85.9% [82.1–88.9%]), although this was surpassed by both American and Indian radiologists. Specificity in this group was highest for Indian radiologists at 91.2% (88.1–93.6%) and lowest for Nigerian radiologists at 84.4% (80.6–87.6%).

Age also influenced radiologist performance. Sensitivity was highest in the youngest group (15 to < 35 years) and declined with age. While radiologists demonstrated similar sensitivities in the young and middle age groups (35 to < 55 years), with overlapping confidence intervals, radiologists from India, the UK and the US were significantly less sensitive in the older age group (> 55 years) than in the young. Indian and Nigerian radiologists were also significantly less sensitive in the older than in the middle age group. The largest absolute reduction in sensitivity between young and old was observed for Indian radiologists at 24.3%. Trends in specificity were more complex. Specificity also decreased with age but was lowest in the middle-aged group for all countries, recovering marginally in the older age group. For Indian and British radiologists, specificity was significantly lower in the middle age group compared to the younger, with the greatest reduction observed for British radiologists (19.7%). There were no significant differences in specificity between middle and older age groups.

In the youngest group, UK radiologists had the highest sensitivity at 85.7% (76.7–91.6%), with moderate-ranking specificity of 88.9% (82.5–93.1%), while Indian radiologists had the lowest sensitivity at 75.0% (64.8–83.0%) and highest specificity (92.6% [86.9–95.9%]). In the middle-aged group, Nigerian radiologists showed the highest sensitivity (83.5% [75.1–89.4%]) and lowest specificity, while US radiologists demonstrated the lowest sensitivity and highest specificity (80.8% [73.7–86.4%]). Among the oldest group, UK radiologists had the highest sensitivity at 66.2% (54.6–76.1%), with moderate specificity of 74.5% (68.5–79.6%). Indian and American radiologists both demonstrated very low sensitivity in the older population at 50.7% (39.3–62.0%) and 54.9% (43.4–66.0%), respectively, but both had specificity surpassing 80%.

Radiologists compared to CAD overall performance

In general, CAD solutions and radiologists demonstrate a considerable overlap in performance (Fig. 2). However, the CAD with the highest AUC, Lunit (0.902 [95% CI: 0.879–0.926]), outperformed all radiologists on both inclusive and restricted readings. Nexus (0.897 [95% CI: 0.872–0.922]), the second highest performing CAD, outperformed radiologists on the restricted reading and performed similarly on the inclusive reading. For both restricted and inclusive readings, the performance of most other CAD software overlapped with that of radiologists, including qXR, JF CXR-2, ChestEye, Xvision, CAD4TB, Genki, InferRead DR Chest, and TiSepX-TB. There was no significant difference in performance between most CAD software, except for TiSepX-TB, which had non-overlapping confidence intervals with the top two performers (Lunit and Nexus); XrayAME, which performed significantly worse than all but TiSepX-TB and RADIFY; and RADIFY, which performed significantly worse than all other CAD. The software with the lowest AUCs (RADIFY and XrayAME) performed worse than all radiologists on both readings.
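
As a reminder of what the AUC summarises, it equals the probability that a randomly chosen case receives a higher abnormality score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The scores are toy values, not study data, and the confidence intervals reported above are not reproduced here.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the share of case/control pairs ranked correctly (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Toy abnormality scores; purely illustrative
toy_cases = [0.9, 0.8, 0.6, 0.4]
toy_controls = [0.7, 0.3, 0.2, 0.1]

auc = roc_auc(toy_cases, toy_controls)
print(f"AUC = {auc:.3f}")  # prints "AUC = 0.875"
```

In the toy data, 14 of the 16 case/control pairs are ordered correctly, hence 0.875. The pairwise loop is quadratic in the number of scores, which is acceptable at the scale of a few hundred CXR.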

Our prior, complementary publication provides a detailed interpretation of CAD software performance alone in terms of ROC-AUC and other performance metrics¹⁷.

Matching CAD with radiologists’ sensitivity and specificity when detecting TB against MRS

In both restricted and inclusive readings, CE-marked CAD solutions (Class I, IIa, and IIb) generally performed as well as or better than radiologists across all four countries (Table 4). The CE mark signifies compliance with EU standards for safety, health, and environmental protection. For CAD products included in Table 4, Class I is the least stringent classification, while Class IIb is the most stringent. There were some exceptions: CAD4TB had significantly lower specificity when matching Indian radiologists’ sensitivity on the inclusive reading, InferRead had significantly lower sensitivity when matching American radiologists’ specificity on the restricted reading, and ChestEye had significantly lower specificity when matching American radiologists’ sensitivity on the restricted reading.

Lunit significantly outperformed all radiologists, except for Indian radiologists when matching specificity in the inclusive reading. Several CAD products without a CE mark also performed well compared to radiologists. Nexus significantly outperformed all radiologists, except Indian radiologists when matching specificity in the restricted reading. Furthermore, Genki and JF CXR-2 performed on par with radiologists from all countries in the restricted reading, while JF CXR-2 significantly improved upon radiologists from Nigeria and the US in the inclusive reading. Genki also significantly improved upon the sensitivity of American radiologists when matching specificity in the inclusive reading. RADIFY and XrayAME performed significantly worse than radiologists in all analyses, except for Indian radiologists when matching sensitivity in the restricted reading. Further details are available in Annexes 3 and 4.
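
The matching procedure can be pictured as follows: the CAD threshold is moved until its sensitivity reaches the radiologists' sensitivity, and the specificities are then compared at that operating point (and the reverse for matched specificity). The sketch below is one plausible reading of that procedure, not the authors' implementation. The scores and the reader's operating point are toy values, a score at or above the threshold is assumed to count as positive, and the significance test used in the study is not shown.

```python
def sensitivity_at(case_scores, threshold):
    """Share of cases scored at or above the threshold."""
    return sum(s >= threshold for s in case_scores) / len(case_scores)

def specificity_at(control_scores, threshold):
    """Share of controls scored below the threshold."""
    return sum(s < threshold for s in control_scores) / len(control_scores)

def threshold_matching_sensitivity(case_scores, target_sensitivity):
    """Highest threshold whose sensitivity is at least the target."""
    for threshold in sorted(set(case_scores), reverse=True):
        if sensitivity_at(case_scores, threshold) >= target_sensitivity:
            return threshold
    raise ValueError("target sensitivity must be at most 1.0")

# Toy CAD scores and a hypothetical reader operating point
cad_case_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
cad_control_scores = [0.6, 0.5, 0.3, 0.2, 0.1]
reader_sensitivity, reader_specificity = 0.8, 0.5

matched_threshold = threshold_matching_sensitivity(cad_case_scores, reader_sensitivity)
cad_specificity = specificity_at(cad_control_scores, matched_threshold)
print(matched_threshold, cad_specificity)  # prints "0.4 0.6"
```

In this toy case the CAD reaches the reader's sensitivity at a threshold of 0.4 and is more specific there (0.6 versus 0.5). Whether such a difference is statistically significant would need a paired test on the real readings.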

Table 4 Summary of statistical differences between CAD and radiologists. Green = CAD performed better than the radiologist (p < 0.05); red = statistically worse performance by CAD, p < 0.05; grey = no statistical difference between performance.

Performance of radiologists compared to the best-performing CAD software

We compared radiologists’ performance against the best performing CAD solution. Figure 3 shows the ROC curve of the CAD software with the highest AUC with the points of each group of radiologists indicated. Radiologists’ sensitivity and specificity fall below the level of the highest-performing CAD software on both the restricted and inclusive readings. For analysis of radiologists’ groups against different CAD software see Annex 5.

We then compared radiologists’ performance against the best CAD software when matching sensitivity and specificity (Annex 5). On the restricted reading, American radiologists achieved specificity closest to the best CAD at 84.3% (95% CI: 80.9–87.3%) compared to 88.4% (95% CI: 85.3–91.0%), with overlapping confidence intervals. The sensitivity of British radiologists (78.7% [95% CI: 73.2–83.5%]) was closest to, and overlapped with, that of the CAD (87.1% [95% CI: 80.4–92.2%]). For the inclusive reading, when sensitivity was matched, Indian radiologists compared most favorably to the best CAD software, with specificities of 73.6% (95% CI: 69.6–77.4%) and 83.5% (95% CI: 80.0–86.6%), respectively, while American radiologists were closest to the best CAD software in terms of sensitivity, with 92.3% (88.3–95.2%) compared to 96.4% (91.9–98.8%).

Given the heterogeneity in radiologists’ performance, we calculated the Euclidean distance from the performance point of each country group, and of each individual radiologist, to the ROC curve of the best-performing CAD software to assess how closely aligned performance is (Table 5)²¹,²². The shortest possible distance was calculated from each point to the ROC curve, with a smaller distance indicating a more similar performance to the software. In both restricted and inclusive readings, American radiologists performed most comparably to the best CAD (i.e. the shortest distance), followed by those from India, the UK and Nigeria in the restricted reading and by British, Indian, and Nigerian radiologists in the inclusive reading.
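
The distance calculation can be sketched as a point-to-polyline problem in ROC space, with x = 1 − specificity and y = sensitivity. The example below treats the ROC curve as straight segments between its operating points and takes the minimum distance over all segments. The piecewise-linear treatment, the toy curve, and the reader point are assumptions for illustration, not the study's data or code.

```python
from math import hypot, sqrt

def point_segment_distance(point, start, end):
    """Shortest Euclidean distance from a point to a line segment."""
    px, py = point
    ax, ay = start
    dx, dy = end[0] - ax, end[1] - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: both ends coincide
        return hypot(px - ax, py - ay)
    # Position of the perpendicular foot along the segment, clamped to its ends
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_roc(point, roc_points):
    """Minimum distance from a reader's point to a piecewise-linear ROC curve."""
    if len(roc_points) < 2:
        raise ValueError("an ROC curve needs at least two points")
    return min(
        point_segment_distance(point, a, b)
        for a, b in zip(roc_points, roc_points[1:])
    )

# Toy ROC curve as (1 - specificity, sensitivity) pairs, ordered along the curve
toy_roc = [(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]
toy_reader = (0.2, 0.6)  # hypothetical reader: specificity 0.8, sensitivity 0.6

distance = distance_to_roc(toy_reader, toy_roc)
print(f"distance = {distance:.4f}")
```

A point lying on the curve has distance zero, and larger values indicate a reader whose operating point sits further from anything the software can achieve by changing its threshold.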

Table 5 Euclidean distance between the ROC of the highest-performing CAD and radiologists from each country.

Overall, we found no significant differences in Euclidean distances between the radiologists (in general p > 0.05) (Annex 6), with a few exceptions. In the restricted reading, UK Radiologist 1 showed the closest individual alignment with the best CAD, although it was American radiologists who were most closely aligned overall (Table 5). This alignment was only significantly greater than that of UK Radiologist 2 and American Radiologist 2. Notably, American Radiologist 2 had the greatest Euclidean distance from the best CAD’s curve, making them the least aligned reader on the restricted reading, despite being the only one to meet the WHO TPP. This discrepancy was significant only compared to UK Radiologist 1 (p = 0.015, Annex 6). For the inclusive reading, UK Radiologist 1 also aligned closest with CAD (p = 0.033, Table 5), but alignment was significantly closer only when compared to Nigerian Radiologist 2, who was the least aligned reader (p = 0.018, Table 5). Nigerian Radiologist 2 was also significantly further from the ROC curve than UK Radiologist 4, indicating less alignment with the best-performing CAD software.

Agreement between radiologists from each country and the best-performing CAD

We calculated the agreement between the country groups and the highest performing CAD software, setting the threshold of CAD at the midpoint of 0.5. In the restricted reading, there was moderate to strong agreement between this CAD and each group of radiologists with kappa ranging from 0.65 (0.60–0.70) for Nigerian radiologists to 0.72 (0.66–0.76) for Indian radiologists (Annex 7)²⁰. In the inclusive reading there was weaker agreement, except for Indian radiologists, where the kappa coefficient was 0.7, indicating moderate to strong agreement.
