Machine learning based characterization of high risk carriers of HTLV-1-associated myelopathy (HAM)

July 12, 2025

In this study, we developed a machine learning-based approach to capture HTLV-1 carriers at elevated risk of HAM progression. The Isolation Forest anomaly detection algorithm identified a subgroup of anomaly samples within the asymptomatic HTLV-1 carrier population. Further characterization through classifier prediction and statistical analysis revealed that the anomaly carrier samples closely resemble HAM in their characteristics, suggesting a similar disease trajectory. Additionally, different patterns of antibody response were observed among the asymptomatic carriers and other clinical subgroups, which enabled us to further investigate the risk factors. Finally, we used SHAP for comparative feature analysis among the sample groups (non-anomaly carrier, anomaly carrier, ATL, and HAM) to identify the key driving features that characterize each subgroup and contribute to disease progression.

The main aim of this study was to shed light on asymptomatic carriers who are at high risk of progressing to HAM onset. Most of the anomaly carrier samples were predicted as HAM by the RF classifier [Fig. 1], and our hypothesis was further supported when the CDH sample, purposely included in the carrier population, was also identified as an anomaly and subsequently predicted as HAM. The potential similarities in the underlying profiles of the anomaly carrier samples are also reflected in their clustering near the HAM samples (Fig. 2). All features were significantly higher in anomaly carriers than in non-anomaly carriers (Fig. 3). Elevated antibody responses in anomaly carriers might reflect higher immune activity during disease progression. Interestingly, we found that only the anti-Env antibody titer in anomaly carriers differed significantly from that of HAM (Fig. 3, Supplementary Table S6), whereas other features showed no significant differences. Env is one of the structural proteins of the virion and is necessary for cell-to-cell transmission; thus, it is a primary target of the antibody response^32,33,34,35. Furthermore, elevated anti-Env antibody responses have been associated with HAM patients in several studies, which supports our result^11,20,36,37. A novel implication is that, before onset, the rate of progression accelerates, as evidenced by the increased antibody levels. In HAM, the immune response is fully engaged; however, in progressive asymptomatic carriers, this saturation has yet to be achieved²⁰. This phase might represent a snapshot of dynamic host-virus interaction in which rising antibody titers likely reflect heightened viral activity and the immune system’s escalating response as the disease advances toward clinical manifestation. Ultimately, a saturation point is reached at the onset of the disease, where antibody levels plateau as the immune response shifts into a steady-state phase. This might be reflected in the feature analysis, where the SHAP value of Env is relatively high in anomaly carriers but not in HAM or non-anomaly carriers [Fig. 4, Supplementary Fig. S7].

We found Tax to be the predominant feature of HAM, consistent with findings from multiple studies^20,38. Furthermore, prior studies have reported significantly higher antibody responses to Env and Gag proteins in HAM patients, reinforcing their potential role in HAM^11,20. It is known that during infection, Gag and Env proteins are initially unpolarized in isolated T cells and accumulate at the cell-cell junction upon contact. Gag protein is subsequently transferred from HTLV-1-infected T cells to uninfected T cells³⁹. In line with these observations, we found that the feature values of Gag p15, p24, and Env in anomaly carrier samples exhibited a significant inverse relationship with their anomaly scores; because lower (more negative) scores indicate stronger anomalies, higher feature values correspond to higher anomaly levels [Supplementary Table S2, Table S5, Supplementary Fig. S5]⁴⁰. Assessment of humoral immunity to Gag therefore shows potential as a biomarker for detecting high-risk individuals. Our results suggest that the Gag p15 protein may have an important function in the development of HAM [Figs. 3 and 4 and Supplementary Fig. S7]. However, we refrain from drawing specific implications about Gag p15; further research is required to identify the specific functions of the mature Gag proteins (p15, p19, and p24). It is noteworthy that, although the SHAP value of Gag p24 ranks among the top features characterizing anomaly carriers in some classifiers, we opted not to interpret the Gag features because their contribution patterns were inconsistent across the classifiers employed in this study [Supplementary Fig. S7].

Identifying the risk of HAM onset is challenging compared with other HTLV-1-associated diseases. In the case of ATL, for example, the risk can often be characterized by changes in the clonality of infected cells, since a single infected clone expands during progression. Also, several driver mutations are reported to stimulate malignancy, promoting the survival of pathogenic cells, which then outcompete other infected cells and proliferate monoclonally⁴¹. While these promising markers can detect the risk of ATL onset, markers for the early diagnosis of HAM are less well described because of its slow progression⁴². Moreover, complicated host immune responses against infected cells vary widely between patients with different lifestyles, which makes prediction more difficult⁴³. With anti-Env at the top of the list, an elevated antibody titer might be a key observation for evaluating disease progression.

Of interest is the significant heterogeneity in immune response among the asymptomatic carriers in our study. Surprisingly, antibody responses (against Env, Tax, and Gag proteins) and PVL in many asymptomatic carriers were at the same elevated level as in HTLV-1-related diseases (ATL and HAM). This finding led us to our initial hypothesis that high-risk asymptomatic carriers (i.e., anomaly carriers) who are likely to progress to disease onset can be detected. Although heterogeneity may seem obvious given the varied lifestyle backgrounds of patients, it is noteworthy that we could confirm it in our large dataset of asymptomatic carriers. This motivated our distinct approach, which nevertheless aligns with our previous findings from the same dataset showing the latent and diverse potential of asymptomatic carriers²².

Our work has some limitations. First, we have no information on whether the anomaly carriers later developed HTLV-1-related diseases, except for one sample (CDH) who was subsequently diagnosed with HAM. To fully evaluate our predictions and hypothesis, especially for HAM, further data accumulation (e.g., a prospective study like¹⁵) would be critical. Second, little is known about the relationship between antibody titers and the host immune defense, as mentioned above. For a LIPS assay dataset like ours to be used for clinical diagnosis, this interplay should be explored in more depth. Furthermore, inconsistent antibody titer results across previous studies have discouraged clinical application and make it difficult to choose consensus cutoff values for distinguishing diseases⁴². Additionally, our dataset exhibits class imbalance (more carriers than ATL and HAM), which reflects the skewed real-world prevalence of the diseases; however, it is still more balanced than reality (ATL and HAM make up larger proportions than in the real world). We implemented repeated downsampling to prevent the imbalance from distorting model training, but each downsampling excludes some carrier samples, so we run the risk of understating model generalizability when applied to the full population. Integrating additional clinical metadata on clinical background or comorbidities, which were unavailable here, may enhance the interpretability of the anomaly-detected carrier subgroup. Finally, as this study was designed as a data-driven investigation focusing primarily on immunological patterns inferred from antibody titer profiles, our findings should be considered exploratory and hypothesis-generating. Given the absence of external model validation, the results remain preliminary. The identified risk indicators are not clinically actionable conclusions but should be viewed as starting points for further prospective studies involving larger cohorts and independent validation.

Methods

Ethics statement

This study was performed in accordance with the Declaration of Helsinki and was approved by the Ethics Committees of Kumamoto University (accession numbers: G489, G499, and E2214). Written informed consent was waived because of the retrospective design. Consent for publication was obtained from all patients.

Study population

The data used in this study were published previously by Yamada et al.²². PVL and antibody titer data (non-time series) against the HTLV-1 antigens Tax, Env, Gag p15, p19, and p24 were collected using the LIPS assay. No cut-off was applied to LIPS antibody titers, allowing continuous evaluation of their distribution and diagnostic relevance. In our analysis, the ATL group comprised both individuals who had been diagnosed with ATL at the time of sample collection (n = 25) and those who were carriers at the time of sampling but were later diagnosed with ATL (CDA, n = 24). Only one carrier later developed HAM (CDH), and this sample was purposefully included in the carrier population. Therefore, we focused our study on 264 asymptomatic carriers, 49 ATL, and 56 HAM patients.

Determination of key variables

Initially, Spearman’s rank correlation revealed a significant correlation between Gag p19 and p24 [Supplementary Fig. S2]. To address the multicollinearity issue and choose the variables to use in the ML analysis, the Variance Inflation Factor (VIF) score was used²⁹. See Supplementary Tables S1A and S1B.
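As a rough illustration of how a VIF score flags collinear features, the sketch below regresses each feature on the others and computes 1 / (1 - R²). It uses plain NumPy least squares and synthetic data; the function name `variance_inflation_factors` and the example columns are assumptions for illustration, not the implementation used in the study.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        # Regress feature j on all remaining features plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic stand-in data: column c is nearly a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
```

A feature whose VIF is large (a and c here) is well explained by the others and is a candidate for removal; the cutoff actually applied is given in the Supplementary Tables, not in this sketch.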

Anomaly detection by isolation forest algorithm

For identifying potential outliers or anomalous data points from the asymptomatic carrier population (n = 264), we selected the Isolation Forest Anomaly Detection algorithm, an unsupervised machine learning technique based on decision trees, as our primary method because of its unique approach of isolating anomalies rather than profiling normal data. For each datapoint (sample), the following process is repeated until the datapoint is isolated:

1. Randomly select a feature (e.g. PVL).
2. Randomly choose a threshold between the maximum and minimum values of the selected feature (e.g. PVL = 0.1) and divide the data points below and above the threshold.
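The two steps above can be sketched for a single data point as follows. This is a simplified, pure-Python illustration on synthetic two-feature data: it follows only the partition containing the point and does not build subsampled trees as a full Isolation Forest does. The function name `isolation_depth` and the `max_depth` guard are assumptions.

```python
import random


def isolation_depth(point, data, rng, max_depth=100):
    """Count the random splits needed until `point` is alone in its partition."""
    current = list(data)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        # Only features that still vary within the partition can be split.
        splittable = [
            f for f in range(len(point))
            if min(r[f] for r in current) < max(r[f] for r in current)
        ]
        if not splittable:
            break
        feature = rng.choice(splittable)               # step 1: random feature
        low = min(r[feature] for r in current)
        high = max(r[feature] for r in current)
        threshold = rng.uniform(low, high)             # step 2: random threshold
        # Keep only the side of the split that contains the point.
        if point[feature] < threshold:
            current = [r for r in current if r[feature] < threshold]
        else:
            current = [r for r in current if r[feature] >= threshold]
        depth += 1
    return depth


rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outlier = (10.0, 10.0)
data = cluster + [outlier]
inlier = cluster[0]

n_trials = 200
mean_outlier = sum(isolation_depth(outlier, data, rng) for _ in range(n_trials)) / n_trials
mean_inlier = sum(isolation_depth(inlier, data, rng) for _ in range(n_trials)) / n_trials
```

Averaged over many random trials, the point far from the cluster is separated in fewer splits than a point inside it.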

The key idea is that data points with anomalous feature values are likely to be isolated with only a few iterations. The algorithm constructs an ensemble of isolation trees for a given dataset and uses the path length from the root to the leaf to determine the anomaly score. Given that m is the number of data points, the anomaly score s for a datapoint x is defined as

$$s(x,m)=2^{-\frac{E(h_i(x))}{c(m)}}$$

(1)

where $h_i(x)$ represents the path length of $x$ in the $i$-th isolation tree, $E(h_i(x))=\frac{1}{T}\sum_{i=1}^{T}h_i(x)$ denotes the average path length across the ensemble of $T$ isolation trees,

$$c(m)=\begin{cases}2H(m-1)-\frac{2(m-1)}{m} & (m>2)\\ 1 & (m=2)\\ 0 & (\text{otherwise})\end{cases}$$

is the average path length for a dataset with m points, utilized as a normalization factor⁴⁴, and $H(k)$ is the $k$-th harmonic number. The sklearn implementation of the decision function of Isolation Forest yields shifted scores in which negative values indicate potential anomalies and lower scores indicate stronger anomalies⁴⁰.
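To make Eq. (1) concrete, the sketch below evaluates the normalization factor c(m) and the score s(x, m) from a list of per-tree path lengths. The harmonic number is summed exactly here (implementations often approximate it with a logarithm), and the function names are assumptions. The score is defined only for m ≥ 2, since c(m) = 0 otherwise.

```python
def harmonic(k):
    """Exact k-th harmonic number H(k) = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))


def c(m):
    """Normalization factor c(m) from the piecewise definition."""
    if m > 2:
        return 2.0 * harmonic(m - 1) - 2.0 * (m - 1) / m
    if m == 2:
        return 1.0
    return 0.0


def anomaly_score(path_lengths, m):
    """Eq. (1): s = 2 ** (-E(h(x)) / c(m)), with E the mean over trees."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / c(m))


m = 264  # number of asymptomatic carriers in the study
score_short_paths = anomaly_score([2.0] * 10, m)    # isolated quickly
score_typical_paths = anomaly_score([c(m)] * 10, m)  # average path length
```

In this raw form, a point whose mean path length equals c(m) scores exactly 0.5, and shorter paths push the score toward 1. The sklearn decision function reports a shifted and sign-flipped version of this quantity, which is why anomalies there have negative values.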

By applying a cutoff threshold at -0.05 to the anomaly scores of the Isolation forest, we isolated the anomaly data points for further investigation⁴⁵. This threshold was strategically chosen to capture approximately 5% of the most extreme anomalies (inversely corresponding to the 95th percentile of the normal data distribution) from our carrier population [Supplementary Fig. S4]. Since around 4% of the carriers develop HAM^3,46, we aimed to mirror this proportion.
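A minimal sketch of this thresholding step with scikit-learn is shown below. The data are synthetic stand-ins for the carrier feature matrix, and the hyperparameters (`n_estimators`, `random_state`) and the strict `<` comparison are assumptions; only the -0.05 cutoff comes from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for 264 carriers x 5 features (not the study data).
rng = np.random.default_rng(0)
X_carriers = rng.lognormal(size=(264, 5))

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_carriers)

# decision_function is negative for likely anomalies; lower = more anomalous.
scores = forest.decision_function(X_carriers)

THRESHOLD = -0.05  # cutoff stated in the text
is_anomaly = scores < THRESHOLD

X_anomaly = X_carriers[is_anomaly]        # candidate hold-out set
X_non_anomaly = X_carriers[~is_anomaly]   # stays in the training pool
fraction_flagged = is_anomaly.mean()
```

On real data, `fraction_flagged` can be compared with the targeted proportion of roughly 5% to check whether the cutoff behaves as intended.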

The resulting anomaly carrier samples were then removed from the carrier data and treated as a holdout test set (unseen data) for further classification analysis. The remaining non-anomaly carrier, ATL, and HAM samples were used for training and cross-validation of the classifier models. Additionally, the feature values of the anomaly carrier samples were tested for Spearman correlation with their anomaly scores. The differences between the sample groups were evaluated by plotting all samples on a PLS plane.
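The correlation between a feature and the anomaly scores can be computed with SciPy as sketched below. The arrays are synthetic and constructed so that higher titers go with lower (more negative) scores; the variable names are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-ins: 20 anomaly carriers with scores below the -0.05 cutoff.
rng = np.random.default_rng(1)
anomaly_scores = -0.05 - rng.uniform(0.0, 0.1, size=20)
# Hypothetical titer that rises as the score becomes more negative.
env_titer = -anomaly_scores * 1e5 + rng.normal(0.0, 500.0, size=20)

rho, p_value = spearmanr(env_titer, anomaly_scores)
```

A negative `rho` here corresponds to the inverse relationship described in the text: larger feature values accompany more negative, and therefore more anomalous, scores.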

Classification modeling

We employed the One-vs-Rest (OvR) approach to address the multiclass classification problem. This approach breaks down the multiclass classification into multiple binary classification tasks, where one classifier is trained for each class against all others. To determine the best-performing model, we evaluated four different classifiers: Random Forest (RF), XGBoost (XGB), Extra Trees (ETC), and Support Vector Machine (SVM). Given our data volume, we relied on these models because they represent diverse, well-established approaches suitable for our classification task. Combining tree-based ensemble models with SVMs has been shown to balance performance, interpretability, and generalization in high-dimensional biomedical data⁴⁷.
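The OvR setup can be sketched with scikit-learn's `OneVsRestClassifier`, which fits one binary model per class. The example uses synthetic three-class data and arbitrary hyperparameters, and it omits XGBoost only to keep the sketch free of an extra dependency; none of these choices are taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic three-class data standing in for carrier / ATL / HAM.
X, y = make_classification(
    n_samples=150, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

base_models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ETC": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(probability=True, random_state=0),  # probability=True enables predict_proba
}

# One binary classifier per class is trained inside each OvR wrapper.
ovr_models = {name: OneVsRestClassifier(model).fit(X, y)
              for name, model in base_models.items()}
probabilities = {name: model.predict_proba(X)
                 for name, model in ovr_models.items()}
```

Each wrapper exposes its per-class binary models through `estimators_`, and `predict_proba` returns one column per class.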

Nested cross-validation (CV) was used to ensure robust performance evaluation and avoid overfitting. Specifically, an outer CV loop was used to assess model performance, while an inner loop was used to optimize the hyperparameters of each classifier using GridSearch. To address class imbalance, we first performed bootstrap downsampling of the carrier group: in each of five independent iterations, we randomly sampled 147 carriers with replacement and merged them with the full ATL and HAM cohorts to form a training/validation subset. We then applied nested CV to each subset, using an outer 5-fold loop to estimate model generalization and an inner 5-fold loop within each outer training fold for hyperparameter tuning. The overall mean area under the precision-recall curve (PRAUC)^46,48 served as the optimization criterion in the inner loop and as the performance metric in the outer loop.
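The downsampling and nested CV procedure might look roughly like the sketch below. Several things are assumptions: the data are synthetic, the parameter grid is hypothetical, average precision is used as the PRAUC estimate, and the demonstration call uses fewer repeats and folds than the five repeats and 5 × 5 folds described above so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize


def mean_prauc(estimator, X, y):
    """Scorer: mean over classes of the average precision (PRAUC estimate)."""
    classes = estimator.classes_
    y_bin = label_binarize(y, classes=classes)
    proba = estimator.predict_proba(X)
    per_class = [average_precision_score(y_bin[:, k], proba[:, k])
                 for k in range(len(classes))]
    return float(np.mean(per_class))


def nested_cv_mean_prauc(X, y, seed, n_outer=5, n_inner=5):
    """Outer loop estimates performance; inner loop tunes hyperparameters."""
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
    grid = {"estimator__max_depth": [3, None]}  # hypothetical grid
    fold_scores = []
    for train_idx, test_idx in outer.split(X, y):
        model = OneVsRestClassifier(
            RandomForestClassifier(n_estimators=20, random_state=seed))
        search = GridSearchCV(model, grid, scoring=mean_prauc, cv=inner)
        search.fit(X[train_idx], y[train_idx])
        fold_scores.append(mean_prauc(search, X[test_idx], y[test_idx]))
    return float(np.mean(fold_scores))


# Synthetic stand-ins for non-anomaly carriers, ATL (n = 49), and HAM (n = 56).
rng = np.random.default_rng(0)
X_carrier = rng.normal(0.0, 1.0, size=(251, 5))
X_atl = rng.normal(1.5, 1.0, size=(49, 5))
X_ham = rng.normal(-1.5, 1.0, size=(56, 5))

repeat_scores = []
for repeat in range(2):  # the text describes five repeats
    # Bootstrap downsampling: 147 carriers drawn with replacement.
    idx = rng.choice(len(X_carrier), size=147, replace=True)
    X_sub = np.vstack([X_carrier[idx], X_atl, X_ham])
    y_sub = np.array([0] * 147 + [1] * 49 + [2] * 56)
    repeat_scores.append(
        nested_cv_mean_prauc(X_sub, y_sub, seed=repeat, n_outer=3, n_inner=3))

overall_mean_prauc = float(np.mean(repeat_scores))
```

Passing `n_outer=5, n_inner=5` and looping over five repeats would match the fold structure in the text; the same function can be reused for each candidate classifier by swapping the base estimator and grid.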

Next, for each candidate classifier (Random Forest, XGBoost, ExtraTrees, SVM), we averaged its per-class PRAUCs across the five outer folds to obtain an “overall mean PRAUC” per repeat, and then aggregated these values across repeats to yield a mean for each model. The model with the highest average overall mean PRAUC was chosen as the best, retrained on the full training set, and then applied to the held-out anomaly carrier samples. We then calculated the predicted probability of HAM for each anomaly carrier sample, followed by a correlation analysis between the predicted probabilities and the feature values. The workflow is depicted in [Supplementary Fig. S1]. The classification models in this study were implemented with the sklearn library⁴⁰.

Boxplot visualization and statistical analysis

We employed a combination of visual and statistical methods to facilitate an initial comparison of the feature distributions among the sample groups, including anomaly carriers. The Kruskal-Wallis test was performed with a significance level of α = 0.05. P-values from Dunn’s post-hoc analysis were adjusted for multiple comparisons using the Bonferroni correction to maintain the overall type 1 error rate. The statistical analysis was performed using the Python SciPy package^49,50.
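The testing sequence can be sketched as follows: a Kruskal-Wallis test across all groups, then pairwise Dunn comparisons with Bonferroni adjustment. Dunn's test is written out by hand here from its rank-based z statistic because SciPy does not provide it directly; the function name `dunn_bonferroni`, the group sizes other than ATL and HAM, and the synthetic values are assumptions.

```python
import itertools

import numpy as np
from scipy.stats import kruskal, norm, rankdata


def dunn_bonferroni(groups):
    """Pairwise Dunn z-tests on pooled ranks with Bonferroni-adjusted p-values."""
    names = list(groups)
    values = np.concatenate([np.asarray(groups[n], dtype=float) for n in names])
    ranks = rankdata(values)
    n_total = len(values)
    # Tie correction term (zero when all values are distinct).
    _, counts = np.unique(values, return_counts=True)
    tie_term = np.sum(counts ** 3 - counts) / (12.0 * (n_total - 1))
    mean_rank, size, start = {}, {}, 0
    for n in names:
        size[n] = len(groups[n])
        mean_rank[n] = ranks[start:start + size[n]].mean()
        start += size[n]
    pairs = list(itertools.combinations(names, 2))
    adjusted = {}
    for a, b in pairs:
        variance = (n_total * (n_total + 1) / 12.0 - tie_term) * (
            1.0 / size[a] + 1.0 / size[b])
        z = (mean_rank[a] - mean_rank[b]) / np.sqrt(variance)
        p = 2.0 * norm.sf(abs(z))                   # two-sided p-value
        adjusted[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni adjustment
    return adjusted


# Synthetic titers for one feature in four groups (not the study data).
rng = np.random.default_rng(0)
groups = {
    "non_anomaly": rng.normal(0.0, 1.0, size=100),
    "anomaly": rng.normal(3.0, 1.0, size=13),
    "ATL": rng.normal(1.5, 1.0, size=49),
    "HAM": rng.normal(3.0, 1.0, size=56),
}

statistic, p_overall = kruskal(*groups.values())
pairwise_p = dunn_bonferroni(groups)
```

Pairs whose adjusted p-value falls below α = 0.05 would be reported as significantly different for that feature.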

Interpretation with SHapley additive explanations (SHAP) analysis

To interpret the model’s behavior, the SHapley Additive exPlanations (SHAP) framework was used^51,52. It provides a SHAP value for each feature of each sample and explains how much an increase in each feature value affects the predicted probability for each clinical subgroup (non-anomaly carriers, ATL, HAM, and anomaly carriers). A higher SHAP value indicates a greater impact on the classification of a sample into a specific subgroup, while a lower SHAP value corresponds to a smaller impact. The four classifiers (RF, ETC, XGB, and SVM), whose performance in terms of PRAUC had been assessed using nested cross-validation, were each analyzed with 300 different random seeds (i.e., different values of the parameter random_state). Multiple random seeds were used because we wanted to extract SHAP values that remain consistent regardless of the randomization involved in the learning process, which allows us to evaluate the results with a high degree of confidence. For each random seed, hyperparameters were optimized by GridSearch on all data without cross-validation and used for calculating SHAP values. The absolute median of the SHAP values across all samples was collected for each of the 300 random seeds, and then its median and standard deviation across seeds were calculated for visualization. Specifically, KernelSHAP from the SHAP Python package (version 0.45.1) was applied to all classifiers⁵².
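The seed-wise aggregation can be illustrated without the SHAP package by computing exact Shapley values over all feature coalitions, which is the quantity KernelSHAP approximates by sampling. This is a swapped-in technique for illustration only, feasible because the feature count is small. The synthetic data, the three seeds (instead of 300), and the reading of "absolute median" as the median of absolute SHAP values are all assumptions.

```python
import itertools
import math

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def exact_shapley(predict, x, background):
    """Exact Shapley values of `predict` at x; absent features are
    marginalized by averaging predictions over the background rows."""
    n = len(x)
    cache = {}

    def value(subset):
        if subset not in cache:
            data = background.copy()
            for k in subset:
                data[:, k] = x[k]  # fix present features to the sample's values
            cache[subset] = float(np.mean(predict(data)))
        return cache[subset]

    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                with_j = tuple(sorted(subset + (j,)))
                phi[j] += weight * (value(with_j) - value(subset))
    return phi


# Synthetic binary problem standing in for one class-vs-rest task.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
background = X[:15]
explained = X[15:25]

per_seed = []
for seed in range(3):  # the study reports 300 seeds
    model = RandomForestClassifier(n_estimators=10, random_state=seed).fit(X, y)

    def predict_positive(data):
        return model.predict_proba(data)[:, 1]

    shap_values = np.array([exact_shapley(predict_positive, x, background)
                            for x in explained])
    # Median of |SHAP| across samples for this seed (assumed interpretation).
    per_seed.append(np.median(np.abs(shap_values), axis=0))

per_seed = np.array(per_seed)
summary_median = np.median(per_seed, axis=0)  # across seeds, per feature
summary_std = per_seed.std(axis=0)
```

Features whose `summary_median` stays high with a small `summary_std` contribute consistently regardless of the random seed, which is the property the multi-seed design is meant to expose.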
