Metaheuristic optimizers integrated with vision transformer model for severity detection and classification via multimodal COVID-19 images


April 22, 2025

The COVID-19 pandemic had a profound impact on healthcare systems worldwide, and its unpredictable progression and severe effects on lung function continue to challenge clinicians. Advances in diagnostic tools, including molecular testing and imaging modalities, have been crucial in assessing the severity of the disease, guiding treatment decisions, and predicting patient outcomes. While RT-PCR tests remain essential for the diagnosis of COVID-19, imaging techniques such as chest X-ray (CXR) and computed tomography (CT) have proven vital in evaluating lung involvement and determining the severity of the disease¹.

CXR is a frequently used imaging modality due to its availability and ease of use, especially in resource-constrained environments or during pandemic surges. CXR reveals key radiological features associated with COVID-19, including bilateral lung infiltrates, ground-glass opacities (GGOs), and consolidations, which correlate with the severity of lung damage². Its portability and accessibility make it particularly valuable in intensive care units (ICUs), where mobile CXR can be used for rapid bedside evaluations³. Several scoring systems have been developed to quantify lung involvement in COVID-19 patients via CXR. For example, the Brixia score is used to evaluate the presence of opacities in different lung zones, whereas the Airspace Opacity Severity Score (ASOS) is used to assess the percentage of lung affected by airspace opacification. These scoring systems provide standardized methods to determine disease severity, help track disease progression, and guide treatment decisions⁴.
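To make the idea of a zone-based CXR score concrete, the following minimal sketch sums per-zone ratings into a single severity number. It assumes the commonly described Brixia layout of six lung zones, each rated 0 to 3, for a total of 0 to 18; the text above does not spell out the scoring rule, so the zone count, the rating range, and the function name `brixia_score` are assumptions made for illustration.

```python
def brixia_score(zone_scores):
    """Sum six per-zone ratings (each 0, 1, 2, or 3) into a global score in 0..18.

    Assumed layout (not stated in the text): three zones per lung, six in total.
    """
    if len(zone_scores) != 6:
        raise ValueError("expected six lung-zone scores")
    if any(s not in (0, 1, 2, 3) for s in zone_scores):
        raise ValueError("each zone score must be 0, 1, 2, or 3")
    return sum(zone_scores)


# Hypothetical reading: right lung top to bottom, then left lung top to bottom.
example_reading = [1, 2, 3, 0, 2, 3]
print(brixia_score(example_reading))  # 11
```

A single integer like this is easy to track over successive radiographs, which is the main appeal of such scoring systems for monitoring progression.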

However, despite its utility in critical settings, CXR has limitations, particularly in the detection of early or subtle lung changes, making it less useful in the identification of mild or moderate cases of COVID-19. Nonetheless, in settings where rapid assessments are necessary, especially for critically ill patients, CXR remains an indispensable diagnostic tool.

Compared with CXR, CT imaging, particularly high-resolution computed tomography (HRCT), offers a more detailed and sensitive approach for evaluating lung damage caused by COVID-19. CT scans are critical for detecting early lung changes such as GGOs, crazy paving patterns, and consolidations. These features help clinicians differentiate among mild, moderate, and severe COVID-19 cases⁵. The CT severity score (CT-SS) has been widely adopted to evaluate the volume and density of GGOs and consolidations, providing a quantitative assessment of lung involvement and correlating with disease severity.

CT is beneficial in early-stage COVID-19, where patients may present with mild symptoms but have significant lung involvement not visible on CXR. CT can also detect complications associated with severe COVID-19, such as pulmonary embolism and fibrosis, which are critical for managing patients with worsening respiratory function⁶. Both CXR and CT imaging play crucial roles in assessing COVID-19 severity, although their applications differ on the basis of the clinical context. CT is the preferred modality for early detection and comprehensive assessment of lung abnormalities because of its high sensitivity and detailed imaging⁷. However, CXR remains essential in critical care settings due to its accessibility and portability, making it a practical solution for rapid evaluations, particularly in ICUs⁸.

Recent advancements in AI-driven systems have enhanced the efficiency and precision of CT severity scoring, offering reliable and objective assessments of lung involvement. Deep learning models have been integrated with high-resolution CT to automate lung segmentation and scoring, offering rapid assessments of disease severity in clinical practice⁹.

Artificial intelligence (AI) is increasingly important in COVID-19 severity analysis because it integrates imaging techniques such as CXR and CT. AI-based tools for segmentation and classification have been applied to automate the detection of infection patterns, including GGOs and consolidations, which help clinicians quantify lung involvement and predict disease progression¹⁰. These tools enable faster and more accurate assessments, reduce the diagnostic burden on radiologists and improve healthcare resource management.

In recent studies, AI-driven systems have demonstrated high accuracy in automating CT and CXR severity scoring, with results that correlate well with traditional radiologist assessments¹¹. AI models for CT severity scoring can automatically detect and quantify GGOs and consolidations, providing real-time evaluations of disease progression and outcomes. AI has also been used to integrate clinical data with CXR findings to predict ICU admissions and patient mortality, making it an invaluable tool in managing COVID-19 patients.

A comparative study revealed that while CT excels at detecting early-stage lung changes, CXR remains crucial for monitoring disease progression in critically ill patients. In situations where CT is not feasible, CXR serves as a dependable alternative for assessing lung involvement¹². Both CXR and CT are key to evaluating the severity of COVID-19, each offering distinct advantages based on clinical needs. The integration of AI into imaging has further improved diagnostic speed and accuracy, leading to better patient management and outcomes.

In addition to enhancing traditional imaging modalities, artificial intelligence has revolutionized medical imaging analysis through the application of Vision Transformers (ViTs). ViTs leverage self-attention mechanisms to extract intricate features from imaging data, outperforming traditional convolutional neural networks (CNNs) in capturing subtle abnormalities. In recent studies, ViTs have been employed for COVID-19 classification and severity assessment using both CXR and CT images, achieving state-of-the-art accuracy and interpretability. Notable models include COVID-ViT for CT-based COVID-19 classification¹³, xViTCOS for explainable COVID-19 screening via CXR¹⁴, and LT-ViT for multi-label classification of normal, COVID-19, and other pneumonia cases¹⁵.
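The self-attention mechanism mentioned above can be illustrated with a minimal, dependency-free sketch of scaled dot-product attention. It is a simplification rather than any of the cited models: a real ViT applies learned linear projections to obtain queries, keys, and values and uses several attention heads, whereas here the token vectors are used directly as all three (an assumption made to keep the example short).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length token vectors."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three toy "patch tokens"; identity projections are assumed for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens, tokens, tokens)
for row in attn:
    print([round(a, 3) for a in row])
```

Because every token attends to every other token in one step, distant image regions can influence each other directly, which is the property the text credits for capturing subtle abnormalities.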

Optimizers play a crucial role in deep learning by adjusting model parameters to minimize loss and improve performance⁵⁹. Ko et al. (2024) evaluated six optimizers (Adam, AdamW, NAdam, RAdam, SGDW, and Momentum) on three ViT architectures (ViT, FastViT, and CrossViT) using a CXR dataset and identified strategies to enhance accuracy for lung disease prediction, emphasizing the role of optimization techniques¹⁶. Park et al. (2021) demonstrated the efficacy of ViT for COVID-19 detection using a chest X-ray feature corpus, showcasing the model’s capability to generalize across diverse datasets. These models demonstrate the adaptability of ViT to diverse datasets and their potential to generalize effectively across medical imaging tasks¹⁷. To achieve optimal performance with ViT, fine-tuning and careful selection of hyperparameters, such as batch size, layer depth, attention heads, and training epochs, are required. Overall, these studies establish ViT as a powerful alternative to CNNs, offering improved accuracy and explainability in classifying and detecting COVID-19 and pneumonia from medical images.

Existing studies primarily rely on CNN-based approaches for COVID-19 severity classification. Challenges persist in capturing long-range dependencies in medical images, and robust optimization techniques for hyperparameter tuning and feature selection are often lacking. In contrast, the proposed method utilizes ViT, which models both global and local features through self-attention mechanisms, effectively addressing these limitations. Additionally, the integration of metaheuristic algorithms, such as the Grey Wolf Optimizer (GWO) and Particle Swarm Optimization (PSO), sets this framework apart by optimizing hyperparameters and identifying significant features, resulting in enhanced accuracy and computational efficiency.
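As an illustration of how a Grey Wolf Optimizer can search a hyperparameter space, the sketch below implements the standard GWO update, in which each wolf moves toward the average of positions suggested by the three best wolves (alpha, beta, delta). It is not the authors' implementation: the search ranges, the `decode` mapping, and especially `surrogate_loss`, which stands in for "train the ViT and return its validation loss", are hypothetical placeholders.

```python
import random

# Illustrative ranges: log2(batch size), epochs, layer depth, log2(attention heads).
BOUNDS = [(3, 6), (5, 50), (4, 12), (1, 4)]


def decode(x):
    """Map a continuous wolf position to discrete ViT hyperparameters."""
    return {"batch_size": 2 ** round(x[0]), "epochs": round(x[1]),
            "depth": round(x[2]), "heads": 2 ** round(x[3])}


def surrogate_loss(x):
    """Placeholder objective; a real run would train and validate a ViT here."""
    target = [5, 30, 8, 3]
    return sum(((xi - ti) / (hi - lo)) ** 2
               for xi, ti, (lo, hi) in zip(x, target, BOUNDS))


def gwo_minimize(objective, bounds, n_wolves=8, n_iters=30, seed=0):
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best, best_fit = None, float("inf")
    for t in range(n_iters):
        ranked = sorted(wolves, key=objective)
        alpha, beta, delta = (list(w) for w in ranked[:3])
        if objective(alpha) < best_fit:
            best, best_fit = alpha, objective(alpha)
        a = 2.0 * (1 - t / n_iters)  # exploration factor, shrinks from 2 toward 0
        new_wolves = []
        for w in wolves:
            pos = []
            for d, (lo, hi) in enumerate(bounds):
                x = 0.0
                for leader in (alpha, beta, delta):
                    A = 2 * a * rng.random() - a
                    C = 2 * rng.random()
                    x += leader[d] - A * abs(C * leader[d] - w[d])
                pos.append(min(hi, max(lo, x / 3)))  # average, then clip to bounds
            new_wolves.append(pos)
        wolves = new_wolves
    final = min(wolves, key=objective)
    if objective(final) < best_fit:
        best, best_fit = final, objective(final)
    return best, best_fit


best, fit = gwo_minimize(surrogate_loss, BOUNDS)
print(decode(best), round(fit, 4))
```

Each objective evaluation in a real setting is a full training run, so population size and iteration count dominate the cost of this kind of search.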

The contributions of the proposed work for COVID-19 severity analysis (mild, moderate, severe) using CXR and CT images are as follows:

1. The Grey Wolf Optimizer (GWO) is used to optimize Vision Transformer (ViT) hyperparameters, including batch size, epochs, layer depth, and attention heads, for improved severity classification performance.
2. The optimized ViT model is trained and employed to extract deep features from CXR and CT images, effectively representing the severity levels.
3. Particle Swarm Optimization (PSO) is applied to select the most significant features from the extracted ViT features, reducing redundancy and enhancing classification accuracy.
4. A Multi-Layer Perceptron (MLP) classifier, with hyperparameters specifically tuned, is trained on the optimized features to accurately classify COVID-19 severity levels into mild, moderate, and severe categories.
5. The final model is evaluated using metrics such as accuracy, precision, recall, and F1-score, along with Grad-CAM visualizations, to validate the effectiveness of the combined GWO, PSO, and MLP approach.
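The PSO-based feature selection step listed above can be sketched with a binary PSO, in which each particle is a 0/1 mask over the feature vector and a sigmoid of the velocity gives the probability of keeping a feature. This is a generic illustration under stated assumptions, not the paper's configuration: the inertia and acceleration constants are typical textbook values, and `toy_fitness` with its `RELEVANCE` scores is a hypothetical stand-in for the accuracy of a classifier trained on the masked ViT features.

```python
import math
import random

# Hypothetical per-feature relevance, standing in for classifier accuracy.
RELEVANCE = [0.9, 0.05, 0.8, 0.02, 0.7, 0.01, 0.03, 0.6]


def toy_fitness(mask, penalty=0.1):
    """Reward relevant features and penalize the number of features kept."""
    return sum(r for r, m in zip(RELEVANCE, mask) if m) - penalty * sum(mask)


def binary_pso(fitness, n_features, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Inertia plus pulls toward the personal and global best masks.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, vel[i][d]))  # clamp velocity
                keep_prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < keep_prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


selected, score = binary_pso(toy_fitness, len(RELEVANCE))
print(selected, round(score, 3))
```

The penalty term is what pushes the swarm toward smaller feature subsets; without it, keeping every feature would never be worse under this toy fitness.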

Related work

The COVID-19 pandemic has spurred major progress in the use of medical imaging methods, such as CXR and CT, for early diagnosis and assessment of severity. Machine learning (ML) and deep learning (DL) models, particularly convolutional neural networks (CNNs), have become critical tools for automating the diagnostic process, especially in settings where medical resources are limited⁵⁸. Pre-trained CNN models, such as ResNet50, VGG16, and InceptionV3, have proven highly effective for early detection of COVID-19 and for distinguishing it from other conditions¹⁸. Leveraging transfer learning and binary cross-entropy loss, these models have been applied successfully to CXR and CT scans, particularly in the absence of widespread RT-PCR testing¹⁹.
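For reference, the binary cross-entropy loss mentioned above averages -[y log p + (1 - y) log(1 - p)] over the samples, where y is the 0/1 label and p is the predicted probability of the positive class. The short sketch below computes it directly; the function name and the clipping constant `eps` are illustrative choices, not taken from the cited studies.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]  # e.g. 1 = COVID-19, 0 = other
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
uncertain = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
print(confident < uncertain)  # True
```

The loss is lowest when confident predictions agree with the labels and grows sharply for confident mistakes, which is why it suits the two-class COVID-19 versus non-COVID-19 setting described here.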

However, despite the success of CNNs, manual interpretation of CXR images remains time-consuming and prone to error, necessitating the development of automated, multi-stage AI systems. One study²⁰ proposed a multi-stage ensemble learning system to classify COVID-19 severity and localize infections, improving diagnostic accuracy. Similarly, segmentation models such as U-Net have been used to extract features such as ground-glass opacities and other infection markers from CT images, aiding in the severity classification of COVID-19 cases²¹.

Despite these advances, CNNs are limited by their small receptive fields, which restrict their ability to capture global image features. This shortcoming matters when analyzing complex structures such as the lungs. ViT models have emerged as a solution, offering several advantages over CNNs, particularly for processing CXR and CT scans in lung disease detection. The ViT model uses self-attention mechanisms to model long-range dependencies, effectively capturing both local and global features²². This global context is essential for identifying abnormalities in lung images, and ViT models have been shown to outperform CNNs in quantifying lung infection severity. Similarly, another study²³ reported that the ViT model was superior at detecting COVID-19 severity, especially when combined with attention visualization techniques such as Grad-CAM.
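Before self-attention can relate distant lung regions, a ViT splits the image into fixed-size patches and treats each flattened patch as one token. The sketch below shows only that tokenization step on a tiny grid; the learned linear embedding and position encodings of a real ViT are omitted, and the function name `image_to_patches` is illustrative.

```python
def image_to_patches(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# A 4x4 toy "image" whose pixel values are simply 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, 2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

Since attention is computed between all pairs of these tokens, a patch in one lung zone can be related to any other patch in a single layer, whereas a convolution only sees its local neighborhood.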

Another key advantage of the ViT model is its reduced reliance on deep convolutional layers, which minimizes the need for complex feature hierarchies and lowers the risk of overfitting. This is particularly relevant in medical imaging, where data are often limited. One study²⁴ established that ViT-based models outperformed CNNs, even on smaller datasets, especially for the diagnosis of post-COVID-19 pulmonary conditions.

Recent research has explored hybrid models that combine CNNs and ViT to further increase diagnostic accuracy. One study²⁵ developed a fusion model integrating ResNet and ViT for 3D CT image classification, exploiting the strength of CNNs in capturing local details and the ability of ViT to model global context. This approach has proven more accurate at identifying conditions such as COVID-19 and tuberculosis²⁶. Similarly, another study²⁷ introduced a two-stage ViT model that outperformed traditional CNNs in COVID-19 severity classification by detecting subtle patterns across lung regions.

Despite these benefits, the ViT model presents challenges due to its high computational cost. Researchers have addressed this issue by developing more efficient ViT variants. One study²⁸ explored ViT models with fewer trainable parameters for COVID-19 severity assessment, demonstrating that these models can maintain high diagnostic accuracy with reduced computational demands. Another²⁹ proposed an efficient feature extraction framework for pneumonia detection via the ViT model, reducing computational cost while retaining strong performance. A third³⁰ optimized pre-trained ViT models for COVID-19 classification using Stochastic Configuration Networks (SCNs), improving performance and avoiding overfitting, which is particularly important in data-scarce medical applications⁵⁶.

Emerging ViT-based models such as PneuNet and IEViT have shown significant promise in CXR image classification. PneuNet improves COVID-19 pneumonia diagnosis by applying multi-head attention to channel patches, whereas IEViT enhances sensitivity, precision, and generalizability through input optimization³¹. These models offer effective solutions to automate diagnoses and reduce dependence on radiologists, particularly in remote or underserved areas³². ViT frameworks handle diverse image resolutions efficiently and are well suited to diagnosing conditions such as pneumonia and COVID-19, improving healthcare accessibility and outcomes³³.

Existing studies on COVID-19 severity detection typically rely on CNNs, which excel at capturing local features but struggle with long-range dependencies, limiting their application to complex medical images such as CXR and CT scans. CNN-based approaches such as ResNet50, VGG16, and InceptionV3 have succeeded in early diagnosis but are limited by small receptive fields and an inability to capture global image features. ViT has emerged as a solution, leveraging self-attention mechanisms to model both global and local features effectively and addressing these CNN shortcomings. Hybrid models that integrate ViT and CNNs have demonstrated improved diagnostic accuracy, and lightweight variants and optimized architectures have reduced the computational demands of ViT. This review identifies several research gaps, including a predominant focus on binary classification, limited exploration of multi-class severity classification, and underutilization of multimodal data. The proposed framework integrates ViT with metaheuristic algorithms, namely GWO for hyperparameter tuning and PSO for feature selection, addressing these gaps and offering a scalable, robust solution for COVID-19 severity detection.
