
Automated multi-model framework for malaria detection using deep learning and feature fusion

The primary objective of this article is to develop and implement a robust AI-driven model for the accurate diagnosis of malaria using blood film microscopic images. The proposed methodology is designed as a multi-phase approach, with each phase contributing to the precision and reliability of the diagnostic process. The comprehensive workflow, as illustrated in Fig. 1, encompasses various stages of model development and implementation, including data preprocessing, feature extraction, model training, validation, and testing. This structured pipeline aims to ensure a high degree of diagnostic accuracy and efficiency, addressing the critical need for effective malaria detection in clinical and field settings.

Fig. 1
figure 1

The proposed pipeline for malaria diagnosis illustrates (a) an End-to-End DL model, (b) adopted algorithms for feature extraction, and (c) adopted algorithms for classification.

  1. Dataset.

The “Cell Images for Detecting Malaria” dataset21, hosted on Kaggle, is a well-curated collection of microscopic images of blood smears designed to facilitate research and development in malaria detection using machine learning and deep learning models. It comprises 27,558 images, equally divided into two categories: Parasitized (malaria-infected cells) and Uninfected (healthy cells). The images were captured using light microscopy at a consistent magnification of 100x, ensuring high-resolution visual clarity ideal for machine learning and deep learning applications. Each original image has a size of 150 × 150 pixels. The dataset underwent some preprocessing, such as resizing, to match the input layer of the proposed AI model.

The dataset is systematically organized into two directories. The Parasitized directory contains images of red blood cells infected with the Plasmodium parasite, marked by distinct visual features such as irregular shapes, dark spots, and uneven internal textures. The Uninfected directory includes images of healthy red blood cells, characterized by their smooth texture, uniform shape, and absence of parasitic artifacts. This structure simplifies data loading and preprocessing tasks for researchers and developers.

One of the standout features of this dataset is its balanced distribution, with an equal number of images in each category. This balance minimizes class imbalance issues, ensuring that machine learning models can train effectively without bias toward a particular class. Additionally, the dataset exhibits biological diversity, encompassing a wide range of cell morphologies, staining variations, and patterns that mirror the complexities encountered in real-world clinical settings.

Overall, the dataset is an invaluable resource for advancing AI-driven healthcare solutions. It is particularly suited for training, validating, and testing classification models as well as for transfer learning, where pre-trained models are fine-tuned for malaria detection tasks. The high-quality, diverse, and well-labelled nature of this dataset ensures that models trained on it are robust, reliable, and capable of performing effectively in real-world diagnostic scenarios. The dataset was split into 70% training, 15% validation, and 15% testing during the implementation of the proposed AI model. Examples of the utilized dataset are provided in Fig. 2. The distribution of the image count among the targeted classes is shown in Table 2.
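As an illustrative sketch (not the authors' code), the 70%/15%/15% split described above can be expressed in Python; the file names below are placeholders for the actual dataset paths.

```python
import random

def split_dataset(paths, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a list of image paths and split it into train/val/test subsets."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    n = len(paths)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]      # remainder (~15%) forms the test set
    return train, val, test

# The full dataset holds 27,558 images (13,779 per class).
dummy_paths = [f"img_{i}.png" for i in range(27558)]
train, val, test = split_dataset(dummy_paths)
print(len(train), len(val), len(test))
```

With 27,558 images this yields 19,290 training, 4,133 validation, and 4,135 test samples.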

Table 1 A comparison of related works concerning malaria diagnosis using AI techniques.
Table 2 Distribution of the dataset utilized.
Fig. 2
figure 2

Samples for the utilized dataset: (a) uninfected samples, (b) infected samples.

  2. Phase (a): Implementation of End-to-End Deep Learning Models.

In this phase, the methodology combines the power of transfer learning and end-to-end deep learning approaches to process microscopic blood film images for malaria diagnosis. Three state-of-the-art CNN architectures—ResNet-5022, VGG-1623, and DenseNet-20124—are employed as feature extractors. These models utilize pre-trained weights from large-scale datasets, enabling them to recognize intricate patterns in blood cell images. In these models, we performed transfer learning by retraining the head of each CNN, comprising a fully connected layer, a dropout layer, a SoftMax layer, and a classification layer. Each of the proposed models keeps its main parameters in the convolutional layers, including the number of filters, the stride lengths, and their weights.

Beyond serving as feature extractors, these architectures are also trained as end-to-end models, allowing the entire network—from the input to the classification layers—to be fine-tuned specifically for malaria diagnosis. By adapting pre-trained models to the task at hand, the networks learn to optimize both low-level and high-level features relevant to identifying infected and healthy cells. This dual approach harnesses the strengths of transfer learning while capitalizing on the adaptability of deep learning models, creating a robust foundation for accurate diagnosis. The process of feature extraction from a CNN model can be mathematically represented as shown in Eq. (1).

$$F={f}_{CNN}\left(X;{\theta}_{pre-trained}\right)$$

(1)

Where \(X\) is the input image, \(f_{CNN}\) represents the feature extraction function of the CNN model, \(\theta_{pre-trained}\) are the pre-trained weights, and \(F\) is the resulting feature vector.

Each architecture contributes uniquely to feature extraction. ResNet-50 captures 2048 deep hierarchical features through its residual learning framework, which introduces shortcut connections to bypass layers. This residual connection can be described as shown in Eq. (2).

$$\:y=f\left(x\right)+x$$

(2)

Where \(x\) is the input to the residual block, \(f\left(x\right)\) is the learned transformation, and \(y\) is the output of the residual block.
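The identity shortcut of Eq. (2) can be illustrated with a toy NumPy residual block (a sketch; the transformation \(f\) here is two assumed linear layers with a ReLU, not ResNet-50's actual convolutional block).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Eq. (2): y = f(x) + x. The learned transformation f (two toy
    linear layers with a ReLU) is added back onto the input x, so the
    block only has to learn a residual correction."""
    fx = relu(x @ w1) @ w2   # f(x), the learned transformation
    return fx + x            # shortcut connection bypassing the layers

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
# Zero-initialized weights make f(x) = 0, so the block reduces to the
# identity mapping—gradients can flow through the shortcut unchanged.
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
y = residual_block(x, w1, w2)
print(np.allclose(y, x))  # True
```

This identity behaviour at initialization is what lets very deep networks such as ResNet-50 avoid the vanishing gradient problem mentioned above.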

VGG-16, known for its simple and sequential structure, extracts 4096 spatial features using uniform convolutional layers with small \(3\times 3\) filters. Equation (3) presents the output of these layers.

$$\:{z}_{i,j,k}=\sum\limits_{m=0}^{M-1}\sum\limits_{n=0}^{N-1}{x}_{i+m,j+n}.\:{w}_{m,n,k}+{b}_{k}$$

(3)

Where \(\:{z}_{i,j,k}\) is the output feature map, \(\:{x}_{i+m,j+n}\) is the input feature map, \(\:{w}_{m,n,k}\)​ are the weights of the filter, \(\:{b}_{k}\:\)is the bias term, and \(\:M\) and \(\:N\) are the filter dimensions. This approach ensures precise detection of spatial characteristics such as cell morphology and texture.
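Equation (3) can be implemented directly as a naive valid convolution in NumPy (a didactic sketch; real frameworks use optimized vectorized kernels).

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Direct implementation of Eq. (3):
    z[i,j,k] = sum_{m,n} x[i+m, j+n] * w[m,n,k] + b[k]."""
    M, N, K = w.shape            # filter height, width, and filter count
    H, W = x.shape               # input feature map size
    z = np.zeros((H - M + 1, W - N + 1, K))
    for i in range(H - M + 1):
        for j in range(W - N + 1):
            for k in range(K):
                z[i, j, k] = np.sum(x[i:i+M, j:j+N] * w[:, :, k]) + b[k]
    return z

x = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input feature map
w = np.ones((3, 3, 1)) / 9.0                  # one 3x3 averaging filter
b = np.zeros(1)
z = conv2d_valid(x, w, b)
print(z.shape)       # (3, 3, 1)
print(z[0, 0, 0])    # mean of the top-left 3x3 patch: 6.0
```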

DenseNet-201 enhances feature propagation and reuse through densely connected layers, where each layer is connected to all previous layers. This can be expressed as presented in Eq. (4).

$${x}_{l}={H}_{l}\left(\left[{x}_{0},{x}_{1},\dots,{x}_{l-1}\right]\right)$$

(4)

where \(x_{l}\) is the output of the \(l^{th}\) layer, \(H_{l}\) is the learned transformation, and \(\left[x_{0},x_{1},\dots,x_{l-1}\right]\) represents the concatenation of feature maps from preceding layers. This connectivity results in the extraction of 1920 highly relevant features, contributing to a rich feature set.

Beyond feature extraction, these CNN architectures are fine-tuned as end-to-end deep learning models. Fine-tuning optimizes the parameters \(\theta_{fine-tuned}\) for the specific task of malaria diagnosis by minimizing a loss function, as shown in Eq. (5).

$${\theta}_{fine-tuned}=\underset{\theta}{\text{arg}\,\text{min}}\;L\left(f\left(X;\theta\right),y\right)$$

(5)

where \(\:f\left(X;\theta\:\right)\) is the predicted output of the model, \(\:y\) is the ground truth label, and \(\:L\) is the cross-entropy loss function. This process ensures that the models learn both low-level features (such as textures and shapes) and high-level features (such as parasitic patterns) unique to blood smear microscopy.
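The cross-entropy loss \(L\) minimized in Eq. (5) can be sketched in NumPy for the binary case (the logit values below are hypothetical, purely to illustrate the computation).

```python
import numpy as np

def cross_entropy(logits, y):
    """L = -log softmax(logits)[y], the loss minimized in Eq. (5)."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[y]

# Hypothetical 2-class logits for one blood-cell image; label 1 = infected.
logits = np.array([0.2, 2.3])
loss_correct = cross_entropy(logits, 1)   # small: model agrees with label
loss_wrong = cross_entropy(logits, 0)     # large: model disagrees
print(loss_correct, loss_wrong)
```

Minimizing this loss over the training set drives \(f\left(X;\theta\right)\) toward the ground-truth labels \(y\).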

For more information, Table 3 summarizes the hyperparameters used to train the proposed models, reflecting a carefully tuned configuration that balances learning stability, efficiency, and generalization performance through mini-batch optimization, learning rate scheduling, and periodic validation; the selection of training hyperparameters can directly influence the performance of the proposed models25. Moreover, Fig. 3 provides a flow chart of the proposed algorithm, including the training and testing phases.

Table 3 Different training hyperparameters for the proposed CNN models.
Fig. 3
figure 3

Flow chart of the proposed approach for malaria detection.

  3. Phase (b): Feature Fusion and Dimensionality Reduction.

Following the independent extraction of features using ResNet-50, VGG-16, and DenseNet-201, the next step in the methodology involves feature fusion. This process combines the diverse feature sets generated by these models into a unified, comprehensive feature vector. The fusion operation is performed through concatenation, which aggregates the feature vectors from the three models according to Eq. (6).

$${F}_{fused}=\left[{F}_{ResNet},{F}_{VGG16},{F}_{DenseNet}\right]$$

(6)

Where \(F_{ResNet}\), \(F_{VGG16}\), and \(F_{DenseNet}\) are the feature vectors extracted from ResNet-50, VGG-16, and DenseNet-201, respectively, and \(F_{fused}\) represents the fused feature vector. The total number of features after concatenation is 8064 (2048 + 4096 + 1920).
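The concatenation of Eq. (6) is a single NumPy call; the random matrices below merely stand in for the actual per-model feature matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 4  # stand-in batch; the full dataset has 27,558 images
F_resnet = rng.standard_normal((n_samples, 2048))  # ResNet-50 features
F_vgg = rng.standard_normal((n_samples, 4096))     # VGG-16 features
F_dense = rng.standard_normal((n_samples, 1920))   # DenseNet-201 features

# Eq. (6): concatenate the three vectors per sample along the feature axis.
F_fused = np.concatenate([F_resnet, F_vgg, F_dense], axis=1)
print(F_fused.shape)  # (4, 8064)
```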

This fused feature vector provides a richer representation by combining both global patterns (captured by ResNet-50) and local spatial details (captured by VGG-16 and DenseNet-201). While the combined features significantly enhance the model’s capability to distinguish between malaria-infected and healthy cells, their high dimensionality poses challenges in terms of computational complexity and overfitting.

To address these challenges, Principal Component Analysis (PCA) is employed for dimensionality reduction. PCA transforms the high-dimensional fused feature vector into a lower-dimensional space by finding the principal components that capture the maximum variance in the data. The transformation can be represented as shown in Eq. (7).

$${F}_{PCA}={F}_{fused}\,W$$

(7)

Where \(F_{fused}\) is the original fused feature matrix of size \(N\times 8064\), \(N\) is the number of samples, and \(W\) is the projection matrix. \(W\) is computed by solving the eigenvalue decomposition of the covariance matrix of \(F_{fused}\).

$$C=\frac{1}{N}{F}_{fused}^{T}{F}_{fused}$$

(8)

Equation (8) introduces the covariance matrix \(C\) of the fused feature matrix, a square matrix that captures the relationships between the features in the fused feature space. By selecting only the top \(k\) principal components that capture 95% of the variance, PCA ensures that the most discriminative features are preserved while reducing the feature space. This reduction minimizes computational complexity, mitigates overfitting, and accelerates subsequent classification tasks, all while maintaining high diagnostic accuracy.
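Equations (7) and (8) can be implemented directly via the eigen-decomposition of the covariance matrix; the small random matrix below is a stand-in for the actual \(N\times 8064\) fused matrix (dimensions shrunk for speed).

```python
import numpy as np

def pca_fit_transform(F, var_kept=0.95):
    """PCA via the covariance eigen-decomposition of Eqs. (7)-(8)."""
    F_centered = F - F.mean(axis=0)
    C = (F_centered.T @ F_centered) / F.shape[0]   # Eq. (8): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()     # cumulative explained variance
    k = int(np.searchsorted(ratio, var_kept)) + 1  # top-k for 95% variance
    W = eigvecs[:, :k]                             # projection matrix
    return F_centered @ W, k                       # Eq. (7): F_PCA = F_fused W

rng = np.random.default_rng(0)
# Toy stand-in for the fused matrix: 200 samples, 120 features, rank <= 50.
F = rng.standard_normal((200, 50)) @ rng.standard_normal((50, 120))
F_pca, k = pca_fit_transform(F)
print(F_pca.shape, k)
```

Because the toy matrix has rank at most 50, at most 50 components suffice to capture 95% of the variance, illustrating how PCA collapses a redundant feature space.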

  4. Phase (c): Hybrid Classification Framework.

After dimensionality reduction, the refined feature vector is processed through a hybrid classification framework that combines the strengths of traditional machine learning and deep learning models. This framework consists of two components: Support Vector Machine (SVM) and Long Short-Term Memory Networks (LSTM). Each component plays a vital role in ensuring robust and accurate classification, complementing each other’s capabilities.

  a. Support Vector Machine (SVM).

SVM is a powerful supervised learning algorithm that excels at identifying optimal decision boundaries between classes, even in high-dimensional spaces26. In this framework, the reduced feature vector is provided as input to the SVM classifier, which finds a hyperplane that maximally separates the two classes: malaria-infected and healthy samples. The core objective of SVM is to maximize the margin, that is, the distance between the separating hyperplane and the closest data points from each class, known as the support vectors. This ensures that the classifier generalizes well to unseen data, making it particularly effective for binary classification tasks such as malaria diagnosis; its robustness and reliability make it an essential component of this hybrid framework. The optimal hyperplane can be mathematically expressed through Eqs. (9–11).

$$\:f\left(x\right)={w}^{T}x+b=0$$

(9)

Where \(x\) is the input feature vector, \(w\) is the weight vector orthogonal to the hyperplane, and \(b\) is the bias term.

To find the optimal hyperplane, SVM solves the following convex optimization problem:

$$\underset{w,b}{\text{min}}\;\frac{1}{2}{\parallel w\parallel}^{2}$$

(10)

$$\text{subject to:}\quad {y}_{i}\left({w}^{T}{x}_{i}+b\right)\ge 1,\quad \forall i$$

(11)

Where \(y_{i}\in\left\{1,-1\right\}\) are the class labels and \(x_{i}\) are the training instances.

By maximizing the margin \(\frac{2}{\parallel w\parallel}\), SVM minimizes overfitting and improves generalization performance on unseen test data. For non-linearly separable data, SVM incorporates kernel functions \(K(x_{i},x_{j})\) to map data into a higher-dimensional space, where linear separation becomes feasible. In the context of this study, the SVM classifier processes the transformed PCA feature set to learn the best separating hyperplane between infected and uninfected cells. Its mathematical rigor, geometric interpretation, and generalization capability make SVM an indispensable component of the proposed hybrid classification system.
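A scikit-learn sketch of this step (the 2-D point clouds below are hypothetical stand-ins for the PCA-reduced feature vectors; labels 1 = infected, −1 = uninfected):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical, well-separated 2-D stand-ins for the two classes.
X_inf = rng.standard_normal((50, 2)) + np.array([2.0, 2.0])
X_uninf = rng.standard_normal((50, 2)) + np.array([-2.0, -2.0])
X = np.vstack([X_inf, X_uninf])
y = np.array([1] * 50 + [-1] * 50)

# A linear kernel learns the maximum-margin hyperplane w^T x + b = 0 of
# Eqs. (9)-(11); an RBF kernel would be substituted via kernel="rbf"
# for non-linearly separable data.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))
```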

  b. Long Short-Term Memory Networks (LSTM).

LSTM, a type of recurrent neural network (RNN), is used in this framework to capture patterns and contextual relationships in the data. Although the feature vector itself is not sequential, LSTM networks excel at modelling dependencies among features, allowing the system to recognize intricate patterns indicative of malaria-infected cells12,19. LSTMs are designed with specialized mechanisms, such as gates that regulate the flow of information, ensuring the model retains only the most relevant patterns and discards unnecessary details. This makes LSTMs particularly adept at learning complex and nuanced representations from data, adding an extra layer of interpretability and robustness to the classification process. In the context of malaria diagnosis, LSTM adds an interpretive layer capable of capturing complex feature interactions that may not be linearly separable27.

LSTM achieves this through a gated memory cell architecture, which consists of the forget gate, input gate, cell state update, and output gate. These gates control the flow of information, selectively remembering or forgetting parts of the input and prior hidden states, thereby enabling the network to focus on the most informative patterns while discarding irrelevant ones. The internal operations of an LSTM cell at time step \(t\) are described by the following Eqs. (12–17):

  • Forget gate.

    $${f}_{t}=\sigma\left({W}_{f}\cdot\left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$$

    (12)

  • Input gate.

    $$\:{i}_{t}=\sigma\:\left({W}_{i}\cdot\:\left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)\:\:\:\:$$

    (13)

$$\:{\stackrel{\sim}{C}}_{t}=tanh\left({W}_{c}\cdot\:\left[{h}_{t-1},{x}_{t}\right]+{b}_{c}\right)\:\:$$

(14)

  • Cell state update.

    $${C}_{t}={f}_{t}\cdot\:{C}_{t-1}+\:{i}_{t}\cdot\:{\stackrel{\sim}{C}}_{t}\:\:\:$$

    (15)

  • Output gate.

    $$\:{o}_{t}=\sigma\:\left({W}_{o}\cdot\:\left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)\:\:$$

    (16)

$$\:{h}_{t}=\:{o}_{t}\cdot\:\text{tanh}\left({C}_{t}\right)\:\:$$

(17)

Where \(x_{t}\) is the input at time step \(t\), \(h_{t-1}\) is the previous hidden state, \(C_{t}\) is the current cell state, \(\sigma\) is the sigmoid activation function, \(W\) and \(b\) are the trainable weights and biases, and \(tanh\) is the hyperbolic tangent activation. These mechanisms enable the LSTM to retain long-term dependencies and filter out noise, which is critical when learning from complex biological patterns such as variations in red blood cell morphology in malaria.

The designed RNN architecture incorporates an LSTM-based deep learning structure tailored for classifying malaria-infected versus uninfected samples using the reduced feature set obtained from PCA. The network begins with a sequence input layer configured for an input dimension of 3135 features, representing the principal components derived from the original high-dimensional fused feature vector. This is followed by an LSTM layer with 128 hidden units, allowing the network to capture temporal or contextual dependencies across the input features, even though they are not time-series data in the classical sense.

The LSTM layer is succeeded by a fully connected layer that maps the learned feature representations to the desired number of output classes (in this case, two). A SoftMax layer follows, which converts the raw class scores into normalized probabilities, enabling probabilistic interpretation of the classification outputs. Finally, the classification layer computes the loss during training and evaluates the model’s prediction performance. This architecture effectively leverages the LSTM’s capacity to model complex inter-feature relationships, improving the system’s ability to differentiate between malaria-infected and uninfected blood samples. Figure 4 depicts the detailed structure of the proposed network.
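A minimal PyTorch sketch of the described network (the class name and the treatment of each 3135-dimensional PCA vector as a length-one sequence are assumptions; during training, the softmax and classification-layer loss would typically be fused into `nn.CrossEntropyLoss` applied to the raw logits):

```python
import torch
import torch.nn as nn

class MalariaLSTM(nn.Module):
    """Sequence input (3135 PCA features) -> LSTM(128 hidden units)
    -> fully connected -> softmax over the 2 target classes."""
    def __init__(self, n_features=3135, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, n_features); each PCA feature vector is
        # treated as a length-1 "sequence", as the features are not
        # time-series data in the classical sense.
        _, (h_n, _) = self.lstm(x)
        logits = self.fc(h_n[-1])           # last hidden state -> class scores
        return torch.softmax(logits, dim=1)  # normalized class probabilities

model = MalariaLSTM()
probs = model(torch.randn(4, 1, 3135))  # batch of 4 PCA feature vectors
print(probs.shape)  # torch.Size([4, 2])
```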

Fig. 4
figure 4

Structure of the proposed RNN (LSTM) model for malaria detection.

  5. Decision Aggregation.

To further improve the reliability and accuracy of the classification process, the methodology incorporates a majority voting mechanism. This mechanism is used to aggregate the predictions from multiple classifiers, including the three end-to-end deep learning models (ResNet-50, VGG-16, and DenseNet-201), as well as the SVM and LSTM classifiers. By combining the outputs of these diverse models, majority voting ensures a consensus-based decision, reducing the likelihood of errors arising from the limitations of any single model28.

In the majority voting process, each classifier independently predicts whether the input blood film image corresponds to a malaria-infected or non-infected sample. These predictions are treated as votes, with each model contributing one vote to the final decision. The class label with the majority of votes is selected as the final classification outcome. For instance, if the five models produce predictions as follows:

  • ResNet-50: Malaria-infected.

  • VGG-16: Malaria-infected.

  • DenseNet-201: Non-infected.

  • SVM: Malaria-infected.

  • LSTM: Non-infected.

Hence, the majority class (Malaria-infected in this case) is chosen as the final decision.
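The aggregation step above reduces to a vote count, sketched here in plain Python with the worked example's (hypothetical) predictions:

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-model class votes; the label with the most votes wins.
    With five voters and two classes, a tie cannot occur."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]

# The worked example above: three of the five models vote "infected".
votes = {
    "ResNet-50": "infected",
    "VGG-16": "infected",
    "DenseNet-201": "uninfected",
    "SVM": "infected",
    "LSTM": "uninfected",
}
print(majority_vote(votes))  # infected
```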

This approach is particularly effective in leveraging the complementary strengths of the models involved. The deep learning models provide robust feature extraction and pattern recognition capabilities, while the SVM contributes precise decision boundary optimization, and the LSTM captures complex relationships within the data. By aggregating these diverse perspectives, majority voting minimizes the influence of outlier predictions and improves overall classification robustness.

Moreover, majority voting is inherently adaptable and scalable. Additional classifiers can be integrated into the voting process without significant alterations to the system, further enhancing its versatility. This mechanism also reduces the impact of noisy data or model-specific biases, as the final decision reflects a consensus rather than relying on a single classifier’s output. The outcome of the majority voting process provides the final classification, reliably determining whether the input blood film image is malaria-infected or non-infected. This consensus-based strategy ensures that the diagnostic system delivers high levels of accuracy and confidence, making it suitable for real-world clinical applications where reliability is critical.
