A publicly available pharyngitis dataset and baseline evaluations for bacterial or nonbacterial classification

Categories: Disease & Virus

August 14, 2025

Data Collection

The data for this study were gathered during patient visits to general practitioners in two geographically distinct regions of Iran. One region, located in a mountainous area, has a cold climate, while the other, situated near the coast of the Persian Gulf, is hot and humid. These contrasting climates allowed data to be collected across a diverse range of environmental contexts, enriching the dataset and broadening its applicability across population segments. Data collection ran from October 2023 to May 2024 and focused on patients with common-cold-related symptoms. Participation was entirely voluntary. All participants provided written informed consent after the attending physician explained the study’s purpose, potential outcomes, voluntary nature, and confidentiality measures; the consent included permission for anonymized clinical data and images to be used in current and future research and publications internationally. Ethical approval was obtained prior to the study (research ethics committee Approval ID: IR.BPUMS.REC.1403.282).

All collected data, including throat images and associated demographic information, were anonymized and used solely for research purposes to advance the development of deep learning diagnostics. The risk of participant identification was carefully considered. All personally identifiable information, including names, dates of birth, contact details, and any other direct identifiers, was removed from the dataset. Each participant was assigned a unique code to replace identifying information. Additionally, facial features in medical images (if present) were obscured or cropped where applicable.
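
The de-identification step described above, removing direct identifiers and assigning each participant a unique code, can be illustrated with a minimal Python sketch. The field names and the use of a random UUID as the code are assumptions made for illustration; the source does not describe the actual record layout or coding scheme.

```python
import uuid

# Hypothetical field names; the source does not list the real columns.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "phone", "address"}


def anonymize(record: dict) -> dict:
    """Return a copy of `record` without direct identifiers,
    tagged with a random participant code (assumed scheme)."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["participant_code"] = uuid.uuid4().hex
    return cleaned


# Made-up example record, not taken from the dataset.
raw = {"name": "Jane Doe", "date_of_birth": "1990-01-01", "age": 34, "gender": "F"}
safe = anonymize(raw)
print(sorted(safe))  # ['age', 'gender', 'participant_code']
```

A random code, unlike a hash of the name, cannot be reversed by guessing identities, which is one reason it is a common choice for this kind of replacement.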

Image Processing

Throat images were captured using two smartphone models, the Samsung Galaxy S21 Ultra and the Xiaomi Redmi 8 Pro, selected for their high-quality cameras. Using two models introduced natural variation in image quality, and the differing lighting conditions in the two cities added further variety, both of which enrich the dataset. Each throat image was taken in a well-lit room with the smartphone’s flashlight switched on to ensure clear visibility of the throat area. The camera was positioned directly in front of the patient’s open mouth and focused on the back of the throat, the most relevant region for pharyngitis diagnosis.

Alongside the images, key demographic and clinical data were recorded, including the patient’s age, gender, and symptoms. These additional data points allowed for a more comprehensive analysis and facilitated the exploration of potential correlations between demographic characteristics and the different types of pharyngitis.

After collection, the images underwent a rigorous quality control process. Each image was reviewed for clarity, and blurred or poorly lit images were excluded from the dataset. Misaligned images were manually corrected through rotation and cropping to ensure uniformity, with the crop emphasizing the throat area, the region most indicative of pharyngitis. The final images in the repository are cropped sections of the originals and were not resized, so their resolutions vary. After this initial review, a second round of quality control was performed to ensure the dataset’s reliability and uniformity. Ultimately, images that at least three physicians agreed were unsuitable for diagnosis were excluded, and 742 high-quality images were retained from the 860 originally collected.
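
The final exclusion rule and the resulting retention rate can be sketched in Python. The per-image vote counts below are invented for illustration; only the threshold of three physicians and the totals of 742 and 860 come from the text.

```python
# Hypothetical per-image counts of physicians who judged the image
# unsuitable for diagnosis; the real review records are not in the source.
unsuitable_votes = {"img_001": 0, "img_002": 3, "img_003": 1, "img_004": 5}

EXCLUSION_THRESHOLD = 3  # "at least 3 physicians agreed", per the text


def is_retained(n_unsuitable: int, threshold: int = EXCLUSION_THRESHOLD) -> bool:
    """Keep an image unless `threshold` or more physicians rejected it."""
    return n_unsuitable < threshold


retained = [name for name, n in unsuitable_votes.items() if is_retained(n)]
print(retained)  # ['img_001', 'img_003']

# Counts reported in the text: 742 images retained out of 860 collected.
collected, kept = 860, 742
print(collected - kept)           # 118
print(f"{kept / collected:.1%}")  # 86.3%
```

By these figures, 118 images (about 13.7%) were removed during quality control.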

Diagnostic Process

Upon finalizing the dataset, a standardized diagnostic process was used to classify each image based on the type of pharyngitis. This classification was performed by a team of experienced physicians. On average, each image was reviewed by six physicians, although the number of reviewers varied, with some images being assessed by as few as four and others by as many as nine experts. The goal was to ensure a reliable and accurate classification by leveraging the collective expertise of the physician team. One doctor made the diagnosis by examining the patient’s throat in person, while the others based their diagnoses solely on images.

Pharyngitis is typically classified into two main categories: bacterial and nonbacterial (e.g., viral, allergic, fungal, and normal). Our records are divided into the same two categories. Because the dataset contains no cases of fungal pharyngitis, our nonbacterial category includes only viral, allergic, and normal cases. In general, the treatment for all of these nonbacterial cases is supportive care. A majority vote among the physicians determined the final classification for each image. Where significant discrepancies in diagnosis were observed, the image was reassessed by an additional physician who provided an independent evaluation. This step helped resolve inconsistencies and ensured a high degree of accuracy in the final dataset. The overall workflow and a summary of the dataset creation process are presented in Fig. 2 and Table 1.
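
The majority-vote labeling can be illustrated with a short Python sketch. The text does not define what counts as a "significant discrepancy", so the `min_margin` parameter is an assumption: by default, only an exact tie is flagged for reassessment by an additional physician. The function name and vote lists are likewise hypothetical.

```python
from collections import Counter


def majority_vote(votes, min_margin=1):
    """Return (label, needs_review) for a list of physician votes.

    `min_margin` is an assumed stand-in for the undefined
    "significant discrepancy" criterion: if the vote margin falls
    below it, no label is assigned and the image is flagged.
    """
    counts = Counter(votes)
    bacterial = counts["bacterial"]
    nonbacterial = counts["nonbacterial"]
    if abs(bacterial - nonbacterial) < min_margin:
        return None, True
    label = "bacterial" if bacterial > nonbacterial else "nonbacterial"
    return label, False


# Made-up votes from six reviewers (the average panel size in the text).
print(majority_vote(["bacterial"] * 4 + ["nonbacterial"] * 2))  # ('bacterial', False)
print(majority_vote(["bacterial"] * 3 + ["nonbacterial"] * 3))  # (None, True)
```

Raising `min_margin` would send close, but not tied, votes to the additional reviewer as well.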