Data Generation and Processing
The dataset was sourced from Milstein, Gresch, et al.43, a study that tested different disinfectants against CWD and consisted of the RT-QuIC analysis of swab samples collected after disinfection of various surfaces. The portion of the dataset used for this study comprised 8028 individual wells, representing 573 samples, each prepared with 2-3 dilutions and assayed in 4-8 technical replicates/reactions. Data from a total of 7331 sample reactions and 984 control reactions were used to evaluate the ML models. Among the 573 samples, 19 were withheld due to ambiguous labels, having been classified as negative by statistical tests but positive by a “blind” human evaluation; these complex samples were set aside for additional characterization. The final dataset used for training and initial testing of the models included 7027 sample wells and 984 control wells. Each well was characterized by 65 fluorescence readings taken at 45-minute intervals over a 48-hour run. Because of this sourcing, the ground-truth CWD positive/negative status of each sample was not directly known, owing to the introduction of disinfectant and the indirect nature of the swabbing method used to generate the RT-QuIC data (Fig. 7). The “true” sample identities were instead determined by experienced researchers using statistical assessments, providing an excellent opportunity to compare the performance of AI solutions against existing methods for identifying CWD prions in RT-QuIC data.
Three labels were introduced for this study based on the results of each RT-QuIC reaction: negative, false positive, and true positive. Each label is defined by the sample identity and the threshold defined in the original study43. A negative label represents a reaction, involving either a negative or a positive sample, that never crossed the threshold (i.e., was identified as negative by analysis of the RT-QuIC results). A false positive label indicates a reaction that crossed the threshold despite involving a sample identified as CWD negative in the Milstein, Gresch, et al. study43. A true positive label indicates a reaction that crossed the threshold and involved a sample identified as CWD positive in the original study43. In this dataset, there were 6781 negative reactions, 239 false positive reactions, and 991 true positive reactions.
This formation of the dataset imposes a few limitations on the utility of the models evaluated in this study. Firstly, without knowing the identity or “ground truth” of the samples, these models can only be compared against the labels of annotators and, in the case of the supervised models, will inherit any implicit biases held by the annotators. Additionally, the ambiguity in sample identity limits the scope of the results; rather than classifying a reaction or sample as a whole, this method is limited to classifying each replicate individually based on its annotation. While this individual replicate identity, or well-level identity, can inform sample identity, this limitation prevents the model from being an all-inclusive diagnostic tool. Next, there is additional classification difficulty arising from the extreme similarity between some false positive samples and true positive samples. A false positive RT-QuIC reaction is one that was identified as positive but is known to be CWD negative, resulting from false seeding. As such, many of the false positive reactions were difficult to distinguish from true positive reactions (Fig. 8). Despite these limitations, the models still learned meaningful patterns that could be invaluable to future sample-level identification and serve as a proof of concept for models that could eventually surpass current annotator/metrics-based classification methods.
In addition to this dataset, a smaller, external validation dataset was used to assess the validity of these methods by testing them on data generated completely independently from the training dataset. It featured 122 negative reactions, 15 false positive reactions, and 31 positive reactions, with 156 wells sourced from samples and 12 being control reactions. For this dataset, the samples were derived directly from tissue samples (including ear, muscle, and blood), making the labels more reliable27,44,45.
Every method chosen for this study contributes to two goals: (1) to evaluate classification methods based on prevailing knowledge of the analysis of RT-QuIC data, and (2) to create a new method with the automation capabilities that machine learning represents. These two goals were supported by two datasets—a dataset of precalculated metrics and a raw fluorescence dataset (Fig. 1c). The first of these datasets, in line with the first goal of the study, consisted of four features extracted from the raw fluorescence values of each well. The feature set included the time to a precalculated threshold (TTT), the rate of amyloid formation (RAF), the maximum slope of the fluorescence curve (MS), and the maximum point ratio (MPR) between the smallest and largest fluorescence readings. These features were selected for their ability to characterize the curves RT-QuIC generates and for their relevance to the current method of using a precalculated threshold for classifying a test result as positive or negative. RAF and TTT are reciprocals of one another, with TTT measuring the “time required for fluorescence to reach twice the background fluorescence”43 and RAF being the rate at which this threshold is met. Both were included in the metrics dataset to give the feature selection algorithms used in this study the freedom to select whichever combination they deemed best. The inclusion of RAF also gave models an indication of the dilution factor of the sample in the reaction, as the two features are linearly related34. Additionally, reactions that did not reach the threshold were set to a TTT of 0, as this value is not physically realizable and works better for scaling than choosing some other large value that would need to depend on the runtime of the assay. MS was calculated with a sliding window with a width of 3.75 hours or 6 cycles.
By finding the difference between the fluorescence reading at a current point and one taken 3.75 hours previously, the entire curve can be characterized as a series of slopes, the maximum of which is selected. MPR was determined by dividing the maximum fluorescence reading by the background fluorescence reading43. In contrast to this first, distilled dataset, the second dataset consisted of the raw fluorescence readouts, preserving all the information obtained from a particular run for the algorithms to extract. Once the two datasets were generated in the aforementioned formats, a few universal preprocessing steps were applied to ensure the data was ready for machine learning.
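As a concrete illustration, the four metrics can be computed from a single well's trace roughly as follows. This is a minimal sketch with our own function and parameter names; the original study's exact implementation (e.g., how the background and window are defined) may differ.

```python
import numpy as np

def extract_metrics(trace, threshold, interval_h=0.75, window_pts=6):
    """Sketch of the four per-well metrics (TTT, RAF, MS, MPR).

    `trace` holds one well's 65 fluorescence readings taken every 45 min;
    `threshold` is the precalculated positivity threshold (e.g., twice the
    background fluorescence). Names and conventions here are illustrative.
    """
    trace = np.asarray(trace, dtype=float)
    times = np.arange(len(trace)) * interval_h        # hours since run start

    crossed = np.nonzero(trace >= threshold)[0]
    if crossed.size and times[crossed[0]] > 0:
        ttt = times[crossed[0]]                       # time to threshold (h)
        raf = 1.0 / ttt                               # rate of amyloid formation
    else:
        ttt = 0.0                                     # sentinel: never crossed
        raf = 0.0

    # Maximum slope over the sliding window: 6 readings span 5 intervals = 3.75 h
    lag = window_pts - 1
    diffs = trace[lag:] - trace[:-lag]
    ms = diffs.max() / (lag * interval_h)             # fluorescence units per hour

    # Maximum point ratio: peak reading over the background (first) reading
    mpr = trace.max() / trace[0]
    return ttt, raf, ms, mpr
```

For a linearly rising toy trace, the function returns the expected threshold-crossing time and window slope; a flat trace yields the TTT = 0 sentinel.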
Firstly, RT-QuIC data in its raw form is not well suited to machine learning due to the wide variability in absolute fluorescence. Different machines and concentrations, for example, can cause fluorescence curves to reach higher values in some plates than in others, making it difficult to derive meaningful patterns in the larger dataset. The Scikit-Learn (sklearn) Standard Scaler54 was used to remedy this, centering the means at 0 and scaling the dataset to unit variance. This was done separately on the training and testing partitions of the data. Secondly, as this study focused on the behavior of individual wells, it was crucial to remove any patterns in the data that could be derived from the ordering of the dataset. To account for this, the samples were shuffled into a random order using tools in the Numpy package55.
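These two preprocessing steps might look like the following sketch. The arrays are toy stand-ins for the real fluorescence matrices, and, as stated above, each partition is standardized on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-ins for the raw fluorescence matrices (wells x 65 timesteps);
# the real arrays come from the RT-QuIC plate-reader exports.
X_train = rng.normal(loc=5000, scale=800, size=(100, 65))
X_test = rng.normal(loc=7000, scale=1200, size=(25, 65))

# Standardize each partition separately (zero mean, unit variance),
# as described above.
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)

# Shuffle the training wells with NumPy so no ordering pattern survives
# (labels, not shown here, would be permuted with the same indices).
perm = rng.permutation(len(X_train))
X_train = X_train[perm]
```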
Creation of training/testing sets
All of the models utilized distinct training and testing sets to ensure the models could generalize beyond the training data. To accomplish this, 20% of the dataset was selected at random and withheld from training to be used in the evaluation of the models. The separation was done at the sample level to ensure different reactions from the same sample were not present in both the training and testing sets. The model predictions on this testing dataset are the source of the evaluation metrics and plots generated for each model. All of the models were configured to use the same data for training and testing to allow for direct comparison.
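One way to realize this sample-level 80/20 split is Scikit-Learn's `GroupShuffleSplit`, sketched below with invented sizes; the study's exact splitting tooling is not specified in the text.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Toy data: 200 wells from 40 samples, 5 replicate wells each
# (hypothetical sizes chosen only for this example).
X = rng.normal(size=(200, 65))
sample_ids = np.repeat(np.arange(40), 5)

# Hold out 20% of *samples* (not wells) so replicates of one sample
# never straddle the train/test boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=sample_ids))
```

Grouping by sample ID guarantees that no reaction in the test set shares a source sample with any training reaction.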
Principal component analysis
Before training two of the model types, SVM and K-Means, a Principal Component Analysis (PCA) was applied to the raw dataset. The PCA implementation used in this study was part of the sklearn package54. PCA is a linear dimensionality reduction method that extracts relationships in a dataset and creates a new set of features based on these relationships. The algorithm computes a score for every pair of features describing how related they are to one another; this score is called the covariance. These scores are stored in a matrix whose eigenvectors transform the original features into a set of linear combinations of features, called Principal Components (PCs). The eigenvalue associated with each eigenvector represents the overall variance captured by its PC, so PCA can select the components with the largest eigenvalues to represent the dataset56. Typically, the number of PCs to incorporate into the final dataset is identified using a scree plot, finding the point at which adding further components yields greatly diminished improvement in explained variance (also called finding the “elbow”). Using this method on the PCs obtained from the raw data, it was determined that 4 or 5 PCs were sufficient to represent the dataset. 4 PCs were selected for training some of the models, as this aligned with the 4 features of the metrics dataset and captured nearly 90% of the variance in the dataset. Both K-Means and SVMs train poorly on data with large numbers of features (such as the raw data with 65 timesteps used in this study), making PCA an effective preprocessing step for these algorithms. Additionally, PCA concentrates the variance of the data into a few components, making the training of K-Means and SVM more efficient.
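This PC-selection process can be sketched on synthetic curves as follows; all data here is invented for illustration, loosely mimicking sigmoidal RT-QuIC traces.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Toy stand-in for the raw data: 300 wells x 65 timesteps built from a
# few sigmoidal temporal patterns plus noise.
t = np.linspace(0, 48, 65)
curves = np.stack([1.0 / (1.0 + np.exp(-(t - c))) for c in (12, 24, 36)])
X = rng.random((300, 3)) @ curves + rng.normal(scale=0.05, size=(300, 65))

pca = PCA(n_components=10).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Scree/"elbow" heuristic: keep the first k PCs once the cumulative
# explained variance flattens; the study settled on 4 PCs (~90%).
k = int(np.searchsorted(cumvar, 0.90) + 1)
X_reduced = PCA(n_components=4).fit_transform(X)
```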
In contrast, deep learning models, like the MLP, inherently handle high-dimensional data through their robust feature extraction capabilities1. Therefore, we did not apply PCA when training the MLP model, allowing it to learn directly from the raw data.
Unsupervised Learning Approaches
K-Means Clustering
A K-Means model was trained on both the raw dataset and the dataset of metrics, yielding a total of two models. K-Means is a powerful clustering utility and one of the best-known unsupervised learning algorithms. It works by identifying cluster centers and iteratively adjusting their locations to optimally separate the dataset into distinct sets. The K-Means algorithm evaluated in this study was developed using the Scikit-Learn package in Python and used Lloyd's K-Means Clustering Algorithm (LKCA)57. This algorithm uses an iterative optimization approach to approximate cluster centers for data with arbitrary numbers of features, making it more practical for multidimensional datasets like RT-QuIC reactions than manually calculated approaches to clustering. However, like many machine learning algorithms, LKCA frequently finds solutions that are only locally optimal. Practically, this means that the solution K-Means finds for a clustering task is highly influenced by the initial conditions of the algorithm. To generate these initial conditions, the model was set to use a random initialization, selecting the features of random samples as initial cluster centers, which were iteratively adjusted by LKCA to find an optimal solution. LKCA was allowed to run up to 500 iterations or until the algorithm converged on a solution, whichever came first. Due to the high initialization sensitivity of K-Means, the algorithm was configured to repeat this random initialization 150 times, keeping the best-performing result. The goal was to greatly increase the odds that K-Means started from samples that nearly epitomize the classes present in the data, such as a clearly negative and a clearly positive reaction. Combined with the optimization of LKCA, this increases the likelihood of finding a globally optimal solution. Considering the high efficiency of the K-Means algorithm, this process takes little time.
The raw data was passed through a PCA algorithm to highlight the variance in the data and reduce dimensionality.
The model was configured to find three clusters, with the idea that one cluster would represent positive RT-QuIC reactions, another would represent false positive reactions, and the final cluster would represent negative reactions. As most of the analysis used to evaluate the models was not compatible with multiclass output, the model output was converted to a 0 or a 1, corresponding to negative or positive. False positive classifications were treated as negative except when evaluating performance on false positives. Since the model was not given labels, its assignment of cluster indices to classes was arbitrary. The rest of the evaluation required a uniform system of outputs, so a standard was selected (1 represented positive, and 0 represented negative). Any model that performed below random guessing had its labels flipped to match this standard.
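The clustering configuration described above maps onto Scikit-Learn roughly as follows. The data is a toy stand-in for the PCA-reduced wells; the parameter values (three clusters, random initialization, 150 restarts, up to 500 Lloyd iterations) come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Toy PCA-reduced wells: three separated blobs standing in for the
# negative, false positive, and positive reaction groups.
X = np.vstack([
    rng.normal(loc=(0, 0, 0, 0), scale=0.3, size=(80, 4)),
    rng.normal(loc=(3, 3, 0, 0), scale=0.3, size=(20, 4)),
    rng.normal(loc=(6, 0, 3, 0), scale=0.3, size=(30, 4)),
])

# Scikit-Learn's KMeans runs Lloyd's algorithm; n_init=150 repeats the
# random initialization and keeps the lowest-inertia result.
km = KMeans(n_clusters=3, init="random", n_init=150, max_iter=500,
            random_state=0).fit(X)
labels = km.labels_
```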
Supervised learning approaches
Support vector machines
While SVMs are elegant in their simplicity, they often suffer from poor performance on datasets with many features (as in our case, with 65 time steps). Considering this, the SVM trained on raw data was considered a prime candidate for the PCA preprocessing step, which was applied to the raw dataset via the Scikit-Learn package54 prior to training and testing, creating a much smaller feature space for the SVM to learn. The SVM was constructed using the Scikit-Learn implementation54. The model uses a radial basis function (RBF)58 kernel. RBFs are a family of functions that are symmetric about a center, such as a Gaussian distribution. These functions allow SVMs to classify data that is not separable by a polynomial function (such as concentric circles), making the SVM more versatile but also more complex58.
The SVM is inherently a binary classifier and does not natively support multiple classification boundaries. While multiclass classification is possible by training multiple binary models, the implementation in this study did not use that approach, in order to simplify the training and testing of the model. The PCA/SVM combination was run on the raw data, and the SVM alone was run on the feature-extracted data.
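The PCA/SVM combination might be assembled as a pipeline like this sketch; the task and data are invented, and hyperparameters beyond the RBF kernel and 4 components are left at illustrative defaults.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy binary task: 65-timestep "curves" where positive wells ramp upward.
n = 200
y = rng.integers(0, 2, size=n)
t = np.linspace(0.0, 1.0, 65)
X = rng.normal(scale=0.2, size=(n, 65)) + np.outer(y, t)

# PCA down to 4 components ahead of the RBF-kernel SVM, echoing the
# preprocessing described above.
clf = make_pipeline(StandardScaler(), PCA(n_components=4),
                    SVC(kernel="rbf")).fit(X, y)
train_acc = clf.score(X, y)
```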
Deep learning: multilayer perceptron
Multilayer Perceptrons (MLPs) are one of the simpler types of Deep Neural Networks1 (DNNs) developed for classification tasks. MLPs consist of a set of 1-dimensional layers of neurons, featuring an input layer, an output layer, and at least one hidden layer in between. The weights between each layer are used to identify patterns that correspond to the training task42.
The MLP was selected for this task due to its natural compatibility with the dimensions of the problem. Since the study examined each replicate individually, the data was one-dimensional, consisting of just the fluorescence reading at each of the 65 time steps. MLPs are able to compare each timestep with every other timestep, allowing them to extract a wide range of features from the dataset, even when these features are not linearly separable by class.
The MLP that produced the results in this paper used the Keras package in Python. Keras is a machine learning package that simplifies model development and is part of the Tensorflow package59. The model consisted of 3 hidden layers. The input layer matched the input shape (65 nodes, one for each timestep). The three hidden layers used 65 neurons each, matching the shape of the input, which proved effective in testing. Each of these layers used a Rectified Linear Unit (ReLU) activation function, a popular activation function that zeroes out negative inputs60. The model, while originally tested as a binary classifier, is configured with 3 output neurons, each corresponding to a class (negative, false positive, or positive). This forces the MLP to learn distinctions between positive and false positive classes in a more meaningful way. The model uses a softmax activation function in the final layer, converting the input to this layer into a score from 0 to 1 for each of the three classes. These scores sum to 1 and can be interpreted as the model's “confidence” that a reaction belongs to a given class. With this information, the highest-scoring class can be taken as the model's prediction, or the scores can be processed further to derive new insights about the data.
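A minimal Keras sketch of this architecture follows; the optimizer and loss shown are our assumptions, as the text does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Architecture as described above: 65 inputs (one per timestep), three
# ReLU hidden layers of 65 units, and a 3-way softmax output over the
# negative / false positive / positive classes.
model = keras.Sequential([
    layers.Input(shape=(65,)),
    layers.Dense(65, activation="relu"),
    layers.Dense(65, activation="relu"),
    layers.Dense(65, activation="relu"),
    layers.Dense(3, activation="softmax"),
])

# Optimizer and loss are assumptions, not stated in the text.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```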
Before the model is trained on the data, the inputs are first processed according to the steps outlined in the Data Generation and Processing section of the main manuscript. An important note, however, is that the MLP is the only model type explored in this study that does not require additional feature extraction. Neural networks are themselves feature extractors, so attempting to train the model on PCA extracted features or metrics would limit the potential features from which the model can extract information. This represents a massive advantage DNNs have over other ML methods, as DNNs generally do not require time-consuming feature engineering to achieve good performance1.
Training an MLP involves passing training data through the network in groups called batches; for this implementation, each batch contained 64 reactions. Once the network produces predictions on a batch, the predictions are compared to the true classes using an error function (more commonly known as the loss function). This loss is then used to update the weights between the neurons of the network through a process called backpropagation1,61. This process continues until all the training data has been passed through the network, completing one iteration, or epoch, of training. The MLP used in this study was set to train for a maximum of 100 epochs. While training, a portion of the training dataset (10% in this case) is withheld for validation of the model after each epoch. Should the model begin to overfit, memorizing the training data rather than learning generally applicable patterns and thereby losing the ability to generalize, the loss on the validation data will begin to increase even as the training loss decreases. The validation loss was monitored during training (Fig. 9) and used to limit overfitting: at each epoch, the model was saved as a checkpoint if its validation loss improved on the best previous value, and at the end of training the model was restored to the checkpoint that performed best on the validation set. For the MLP used in this study, however, the model always improved on the validation set, so the checkpoint that was restored was the final state of the model after the 100th epoch.
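This checkpointing scheme might be sketched with a Keras `ModelCheckpoint` callback as follows. The stand-in model, data, epoch count, and file name are our own choices so the sketch runs end to end; the study trained its full model for 100 epochs.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Stand-in model and data; the real architecture and preprocessed
# fluorescence data are described elsewhere in the text.
model = keras.Sequential([
    layers.Input(shape=(65,)),
    layers.Dense(65, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

rng = np.random.default_rng(0)
X_train = rng.normal(size=(128, 65)).astype("float32")
y_train = rng.integers(0, 3, size=128)

# Save weights only when the validation loss improves, then restore the
# best checkpoint after training (file name is our own choice).
ckpt = callbacks.ModelCheckpoint("best.weights.h5", monitor="val_loss",
                                 save_best_only=True, save_weights_only=True)
history = model.fit(X_train, y_train, batch_size=64, epochs=3,  # 100 in the study
                    validation_split=0.1, callbacks=[ckpt], verbose=0)
model.load_weights("best.weights.h5")
```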
Each reaction is obtained from the RT-QuIC analysis of CWD positive/negative samples obtained through the experimental design process outlined in the Milstein, Gresch, et al. paper43. The fluorescence readings from this analysis are then processed into summarized metrics to be used as a comparative dataset. Both datasets are split into a training and testing set. These datasets are used respectively to train and test the models, with the testing data providing the basis for evaluation.
a A single positive annotated reaction plot, b a single false positive annotated reaction plot, and c a single negative annotated reaction plot are shown. The plots demonstrate an example visualization of the raw data used in this study and allow for an intuitive comparison of reactions with different class identities.
Once the model was trained, outright classification proved a suboptimal representation of the data in this study. The false positive and positive samples were not clearly distinguished by the model, meaning any well with a curve would score highly in both the false positive and positive classes. This behavior was undesirable, as many positives were mistaken for false positives and vice versa. To rectify this issue, a weighted average was applied to the output to obtain a single, unified score, calculated from the model output using the function shown in Eq. 1. In the equation, Yneg is the score for the negative class, Yfp is the score for the false positive class, Ypos is the score for the positive class, and Yout is the overall score used in the results. This smoothed the poorly defined class boundary into a spectrum from 0 to 2 (noting that the class scores produced by the model sum to 1), with any score less than 0.5 labeled a negative prediction, any value between 0.5 and 1.5 labeled a false positive prediction, and anything above 1.5 labeled a positive prediction.
$${Y}_{{out}}=0\times {Y}_{{neg}}+1\times {Y}_{{fp}}+2\times {Y}_{{pos}}$$
Equation 1: Weighted Average Applied to MLP Output
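Eq. 1 and the thresholding rule above can be expressed directly; this is a small sketch with our own function names.

```python
import numpy as np

def weighted_score(class_scores):
    """Eq. 1: Y_out = 0*Y_neg + 1*Y_fp + 2*Y_pos for one softmax vector."""
    return float(np.dot(class_scores, [0.0, 1.0, 2.0]))

def label_from_score(score):
    """Thresholds from the text: <0.5 negative, 0.5-1.5 false positive,
    >1.5 positive."""
    if score < 0.5:
        return "negative"
    return "false positive" if score <= 1.5 else "positive"
```

For example, a softmax output of (0.05, 0.15, 0.80) for (negative, false positive, positive) yields a weighted score of 1.75 and a positive label.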
Evaluation criteria
Evaluation of the models primarily considered five metrics: accuracy, sensitivity/recall, specificity, precision, and F1-score. The accuracy metric corresponds to how many wells were classified the same as the human labels. As this study used 2 classes, the worst possible accuracy score should be 50% (when adjusted for class weight), corresponding to random guessing. A significantly lower accuracy would likely be the result of a procedural issue, so for each accuracy metric, it is imperative to consider the improvement over 50% as the true success rate of the model. While accuracy evaluates the dataset holistically, sensitivity (or recall) evaluates the performance of the model in identifying positive samples correctly. Specifically, sensitivity corresponds to the fraction of human-labeled positive wells that were classified correctly; low sensitivity corresponds to a model that frequently classifies annotator-labeled positive samples as negative. Specificity, in contrast, is the fraction of human-labeled negative wells that were classified correctly; low specificity corresponds to a model that frequently classifies negative samples as positive. Another metric, precision, provides additional insight into how well a model distinguishes between classes. Precision is defined as the fraction of model-labeled positive samples that were actually positive. This metric is of particular interest in this application, as the number of negative reactions in the dataset is much greater than the number of positive reactions, an environment in which the previous metrics can remain high despite a poorly performing model. In particular, precision reflects a model's rate of false positive predictions, with low precision indicating a high number of false positives relative to true positives, making a positive prediction less meaningful.
F1-score is a combination of sensitivity and precision, creating a metric that evaluates the performance of a model on positive samples considering these factors together. While less useful for identifying patterns of mistakes in models, F1-score can be used to put sensitivity and precision in context together. F1-score is calculated as the harmonic mean of precision and sensitivity, meaning a particularly low precision or sensitivity will create a low F1-score.
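These five metrics can be reproduced with Scikit-Learn on a toy set of binary predictions (labels invented for illustration). Specificity has no dedicated sklearn helper, so it is derived from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy annotator labels vs. model predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

accuracy = accuracy_score(y_true, y_pred)     # fraction labeled the same
sensitivity = recall_score(y_true, y_pred)    # recall on the positive class
precision = precision_score(y_true, y_pred)   # correct among predicted positives
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two above

# Specificity is recall on the negative class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
```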



