Model metrics
When PANDORA builds a machine learning model, it provides a set of metrics to help you evaluate its performance. Understanding these metrics is crucial for knowing how well your model is working and whether it's suitable for your research questions in systems vaccinology and immunology.
The Golden Rule: Train vs. Test Performance
You'll notice many metrics have a Train... prefix (e.g., TrainAccuracy) and a version without it (e.g., Accuracy).

- Train... Metrics: Performance on the data the model was trained on.
- Non-Train... Metrics (Test/Validation Metrics): Performance on new, unseen data. These are the most important for judging real-world performance!
- Ideal Scenario: High Train... scores AND high non-Train... scores, with both sets of scores being similar. This means your model has learned well and generalizes to new data.
- Overfitting: High Train... scores but much lower non-Train... scores. The model learned the training data too well (including its noise) and won't perform well on new samples (see the sketch after this list).
- Underfitting: Low scores on both Train... and non-Train... metrics. The model is too simple and hasn't learned the underlying patterns.
- TrainMean_... Metrics: These are typically averages from cross-validation during training. They give a more robust estimate of training performance than a single training run.
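To make the train-vs.-test comparison concrete, here is a minimal sketch in Python using scikit-learn and synthetic data (illustrative only; the model, dataset, and variable names are assumptions, not PANDORA's internal code). It computes a training score, a cross-validated mean, and a held-out test score for the same model:

```python
# Illustrative sketch: comparing training, cross-validated, and held-out test
# accuracy to spot overfitting or underfitting. Not PANDORA internals.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))     # in the spirit of TrainAccuracy
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()  # in the spirit of TrainMean_... averages
test_acc = accuracy_score(y_test, model.predict(X_test))        # in the spirit of the non-Train (test) metric

print(f"train={train_acc:.2f}  cv_mean={cv_acc:.2f}  test={test_acc:.2f}")
# A large gap between the train and test scores suggests overfitting;
# low scores everywhere suggest underfitting.
```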
I. Core metrics for classification
These metrics often depend on a chosen probability threshold (usually 0.5) to decide the predicted class. They are derived from a "confusion matrix", which counts:

- True Positives (TP): Correctly predicted positive (e.g., correctly identified as "Responder").
- True Negatives (TN): Correctly predicted negative (e.g., correctly identified as "Non-Responder").
- False Positives (FP): Incorrectly predicted positive (e.g., a "Non-Responder" mistakenly called a "Responder").
- False Negatives (FN): Incorrectly predicted negative (e.g., a "Responder" mistakenly called a "Non-Responder").
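As an illustration of how these four counts are obtained, here is a minimal Python sketch using scikit-learn on toy labels (1 = "Responder", 0 = "Non-Responder"). The labels are made up for illustration; this is not PANDORA's internal implementation.

```python
# Toy example: derive TP, TN, FP, FN from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes (1 = Responder)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=3  FP=1  FN=1
```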
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| Accuracy (TrainAccuracy) | Overall, what proportion of predictions were correct? | 0 to 1 | Higher | "How often is the model right?" | Can be misleading if your classes are imbalanced (e.g., 90% Non-Responders, 10% Responders). |
| Balanced Accuracy (TrainBalanced_Accuracy, TrainMean_Balanced_Accuracy) | The accuracy averaged across classes. | 0 to 1 | Higher | "How well does the model perform on average for each group?" | Much better than plain Accuracy for imbalanced datasets. A score of 0.5 is like random guessing. |
| Precision / Positive Predictive Value (PPV) (TrainPrecision, TrainMean_Precision, TrainPos_Pred_Value) | When the model predicts "positive" (e.g., "Responder"), how often is it correct? | 0 to 1 | Higher | "Of those predicted as 'Responder', how many actually were?" | Important when the cost of a False Positive is high (e.g., wrongly starting an expensive follow-up). |
| Recall / Sensitivity / True Positive Rate (TPR) (TrainRecall, TrainMean_Recall, TrainMean_Sensitivity) | Of all the actual "positives", how many did the model correctly identify? | 0 to 1 | Higher | "Of all actual 'Responders', how many did we find?" | Crucial when missing a positive is costly (e.g., failing to identify individuals who would benefit from a vaccine). |
| F1-Score (TrainF1, TrainMean_F1) | A balance of Precision and Recall (their harmonic mean). | 0 to 1 | Higher | "How good is the model at both finding positives and being right when it does?" | Useful when you care about both Precision and Recall, especially with imbalanced classes. |
| Specificity / True Negative Rate (TNR) (TrainSpecificity, TrainMean_Specificity) | Of all the actual "negatives" (e.g., "Non-Responders"), how many did the model correctly identify? | 0 to 1 | Higher | "Of all actual 'Non-Responders', how many did we correctly identify?" | Important when correctly identifying negatives is key. |
| Negative Predictive Value (NPV) (TrainNeg_Pred_Value) | When the model predicts "negative", how often is it correct? | 0 to 1 | Higher | "Of those predicted as 'Non-Responder', how many actually were?" | Complements PPV. |
| Detection Rate | The proportion of the entire dataset that are true positives. | 0 to 1 | Higher | "What fraction of all samples were correctly identified as positive?" | Influenced by how common the positive class is. |
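The following sketch shows how each of these threshold-dependent metrics falls out of the four confusion-matrix counts. It uses scikit-learn on toy labels purely for illustration; it is not how PANDORA computes its columns, and the example data are made up.

```python
# Toy example: threshold-dependent metrics from hard class predictions
# (1 = "Responder", 0 = "Non-Responder").
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy            :", accuracy_score(y_true, y_pred))           # (TP+TN) / all samples
print("Balanced Accuracy   :", balanced_accuracy_score(y_true, y_pred))  # mean of Recall and Specificity
print("Precision (PPV)     :", precision_score(y_true, y_pred))          # TP / (TP+FP)
print("Recall (Sensitivity):", recall_score(y_true, y_pred))             # TP / (TP+FN)
print("F1-Score            :", f1_score(y_true, y_pred))                 # harmonic mean of Precision and Recall
print("Specificity (TNR)   :", tn / (tn + fp))                           # TN / (TN+FP)
print("NPV                 :", tn / (tn + fn))                           # TN / (TN+FN)
print("Detection Rate      :", tp / len(y_true))                         # TP / all samples
```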
II. Threshold-independent metrics
These metrics evaluate the model's ability to discriminate between classes across all possible classification thresholds, rather than just one.
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| AUC / ROC AUC (PredictAUC, TrainAUC) | Area under the Receiver Operating Characteristic curve. The ROC curve plots Recall (Sensitivity) against 1 - Specificity at all thresholds, so the AUC measures the model's ability to distinguish between classes. | 0.5 to 1 | Higher | "How well can the model tell the difference between a 'Responder' and a 'Non-Responder' across all possible cutoff points?" | 0.5 = random guessing, 1.0 = perfect separation. A good general measure of discriminative power. |
| prAUC / AUPRC (TrainprAUC) | Area under the Precision-Recall curve, which plots Precision against Recall at all thresholds. | Baseline to 1 | Higher | "How well can the model achieve high precision (correct positive predictions) and high recall (finding all positives) at the same time?" | More informative than ROC AUC for highly imbalanced datasets where the positive class is rare. The baseline is the proportion of positives in the data. |
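As a minimal illustration (again with scikit-learn and made-up probabilities, not PANDORA's own code), both areas are computed from predicted class probabilities rather than hard labels:

```python
# Toy example: threshold-independent metrics from predicted probabilities.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.1])  # predicted P("Responder")

print("ROC AUC:", roc_auc_score(y_true, y_prob))
# average_precision_score is a common way to summarize the precision-recall curve (prAUC/AUPRC).
print("prAUC  :", average_precision_score(y_true, y_prob))
print("prAUC baseline (proportion of positives):", y_true.mean())
```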
III. Other useful metrics
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| Kappa (Cohen's Kappa) | How much better the model's predictions are than random chance, accounting for class imbalance. | Approx. -1 to 1 | Higher | "How much better is the model than just guessing randomly?" | Good for imbalanced classes. 0 = no better than chance; values above 0.6 are often considered substantial agreement. |
| LogLoss (TrainlogLoss) | Logarithmic loss: how far the model's predicted probabilities are from the actual outcomes. It heavily penalizes confident wrong predictions. | 0 to ∞ | Lower | "How well do the model's predicted probabilities match the true outcomes?" | Directly optimized by many models (such as logistic regression). Good for evaluating how well calibrated the probabilities are. |
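For completeness, here is a small sketch (scikit-learn, toy data; not PANDORA's internal code) computing both metrics. Note that Kappa needs hard class predictions, while LogLoss needs predicted probabilities.

```python
# Toy example: Cohen's Kappa from predicted labels, LogLoss from probabilities.
from sklearn.metrics import cohen_kappa_score, log_loss

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]                      # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.1]  # predicted P(class 1)

print("Kappa  :", cohen_kappa_score(y_true, y_pred))  # 0 = chance-level agreement
print("LogLoss:", log_loss(y_true, y_prob))            # confident wrong predictions are penalized heavily
```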
How to know if your model is "Good"?
There's no single magic number. Here's how to think about it:
1. Define "Good" for YOUR Research Question:
   - In vaccinology, is it more critical to find all potential responders, even if you misclassify some non-responders (prioritize Recall/Sensitivity)?
   - Or is it more important that when you claim someone is a responder, you are very likely correct, even if you miss some (prioritize Precision)?
   - Are your groups (e.g., responders vs. non-responders) imbalanced in size? If yes, Accuracy is misleading! Focus on Balanced Accuracy, F1-Score, prAUC, Kappa, and Recall/Specificity for each class.
2. Look at the Test/Validation Metrics (non-Train...): These tell you how your model will likely perform on new, unseen individuals.
3. Compare to a Baseline: How would a very simple model perform (e.g., always predicting the majority class, or random guessing)? Your PANDORA model should be significantly better (see the sketch after this list).
4. Don't Rely on a Single Metric: Look at a collection of relevant metrics. A model might have high Accuracy but terrible Recall for a rare but important group.
5. Consider the Trade-offs: Often, improving Precision can lower Recall, and vice versa. The AUC metrics help evaluate performance independent of picking a specific threshold, while metrics like F1-Score try to balance this trade-off.
6. Iterate and Refine: Use these metrics to guide further model improvements, feature selection, or even how you define your groups.
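Point 3 above (comparing against a baseline) can be made concrete with a short sketch. This uses scikit-learn's DummyClassifier on synthetic data purely for illustration; the models and data here are assumptions, not part of PANDORA.

```python
# Illustrative sketch: compare a real model against a naive baseline that
# always predicts the majority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# On imbalanced data the baseline's plain accuracy looks deceptively high,
# so compare balanced accuracy instead.
print("Baseline balanced accuracy:", balanced_accuracy_score(y_test, baseline.predict(X_test)))
print("Model balanced accuracy   :", balanced_accuracy_score(y_test, model.predict(X_test)))
```

The majority-class baseline lands at a balanced accuracy of about 0.5, which is the bar any useful model should clearly exceed.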