Model metrics
When PANDORA builds a machine learning model, it provides a set of metrics to help you evaluate its performance. Understanding these metrics is crucial for knowing how well your model is working and whether it's suitable for your research questions in systems vaccinology and immunology.
The Golden Rule: Train vs. Test Performance
You'll notice many metrics have a Train... prefix (e.g., TrainAccuracy) and a version without it (e.g., Accuracy).

- Train... Metrics: Performance on the data the model was trained on.
- Non-Train... Metrics (Test/Validation Metrics): Performance on new, unseen data. These are the most important for judging real-world performance!
- Ideal Scenario: High Train... scores AND high non-Train... scores, with both sets of scores being similar. This means your model has learned well and generalizes to new data.
- Overfitting: High Train... scores but much lower non-Train... scores. The model learned the training data too well (including its noise) and won't perform well on new samples (see the sketch after this list).
- Underfitting: Low scores on both Train... and non-Train... metrics. The model is too simple and hasn't learned the underlying patterns.
- TrainMean_... Metrics: These are typically averages from cross-validation during training. They give a more robust estimate of training performance than a single training run.
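To make the train-vs.-test comparison concrete, here is a minimal sketch in Python using scikit-learn and synthetic data (illustrative only; the model, dataset, and variable names are assumptions, not PANDORA's internal code). It computes a training score, a cross-validated mean, and a held-out test score for the same model:

```python
# Illustrative sketch: comparing training, cross-validated, and held-out test
# accuracy to spot overfitting or underfitting. Not PANDORA internals.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))     # in the spirit of TrainAccuracy
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()  # in the spirit of TrainMean_... averages
test_acc = accuracy_score(y_test, model.predict(X_test))        # in the spirit of the non-Train (test) metric

print(f"train={train_acc:.2f}  cv_mean={cv_acc:.2f}  test={test_acc:.2f}")
# A large gap between the train and test scores suggests overfitting;
# low scores everywhere suggest underfitting.
```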
I. Core metrics for classification
These metrics often depend on a chosen probability threshold (usually 0.5) to decide the predicted class. They are derived from a "confusion matrix", which counts:

- True Positives (TP): Correctly predicted positive (e.g., correctly identified as "Responder").
- True Negatives (TN): Correctly predicted negative (e.g., correctly identified as "Non-Responder").
- False Positives (FP): Incorrectly predicted positive (e.g., a "Non-Responder" mistakenly called a "Responder").
- False Negatives (FN): Incorrectly predicted negative (e.g., a "Responder" mistakenly called a "Non-Responder").
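As an illustration of how these four counts are obtained, here is a minimal Python sketch using scikit-learn on toy labels (1 = "Responder", 0 = "Non-Responder"). The labels are made up for illustration; this is not PANDORA's internal implementation.

```python
# Toy example: derive TP, TN, FP, FN from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes (1 = Responder)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=3  FP=1  FN=1
```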
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| Accuracy (TrainAccuracy) | Overall, what proportion of predictions were correct? | 0 to 1 | Higher | "How often is the model right?" | Can be misleading if your classes are imbalanced (e.g., 90% Non-Responders, 10% Responders). |
| Balanced Accuracy (TrainBalanced_Accuracy, TrainMean_Balanced_Accuracy) | The accuracy averaged across classes. | 0 to 1 | Higher | "How well does the model perform on average for each group?" | Much better than plain Accuracy for imbalanced datasets. A score of 0.5 is like random guessing. |
| Precision / Positive Predictive Value (PPV) (TrainPrecision, TrainMean_Precision, TrainPos_Pred_Value) | When the model predicts "positive" (e.g., "Responder"), how often is it correct? | 0 to 1 | Higher | "Of those predicted as 'Responder', how many actually were?" | Important when the cost of a False Positive is high (e.g., wrongly starting an expensive follow-up). |
| Recall / Sensitivity / True Positive Rate (TPR) (TrainRecall, TrainMean_Recall, TrainMean_Sensitivity) | Of all the actual "positives", how many did the model correctly identify? | 0 to 1 | Higher | "Of all actual 'Responders', how many did we find?" | Crucial when missing a positive is costly (e.g., failing to identify individuals who would benefit from a vaccine). |
| F1-Score (TrainF1, TrainMean_F1) | A balance of Precision and Recall (their harmonic mean). | 0 to 1 | Higher | "How good is the model at both finding positives and being right when it does?" | Useful when you care about both Precision and Recall, especially with imbalanced classes. |
| Specificity / True Negative Rate (TNR) (TrainSpecificity, TrainMean_Specificity) | Of all the actual "negatives" (e.g., "Non-Responders"), how many did the model correctly identify? | 0 to 1 | Higher | "Of all actual 'Non-Responders', how many did we correctly identify?" | Important when correctly identifying negatives is key. |
| Negative Predictive Value (NPV) (TrainNeg_Pred_Value) | When the model predicts "negative", how often is it correct? | 0 to 1 | Higher | "Of those predicted as 'Non-Responder', how many actually were?" | Complements PPV. |
| Detection Rate | The proportion of the entire dataset that are true positives. | 0 to 1 | Higher | "What fraction of all samples were correctly identified as positive?" | Influenced by how common the positive class is. |
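The following sketch shows how each of these threshold-dependent metrics falls out of the four confusion-matrix counts. It uses scikit-learn on toy labels purely for illustration; it is not how PANDORA computes its columns, and the example data are made up.

```python
# Toy example: threshold-dependent metrics from hard class predictions
# (1 = "Responder", 0 = "Non-Responder").
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy            :", accuracy_score(y_true, y_pred))           # (TP+TN) / all samples
print("Balanced Accuracy   :", balanced_accuracy_score(y_true, y_pred))  # mean of Recall and Specificity
print("Precision (PPV)     :", precision_score(y_true, y_pred))          # TP / (TP+FP)
print("Recall (Sensitivity):", recall_score(y_true, y_pred))             # TP / (TP+FN)
print("F1-Score            :", f1_score(y_true, y_pred))                 # harmonic mean of Precision and Recall
print("Specificity (TNR)   :", tn / (tn + fp))                           # TN / (TN+FP)
print("NPV                 :", tn / (tn + fn))                           # TN / (TN+FN)
print("Detection Rate      :", tp / len(y_true))                         # TP / all samples
```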
II. Threshold-independent metrics
These metrics evaluate the model's ability to discriminate between classes across all possible classification thresholds, rather than just one.
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| AUC / ROC AUC (PredictAUC, TrainAUC) | Area under the Receiver Operating Characteristic curve. The ROC curve plots Recall (Sensitivity) against 1 - Specificity at all thresholds, so the AUC measures the model's ability to distinguish between classes. | 0.5 to 1 | Higher | "How well can the model tell the difference between a 'Responder' and a 'Non-Responder' across all possible cutoff points?" | 0.5 = random guessing, 1.0 = perfect separation. A good general measure of discriminative power. |
| prAUC / AUPRC (TrainprAUC) | Area under the Precision-Recall curve, which plots Precision against Recall at all thresholds. | Baseline to 1 | Higher | "How well can the model achieve high precision (correct positive predictions) and high recall (finding all positives) at the same time?" | More informative than ROC AUC for highly imbalanced datasets where the positive class is rare. The baseline is the proportion of positives in the data. |
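As a minimal illustration (again with scikit-learn and made-up probabilities, not PANDORA's own code), both areas are computed from predicted class probabilities rather than hard labels:

```python
# Toy example: threshold-independent metrics from predicted probabilities.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.1])  # predicted P("Responder")

print("ROC AUC:", roc_auc_score(y_true, y_prob))
# average_precision_score is a common way to summarize the precision-recall curve (prAUC/AUPRC).
print("prAUC  :", average_precision_score(y_true, y_prob))
print("prAUC baseline (proportion of positives):", y_true.mean())
```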
III. Other useful metrics
| Metric (PANDORA names) | What it measures | Range | Better | Question it answers | Notes |
|---|---|---|---|---|---|
| Kappa (Cohen's Kappa) | How much better the model's predictions are than random chance, accounting for class imbalance. | Approx. -1 to 1 | Higher | "How much better is the model than just guessing randomly?" | Good for imbalanced classes. 0 = no better than chance; values above 0.6 are often considered substantial agreement. |
| LogLoss (TrainlogLoss) | Logarithmic loss: how far the model's predicted probabilities are from the actual outcomes. It heavily penalizes confident wrong predictions. | 0 to ∞ | Lower | "How well do the model's predicted probabilities match the true outcomes?" | Directly optimized by many models (such as logistic regression). Good for evaluating how well calibrated the probabilities are. |
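For completeness, here is a small sketch (scikit-learn, toy data; not PANDORA's internal code) computing both metrics. Note that Kappa needs hard class predictions, while LogLoss needs predicted probabilities.

```python
# Toy example: Cohen's Kappa from predicted labels, LogLoss from probabilities.
from sklearn.metrics import cohen_kappa_score, log_loss

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]                      # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.1]  # predicted P(class 1)

print("Kappa  :", cohen_kappa_score(y_true, y_pred))  # 0 = chance-level agreement
print("LogLoss:", log_loss(y_true, y_prob))            # confident wrong predictions are penalized heavily
```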
How to know if your model is "Good"?
There's no single magic number. Here's how to think about it:
1. Define "Good" for YOUR Research Question:
   - In vaccinology, is it more critical to find all potential responders, even if you misclassify some non-responders (prioritize Recall/Sensitivity)?
   - Or is it more important that when you claim someone is a responder, you are very likely correct, even if you miss some (prioritize Precision)?
   - Are your groups (e.g., responders vs. non-responders) imbalanced in size? If yes, Accuracy is misleading! Focus on Balanced Accuracy, F1-Score, prAUC, Kappa, and Recall/Specificity for each class.
2. Look at the Test/Validation Metrics (non-Train...): These tell you how your model will likely perform on new, unseen individuals.
3. Compare to a Baseline: How would a very simple model perform (e.g., always predicting the majority class, or random guessing)? Your PANDORA model should be significantly better (see the sketch after this list).
4. Don't Rely on a Single Metric: Look at a collection of relevant metrics. A model might have high Accuracy but terrible Recall for a rare but important group.
5. Consider the Trade-offs: Often, improving Precision can lower Recall, and vice versa. The AUC metrics help evaluate performance independent of picking a specific threshold, while metrics like F1-Score try to balance this trade-off.
6. Iterate and Refine: Use these metrics to guide further model improvements, feature selection, or even how you define your groups.
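Point 3 above (comparing against a baseline) can be made concrete with a short sketch. This uses scikit-learn's DummyClassifier on synthetic data purely for illustration; the models and data here are assumptions, not part of PANDORA.

```python
# Illustrative sketch: compare a real model against a naive baseline that
# always predicts the majority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# On imbalanced data the baseline's plain accuracy looks deceptively high,
# so compare balanced accuracy instead.
print("Baseline balanced accuracy:", balanced_accuracy_score(y_test, baseline.predict(X_test)))
print("Model balanced accuracy   :", balanced_accuracy_score(y_test, model.predict(X_test)))
```

The majority-class baseline lands at a balanced accuracy of about 0.5, which is the bar any useful model should clearly exceed.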