Phase 5: Model evaluation

In this phase, we will evaluate model performance with statistical methods and explainable AI techniques.

Assess and compare model performance using statistical metrics such as AUC, and apply explainable AI techniques to understand model predictions. This allows you to identify the most reliable models and extract biologically meaningful insights from them.

1) Select models for evaluation

Step 1. Navigate to the Dashboard and select your predictive analysis from the queue

  • The queue number selected is indicated in the pink box at the top right of the PANDORA interface.


Step 2. Navigate to Predictive -> Exploration

  • Select the dataset


Step 3. Configure model metrics

  1. Select all Response outcomes

  2. Select metrics of interest

    1. For our analysis, we will select the metrics AUC, Accuracy, and Precision


How to select appropriate metrics for our model

To select metrics most appropriate for the model, we need to consider our driving immunological question and dataset balance. This will help identify areas where error minimization is critical and where certain metrics may be misleading. For more information on determining which metrics are best for evaluating your model, see our Model Metrics page.

Let's consider our immunological question and dataset:

  • Immune Question: Can we utilize certain immune parameters measured early after infection to predict whether an individual builds a durable immune response to SARS-CoV-2?

    • Think about the applications of our immune question. Applications could include deciding who to vaccinate or where to place healthcare workers. In these cases, we must be correct when we predict that someone is a durable responder, even if we miss some durable responders. Thus, Precision is key here.

  • Dataset: Even though the dataset is evenly split between high and low responders, the split between sex and disease severity is imbalanced. Therefore, metrics that assume a balanced dataset, such as accuracy, should be interpreted with caution (see the sketch below).
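
As a quick illustration of why accuracy can mislead on imbalanced data while precision remains informative, here is a minimal sketch using scikit-learn metrics. The labels and counts are invented for illustration and are not taken from this tutorial's dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 1 = durable responder, 0 = non-durable (counts invented for illustration)
y_true = np.array([0] * 90 + [1] * 10)

# A trivial classifier that always predicts "non-durable" still reaches 90% accuracy,
# but its precision for the durable class is zero -- it never correctly flags a responder.
y_naive = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_naive))                    # 0.90
print("precision:", precision_score(y_true, y_naive, zero_division=0))  # 0.0
```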


Step 4. Select models for evaluation

  • Select models to evaluate

    • For this example, we will select the top three models: cforest, sparseLDA and dwdRadial to evaluate their performance.

2) Evaluate model performance

Purpose: Evaluate model metrics, which will tell us how well early immune signatures predict durable antibody response.


Step 1. Tabular comparison of metrics

In the exploration space, you will notice that each selected model is part of a table containing metrics. The metrics in this table are the same as those selected in the prior step. You can sort models in this table by metric values if there is a particular metric you care most about.

Each metric tells us something important about the model, and comparing metrics within a model can reveal even more information. The table of metrics selected for our models is shown below:

From the metrics in this table, we can deduce the following for each model:

  • cforest: Accuracy may initially look like this model's strongest metric, but training accuracy is about 20% lower than testing accuracy, pointing to a potential imbalance in the train/test split or to model complexity; accuracy may therefore not be a reliable measure here. With high AUCs in training (0.8639) and testing (1), the model is good at distinguishing between classes (durable vs non-durable). It is also reasonably good at correctly identifying responders, with a precision of 0.8417.

  • sparseLDA: Overall, this model is best at correctly identifying positive responders with a precision of 0.9333. With the highest AUCs (train=0.9056, test=1), this model is also best at distinguishing between durable and non-durable responders.

  • dwdRadial: Although the AUC for this model is comparable to the other models (train=0.8444, test=0.9583), it has the lowest precision (0.7939), indicating that the model distinguishes between responder classes well but struggles more to correctly identify positive (durable) responders.
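
If you want to reproduce this kind of train/test metrics table outside of PANDORA, the sketch below shows one way to do it with scikit-learn. The data and models are placeholders (generic scikit-learn classifiers standing in for cforest, sparseLDA, and dwdRadial), so the numbers it prints are not the tutorial's results:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score

# Synthetic stand-in data: rows = subjects, columns = early immune features
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "forest (stand-in for cforest)": RandomForestClassifier(random_state=0),
    "LDA (stand-in for sparseLDA)": LinearDiscriminantAnalysis(),
    "RBF-SVM (stand-in for dwdRadial)": SVC(probability=True, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    row = {"model": name}
    for split, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        scores = model.predict_proba(Xs)[:, 1]   # class probabilities for AUC
        preds = model.predict(Xs)                # hard labels for accuracy/precision
        row[f"AUC_{split}"] = roc_auc_score(ys, scores)
        row[f"Accuracy_{split}"] = accuracy_score(ys, preds)
        row[f"Precision_{split}"] = precision_score(ys, preds)
    rows.append(row)

# Sort the comparison table by the metric you care most about, as in the Exploration table
print(pd.DataFrame(rows).round(3).sort_values("AUC_test", ascending=False))
```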

For more information on what each metric tells us, please see the documentation section Model Metrics.


Step 2. Evaluate ROC curves

ROC curves provide vital information about model performance with both training and testing datasets.

  1. To view and compare ROC curves for a model, choose your desired model(s) and select ROC Curve Analysis in Exploration

  2. Observe the shape of the ROC curve and the AUC for each classification category (high or low responder) by clicking on the graph to expand it.

    1. If choosing multiple models, select the model name to view its ROC curve

    2. Compare curves of multiple models by selecting the Comparison tab

Ideally, AUC scores equal to 1, or very close to 1, are preferred. Furthermore, you want the testing AUC to be at least as high as the training AUC, as that confirms the model is able to classify accurately on unseen data.
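
PANDORA draws these ROC curves for you; purely to make the underlying calculation concrete, here is a minimal sketch of plotting train and test ROC curves with scikit-learn on synthetic data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay

# Synthetic stand-in data and a simple classifier (not the tutorial's models)
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Overlay the train and test ROC curves on one axis to compare them, as in Exploration
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(clf, X_tr, y_tr, name="train", ax=ax)
RocCurveDisplay.from_estimator(clf, X_te, y_te, name="test", ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line (AUC = 0.5)
ax.set_title("Train vs test ROC (illustrative)")
plt.show()
```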

Below, we can see that the ROC curves for both train and test datasets fit these criteria best for sparseLDA. The ROC curves for this model also match closely with the ROC curve in Figure 7c of the reference paper.


Step 3. Evaluate Training Summary

Along with ROC curves and AUC, there are other important metrics that help determine whether a model is good for answering the specific research question. Many of these metrics are derived from a confusion matrix.
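
To make that connection concrete, here is a small sketch of a confusion matrix and a few metrics derived from it, using scikit-learn and invented predictions (1 = durable responder, 0 = non-durable):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels and predictions, purely for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# With labels ordered [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall: durable responders correctly found
specificity = tn / (tn + fp)  # non-durable responders correctly rejected
precision   = tp / (tp + fp)  # when we call someone durable, how often we are right

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}  precision={precision:.2f}")
```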

  1. To view these metrics, choose two or more models and navigate to the Training summary tab in Exploration

  2. View the box plots to compare various metrics related to the models' performance. The example below compares the measurements for the models cforest, sparseLDA, and dwdRadial. Further details on each of these metrics can be found in our documentation.

  3. Scrolling down, you can view the Performance measurements to determine whether there are significant differences between model metric values.

    1. From this, for our metrics of interest, we see no significant differences between training AUC values, but we do see significant differences between dwdRadial and sparseLDA for accuracy and precision values.

  4. The Model fitting results summary, located to the right of the performance measurements, provides the five-number summary of each model that is visualized in the box plots.
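
PANDORA performs these statistical comparisons for you. As a rough sketch of the general idea, one way to ask whether two models' metric values differ significantly is to compare their scores across repeated cross-validation folds with a paired test; the models and data below are stand-ins, not the tutorial's:

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

# Synthetic stand-in data and two placeholder models
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

# Precision per resample for each model, evaluated on the same folds
lda = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv, scoring="precision")
svm = cross_val_score(SVC(), X, y, cv=cv, scoring="precision")

# Paired t-test across folds; a small p-value suggests the precision difference
# between the two models is unlikely to be due to the resampling alone
t, p = stats.ttest_rel(lda, svm)
print(f"mean precision: LDA={lda.mean():.3f}  SVM={svm.mean():.3f}  p={p:.3f}")
```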

3) Identify key early predictors

We can identify early predictors using PANDORA's Variable Importance feature, which uses explainable AI to assign each predictive feature a score based on how heavily the model relied on that feature in its classification task. A higher score indicates higher importance.
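
PANDORA computes variable importance internally for each model. As an illustration of the general idea, the sketch below uses scikit-learn's permutation importance on synthetic data; the feature names are placeholders, not the immune parameters analyzed in this tutorial:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix; in the real analysis the columns would be the
# early immune parameters (antibody titers, T cell ELISpot counts, ...)
X, y = make_classification(n_samples=120, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time and measure how much the
# model's score drops; a larger drop means the model relied more on that feature
result = permutation_importance(model, X_te, y_te, n_repeats=30, random_state=0)
importance = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importance)
```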

Step 1. Identify Important Variables

  1. Choose your best model and select the Variable Importance tab in Exploration

    1. Depending on your purpose, you can instead select the top few models to explore their variable importance, for example to get a generalized view of the features that multiple models agree on.

    2. The example below looks at variable importance for the models cforest, dwdRadial, and sparseLDA

  2. Select the Variable Importance sub-tab within the main Variable Importance tab

The variable importance plot generated is shown below.

Many of the top features on this plot are the same as those reported in Figure 7c of the reference paper. As in the paper, N-IgG is the most important feature. The features ADCD, psuedoNA Abs, S-IgG, S1 T cells EliSpot, Total pos T cells ELISpot, and S2 T Cell ELISPOT also account for a similar level of importance as in the paper's Figure 7c. A notable difference is that the feature M T cells elispot receives a higher importance score in this model.


Step 2. Assess feature distribution across the dataset

The Features across dataset tab shows the distribution of high and low responders for any selected features. Looking at these plots will tell us how the quantitative values of early predictive features vary between durable and non-durable responders.

  1. From the table, select the most important features identified in Step 1. We will use N-IgG, ADCD, psuedoNA Abs, S-IgG, S1 T cells EliSpot, Total pos T cells ELISpot, M T cells elispot, and S2 T Cell ELISPOT

  2. Select the Features across dataset tab and click the Redraw Plot button

The resulting dot plots for each selected feature are shown below:

Every dot plot shows the distribution of high and low responders for a selected feature. From this, we see the starkest difference between high and low responders for N-IgG, which is elevated in high responders. Generally, feature levels are elevated and distributions are more spread out for high responders compared to low responders.
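
The jittered dot plots PANDORA draws in this tab can be approximated with a few lines of matplotlib. The values below are simulated purely for illustration and are not measurements from the study:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated values for one feature (an N-IgG-like readout) in high vs low responders
df = pd.DataFrame({
    "group": ["high"] * 30 + ["low"] * 30,
    "value": np.concatenate([rng.normal(2.5, 0.8, 30), rng.normal(1.0, 0.4, 30)]),
})

fig, ax = plt.subplots()
for i, (name, sub) in enumerate(df.groupby("group")):
    # add horizontal jitter so overlapping points remain visible
    x = np.full(len(sub), i) + rng.uniform(-0.1, 0.1, len(sub))
    ax.scatter(x, sub["value"], alpha=0.7)
ax.set_xticks([0, 1])
ax.set_xticklabels(["high responders", "low responders"])
ax.set_ylabel("feature value (simulated)")
plt.show()
```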

This concludes the analysis steps for our dataset. The next step is to combine all of the findings to meet the initial objectives of our analysis by describing immune trajectories, reporting the best model, and listing the most important early immune signatures. These objectives were:

  • Visualize the trajectories of diverse immune responses over 6 months after infection by analyzing how trajectories differ based on initial disease severity and the correlations between different immune parameters

  • Predict pre-defined long-term antibody responder status based on early immune signatures
