Phase 1: Data import
In this phase of the workflow, you will upload the dataset downloaded in the intro and inspect it for use in the analysis.
Perform an initial exploratory data analysis on the flu_fighters.csv dataset, including data upload, inspection of missing values, visualization of variable distributions, and identification of key correlations to guide further analysis.
1. Launch PANDORA (if needed)
Open Docker and start the PANDORA container if it is not already running

Access PANDORA:
Open your browser and navigate to http://localhost:3010
2. Inspect data
Navigate to Workspace

Upload the flu_fighters.csv dataset to Workspace
Select the uploaded flu_fighters.csv dataset

With the dataset selected, navigate to Discovery -> Start
Select the Data Overview tab

Select up to 5 variables for inspection
The first variable selected will be set as the sorting variable
Examine missing values: the number of NAs per feature is shown when you select your columns. A star next to that number indicates that fewer than 10% of values are NA for that feature.
In this example, the baseline CD4+ IFN-γ response to H1 (h1_v0_cd4_ifng) is set as the sorting variable and compared to the CD4 cytokine fold change variables (h1_cd4_ifng_fold_change, h3_cd4_ifng_fold_change, h1_cd4_il2_fold_change)

Handling Missing Values
Use caution when applying median imputation to features that contain more than 10% missing values (NA). In these cases, check the dataset to make sure the missing values are not biased (for example, all high responders missing a selected baseline measurement).
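Outside PANDORA, the same missing-value check can be approximated in a few lines of Python. This is a minimal sketch using pandas, not PANDORA's own implementation. The helper name missing_summary, the 10% default threshold argument, and the toy values are assumptions for illustration; the values are not taken from flu_fighters.csv.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame, threshold: float = 0.10) -> pd.DataFrame:
    """Count NAs per column and flag columns whose NA fraction exceeds threshold."""
    na_fraction = df.isna().mean()
    return pd.DataFrame({
        "na_count": df.isna().sum(),
        "na_fraction": na_fraction,
        "above_threshold": na_fraction > threshold,
    })


# Toy values for illustration only; not taken from flu_fighters.csv.
toy = pd.DataFrame({
    "h1_v0_cd4_ifng": [0.1, np.nan, 0.3, 0.2, np.nan, 0.5, 0.4, 0.2, 0.1, 0.3],
    "h1_cd4_ifng_fold_change": [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1, 1.0, 0.9],
})

summary = missing_summary(toy)
print(summary)

# Compare the outcome between rows with and without a baseline measurement.
# A large gap suggests the values may not be missing at random.
baseline_status = toy["h1_v0_cd4_ifng"].isna().map({True: "missing", False: "present"})
by_missing = toy.groupby(baseline_status)["h1_cd4_ifng_fold_change"].median()
print(by_missing)
```

Columns flagged in above_threshold are the ones where median imputation deserves a closer look before it is applied.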
Click the Plot Image button for the selected data
Examine the Distribution Plot
This plot provides information about skewness, potential outliers, and correlations between variables.
Based on the distribution plot generated in our example below, we see:
The distribution of every selected feature is right-skewed, as shown in the figures along the diagonal.
There is a significant correlation, as shown in the red boxes, between:
h1_v0_cd4_ifng & h1_cd4_ifng_fold_change
h1_cd4_ifng_fold_change & h1_cd4_il2_fold_change
h1_cd4_ifng_fold_change & h3_cd4_ifng_fold_change
h1_cd4_il2_fold_change & h3_cd4_ifng_fold_change
There are significant outliers in some of the correlation plots, as shown by the red circles.
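The visual reading of skewness and outliers can be cross-checked numerically. The sketch below assumes pandas and uses the common 1.5 × IQR rule to flag outliers. That rule is a swapped-in convention for illustration, not necessarily how PANDORA highlights points. The function name skew_and_outliers, the column names, and the toy values are hypothetical.

```python
import pandas as pd


def skew_and_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Return sample skewness and a 1.5*IQR outlier count for each column."""
    rows = {}
    for col in df.columns:
        values = df[col].dropna()
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        rows[col] = {
            "skewness": values.skew(),  # > 0 means a longer right tail
            "n_outliers": int(is_outlier.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy values for illustration only; not taken from flu_fighters.csv.
toy = pd.DataFrame({
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0],
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

result = skew_and_outliers(toy)
print(result)
```

A positive skewness value corresponds to the right-skewed shape seen along the diagonal of the distribution plot.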

Select the Table Plot tab and examine the table plot
This plot can be used to understand columns (predictors vs. outcomes), data types, and unique value counts.
The leftmost variable is the sorting variable, arranging all rows from its largest to smallest values.
Based on the table plot generated in our example below, we see:
No apparent correlation (positive or negative) between the fold change variables and decreasing baseline cytokine levels.
Every variable is continuous, and the log-transformed values of each variable tend to range between -0.5 and 1.
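The idea behind a table plot (sort all rows by one variable, then compare how the other columns behave along that ordering) can be sketched with pandas. This is an assumption about the general technique, not a description of PANDORA's internals. The function name binned_table, the bin count, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def binned_table(df: pd.DataFrame, sort_by: str, n_bins: int = 5) -> pd.DataFrame:
    """Sort rows by sort_by (largest first), split them into equal row bins,
    and return the mean of every column within each bin."""
    ordered = df.sort_values(sort_by, ascending=False, na_position="last")
    ordered = ordered.reset_index(drop=True)
    # Assign consecutive rows to bins 0..n_bins-1 of (nearly) equal size.
    bin_id = np.arange(len(ordered)) * n_bins // len(ordered)
    return ordered.groupby(bin_id).mean()


# Toy values for illustration only; not taken from flu_fighters.csv.
toy = pd.DataFrame({
    "baseline": [3.0, 1.0, 10.0, 7.0, 5.0, 2.0, 9.0, 4.0, 8.0, 6.0],
    "fold_change": [0.4, 0.9, 0.2, 0.7, 0.1, 0.8, 0.5, 0.3, 0.6, 0.0],
})

binned = binned_table(toy, sort_by="baseline")
print(binned)
```

If the bin means of another column rise or fall steadily from the first bin to the last, that column tracks the sorting variable; a flat or irregular pattern matches the "no apparent correlation" reading above.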

Repeat this process for all key baseline and outcome features of interest.
3. Explore Outcome Variable Relationships (Optional)
Navigate to Discovery -> Correlation

Expand Column Selection
Select all outcome columns (fold_change)

Choose Correlation Method: Spearman
Expand Preprocessing
Remove the medianimpute preprocessing option

Expand Correlation Settings
Select the NA Action pairwise.complete.obs from the dropdown

Select a desired Plot method for visualization

Set Text size to 1
Click the Plot Image button
Observe the correlation plot
See documentation on Correlation for more information about interpreting the plot.

This correlogram visualizes the pairwise correlations between the variables listed on both axes; in this instance, these are various _fold_change immune parameters. The diagonal line of large, dark red circles represents each variable's perfect positive correlation (+1) with itself.
For all other pairs, reddish circles indicate a positive correlation (as one variable's fold change increases, the other's also tends to increase), while bluish circles signify a negative correlation (as one increases, the other tends to decrease). The size of each circle and the intensity of its color directly reflect the strength of this relationship, with the exact correlation coefficient values corresponding to the color bar on the right (ranging from -1 for strong negative to +1 for strong positive).
Look for clusters or blocks of similarly colored and sized circles, as these highlight groups of immune responses whose magnitudes of change are often interlinked or coordinated within the studied cohort; the variables are typically reordered to make such patterns more visually apparent.
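The settings chosen above (Spearman correlation, no imputation, pairwise-complete observations) can be reproduced approximately with pandas, whose DataFrame.corr drops NA values pair by pair. This is a rough sketch under that assumption rather than PANDORA's own code. The column names come from the example above, but the values are toy numbers, not data from flu_fighters.csv.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not taken from flu_fighters.csv.
toy = pd.DataFrame({
    "h1_v0_cd4_ifng": [0.2, 0.1, 0.4, 0.3, 0.5, 0.6],
    "h1_cd4_ifng_fold_change": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "h3_cd4_ifng_fold_change": [2.0, 4.0, 6.0, 8.0, np.nan, 12.0],
    "h1_cd4_il2_fold_change": [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
})

# Keep only the outcome columns, mirroring the Column Selection step.
outcomes = toy.filter(like="fold_change")

# Spearman rank correlation; each pair uses only rows where both values
# are present, so no imputation is needed.
spearman = outcomes.corr(method="spearman")
print(spearman.round(2))
```

In this toy table each pair of columns is perfectly monotone on the rows they share, so every coefficient is +1 or -1. Real data will produce intermediate values, which correspond to the smaller, paler circles in the correlogram.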
You've now uploaded your dataset and completed an initial inspection to understand variable types, distributions, and missing values. These initial steps ensure your data is clean and well understood before deriving any responder features and running predictive models.