In this phase of the workflow, you will upload the dataset you downloaded in the intro and inspect it before using it in the analysis.
Perform an initial exploratory data analysis on the flu_fighters.csv dataset, including data upload, inspection of missing values, visualization of variable distributions, and identification of key correlations to guide further analysis.
1. Launch PANDORA (if needed)
Open Docker and start the PANDORA container if it is not already running
The first variable selected will be set as the sorting variable
Examine missing values - The number of NAs per feature is shown when you select your columns. A star next to that number indicates that fewer than 10% of the values are NA for that feature
In this example, baseline CD4+ IFN-γ responses to H1 (h1_v0_cd4_ifng) is set as the sorting variable and compared to the CD4 cytokine fold change variables (h1_cd4_ifng_fold_change, h3_cd4_ifng_fold_change, h1_cd4_il2_fold_change)
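The NA check described above can also be reproduced outside PANDORA. The following is a minimal sketch, assuming missing values are written as `NA` (or left empty) in the CSV. The inline sample is a small invented stand-in for flu_fighters.csv, and the helper name `na_summary` is hypothetical. It counts NAs per column and marks columns where fewer than 10% of values are missing, mirroring the star described above.

```python
import csv
import io

# Invented stand-in for flu_fighters.csv; the values are for illustration only.
SAMPLE_CSV = """h1_v0_cd4_ifng,h1_cd4_ifng_fold_change,h1_cd4_il2_fold_change
0.12,1.8,2.1
0.30,1.1,NA
0.08,NA,1.4
0.21,2.4,0.9
0.15,0.7,NA
0.40,1.3,1.2
0.05,3.0,2.2
0.18,1.6,NA
0.27,0.9,1.0
0.11,2.2,1.7
0.33,1.0,0.8
0.09,1.5,1.9
"""

# Assumption: these tokens mark a missing value in the file.
NA_TOKENS = {"", "NA", "NaN"}


def na_summary(rows, threshold=0.10):
    """Return {column: (na_count, na_fraction, below_threshold)}."""
    rows = list(rows)
    summary = {}
    for column in rows[0].keys():
        na_count = sum(1 for row in rows if row[column].strip() in NA_TOKENS)
        fraction = na_count / len(rows)
        summary[column] = (na_count, fraction, fraction < threshold)
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = na_summary(rows)

for column, (count, fraction, starred) in summary.items():
    marker = "*" if starred else ""
    print(f"{column}: {count}{marker} NA ({fraction:.0%})")
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for flu_fighters.csv.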
Handling Missing Values
Take care when using median imputation for features with more than 10% missing values (NA). In these cases, check the dataset to make sure the missing values are not biased (for example, all high responders missing a selected baseline measurement).
Select the NA Action pairwise.complete.obs from the dropdown
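One simple way to look for this kind of bias is to compare a response variable between subjects whose baseline is missing and those whose baseline was observed. The sketch below uses invented (baseline, fold change) pairs and a hypothetical helper, `split_by_missingness`. It is an illustration of the check, not part of PANDORA.

```python
from statistics import median

# Invented (baseline, fold_change) pairs; None marks a missing baseline value.
records = [
    (0.10, 1.2), (0.15, 1.1), (None, 4.8), (0.20, 1.5),
    (None, 5.2), (0.12, 0.9), (0.18, 1.3), (None, 4.1),
]


def split_by_missingness(records):
    """Split fold-change values by whether the baseline was observed."""
    observed = [fc for base, fc in records if base is not None]
    missing = [fc for base, fc in records if base is None]
    return observed, missing


observed, missing = split_by_missingness(records)
na_fraction = len(missing) / len(records)

print(f"Baseline missing for {na_fraction:.0%} of subjects")
print(f"Median fold change, baseline observed: {median(observed)}")
print(f"Median fold change, baseline missing:  {median(missing)}")

# The value median imputation would fill in for every missing baseline.
imputed_baseline = median(base for base, _ in records if base is not None)
print(f"Median-imputed baseline: {imputed_baseline}")
```

In this toy data the subjects with a missing baseline have much higher fold changes than the rest, so filling in the overall median baseline would likely misrepresent them. A large gap like this is a signal to inspect the feature before imputing.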
b. Select a desired Plot method for visualization
Set Text size to 1
Click the Plot Image button
Observe the correlation plot
See documentation on Correlation for more information about interpreting the plot.
Flu Fighters correlation plot for all fold_change variables
This correlogram visualizes the pairwise correlations between the variables listed on both axes; in this instance, these are various _fold_change immune parameters. The diagonal line of large, dark red circles represents each variable's perfect positive correlation (+1) with itself. For all other pairs, reddish circles indicate a positive correlation (meaning as one variable's fold change increases, the other's also tends to increase), while bluish circles signify a negative correlation (as one increases, the other tends to decrease). The size of each circle and the intensity of its color directly reflect the strength of this relationship, with the exact correlation coefficient values corresponding to the color bar on the right (ranging from -1 for strong negative to +1 for strong positive). Users should look for clusters or blocks of similarly colored and sized circles, as these highlight groups of immune responses whose magnitudes of change are often interlinked or coordinated within the studied cohort; the variables are typically reordered to make such patterns more visually apparent.
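To make the numbers behind the circles concrete, here is a minimal sketch of a Pearson correlation matrix computed with pairwise-complete observations: for each pair of variables, only the rows where both values are present are used. The option name pairwise.complete.obs matches the `use` argument of R's `cor` function, but how PANDORA computes the plot internally is an assumption here. The values and the helper `pearson_pairwise` are invented for illustration.

```python
import math
from itertools import product

# Invented fold-change values; None marks an NA.
data = {
    "h1_cd4_ifng_fold_change": [1.0, 2.0, 3.0, None, 5.0],
    "h3_cd4_ifng_fold_change": [2.0, 4.0, None, 8.0, 10.0],
    "h1_cd4_il2_fold_change": [5.0, 4.0, 3.0, 2.0, None],
}


def pearson_pairwise(xs, ys):
    """Pearson r using only positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant variable
    return sxy / math.sqrt(sxx * syy)


# Full matrix keyed by (row variable, column variable).
matrix = {
    (a, b): pearson_pairwise(data[a], data[b])
    for a, b in product(data, repeat=2)
}

for a in data:
    print(a, [round(matrix[(a, b)], 2) for b in data])
```

Because each pair uses its own subset of rows, different cells of the matrix can rest on different numbers of subjects. Keep this in mind when comparing correlations for features with many NAs.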
You've now uploaded your dataset and completed an initial inspection to understand variable types, distributions, and missing values. These initial steps ensure your data is clean and well understood before deriving any responder features and running predictive models.