Phase 1: Data import

In this phase of the workflow, you will upload the dataset downloaded in the intro and inspect it before using it in the analysis.

Perform an initial exploratory data analysis on the flu_fighters.csv dataset, including data upload, inspection of missing values, visualization of variable distributions, and identification of key correlations to guide further analysis.

1. Launch PANDORA (if needed)

Open Docker and start the PANDORA container if it is not already running

Access PANDORA:
1. Open your browser and navigate to http://localhost:3010

2. Inspect data

Navigate to Workspace

Upload the flu_fighters.csv dataset to Workspace

Select the uploaded flu_fighters.csv dataset

With the dataset selected, navigate to Discovery -> Start
1. Select the Data Overview tab

Select up to 5 variables for inspection
1. The first variable selected will be set as the sorting variable
2. Examine missing values: the number of NAs per feature is shown when you select your columns. A star next to that number indicates that <10% of values are NA for that feature.
3. In this example, baseline CD4+ IFN-γ responses to H1 (h1_v0_cd4_ifng) is set as the sorting variable and compared to the CD4 cytokine fold change variables (h1_cd4_ifng_fold_change, h3_cd4_ifng_fold_change, h1_cd4_il2_fold_change)

Handling Missing Values

Take caution when using median imputation for features with more than 10% missing values (NA). In these cases, check the dataset to make sure the missing values are not biased (e.g., all high responders are missing a selected baseline measurement).
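The same check can be cross-checked outside PANDORA. The following is a minimal sketch using pandas, not part of PANDORA itself. The small data frame is a made-up stand-in for flu_fighters.csv (only the column names come from this workflow), and the 10% threshold mirrors the guidance above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for flu_fighters.csv; the values are invented for illustration.
df = pd.DataFrame({
    "h1_v0_cd4_ifng": [0.10, np.nan, 0.30, 0.05, np.nan, 0.20, 0.40, 0.15, 0.25, 0.35],
    "h1_cd4_ifng_fold_change": [1.2, 0.8, 2.5, 1.0, 3.1, 0.9, 1.4, 2.2, 1.1, 0.7],
})

# Fraction of missing values per column.
na_fraction = df.isna().mean()

# Columns above the 10% threshold, where median imputation deserves a closer look.
flagged = na_fraction[na_fraction > 0.10].index.tolist()
print(na_fraction)
print("Check before imputing:", flagged)

# Simple bias check: compare an outcome between rows with and without
# the baseline measurement. A large gap hints that values are not missing at random.
baseline_status = df["h1_v0_cd4_ifng"].isna().map({True: "missing", False: "present"})
outcome_by_missingness = df.groupby(baseline_status)["h1_cd4_ifng_fold_change"].median()
print(outcome_by_missingness)
```

With real data, replace the toy frame with pd.read_csv("flu_fighters.csv") and repeat the comparison for each outcome of interest.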

Click Plot Image for the selected data

Examine the Distribution Plot
1. This plot provides information about skewness, potential outliers, and correlations between variables.
2. Based on the distribution plot generated in our example below, we see:
  1. The distribution plot for every selected feature is right-skewed, as shown in the figures along the diagonal.
  2. There is a significant correlation, as shown in the red boxes, between:
    h1_v0_cd4_ifng & h1_cd4_ifng_fold_change
    h1_cd4_ifng_fold_change & h1_cd4_il2_fold_change
    h1_cd4_ifng_fold_change & h3_cd4_ifng_fold_change
    h1_cd4_il2_fold_change & h3_cd4_ifng_fold_change
  3. There are significant outliers in some of the correlation plots, as shown by the red circles.
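The three observations above (skewness, correlation, outliers) can also be quantified in a few lines. This is a rough sketch with pandas on synthetic log-normal data, not the real dataset. Spearman correlation and the 1.5 x IQR outlier rule are assumptions chosen for illustration; PANDORA's distribution plot may use different methods.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed data; only the column names come from this workflow.
base = rng.lognormal(mean=0.0, sigma=1.0, size=200)
df = pd.DataFrame({
    "h1_cd4_ifng_fold_change": base,
    "h3_cd4_ifng_fold_change": base * rng.lognormal(mean=0.0, sigma=0.3, size=200),
})

# Positive skewness means a long right tail (right-skewed).
skewness = df.skew()

# Rank-based correlation is less sensitive to skew and outliers than Pearson.
corr = df.corr(method="spearman")

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return values outside 1.5 x IQR of the quartiles (a common rule of thumb)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

outliers = iqr_outliers(df["h1_cd4_ifng_fold_change"])

print(skewness)
print(corr)
print("Potential outliers:", len(outliers))
```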

Select the Table Plot tab and examine the table plot
1. This plot can be used to understand columns (predictors vs. outcomes), data types, and unique value counts.
2. The leftmost variable is the sorting variable, arranging all rows from its largest to smallest values.
3. Based on the table plot generated in our example below, we see:
  1. No apparent correlation (positive or negative) between the fold change variables and decreasing baseline cytokine levels.
  2. Every variable is continuous, and the log-transformed values tend to range between -0.5 and 1.
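The information shown in the table plot (rows ordered by the sorting variable, data types, and unique value counts) can be approximated with pandas. This is an illustrative sketch on invented values, not a reproduction of how PANDORA builds the plot.

```python
import pandas as pd

# Toy values; only the column names come from this workflow.
df = pd.DataFrame({
    "h1_v0_cd4_ifng": [0.30, 0.05, 0.20, 0.10],
    "h1_cd4_ifng_fold_change": [1.1, 2.4, 0.9, 1.6],
    "h1_cd4_il2_fold_change": [0.8, 1.9, 1.2, 1.2],
})

# Order all rows by the sorting variable, largest to smallest.
sorted_df = df.sort_values("h1_v0_cd4_ifng", ascending=False).reset_index(drop=True)

# Per-column data type and number of distinct values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
})

print(sorted_df)
print(summary)
```

Scanning the other columns of sorted_df from top to bottom gives the same kind of visual trend check as the table plot.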

Repeat this process for all key baseline and outcome features of interest.

3. Explore Outcome Variable Relationships (Optional)

Navigate to Discovery -> Correlation

Expand Column Selection
1. Select all outcome columns (fold_change)
2. Set Correlation Method to Spearman

Expand Preprocessing
1. Remove the medianimpute option

Expand Correlation Settings
1. Select pairwise.complete.obs from the NA Action dropdown
2. Select a desired Plot method for visualization

Set Text size to 1

Click the Plot Image button

Observe the correlation plot
1. See documentation on Correlation for more information about interpreting the plot.

This correlogram visualizes the pairwise correlations between the variables listed on both axes; in this instance, these are various _fold_change immune parameters. The diagonal line of large, dark red circles represents each variable's perfect positive correlation (+1) with itself. For all other pairs, reddish circles indicate a positive correlation (meaning as one variable's fold change increases, the other's also tends to increase), while bluish circles signify a negative correlation (as one increases, the other tends to decrease). The size of each circle and the intensity of its color directly reflect the strength of this relationship, with the exact correlation coefficient values corresponding to the color bar on the right (ranging from -1 for strong negative to +1 for strong positive). Users should look for clusters or blocks of similarly colored and sized circles, as these highlight groups of immune responses whose magnitudes of change are often interlinked or coordinated within the studied cohort; the variables are typically reordered to make such patterns more visually apparent.
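For readers who want the numbers behind the correlogram, the settings used above (Spearman, no imputation, pairwise complete observations) have a close analogue in pandas, whose DataFrame.corr excludes missing values pair by pair. The sketch below uses made-up values and is only an approximation of what PANDORA computes internally.

```python
import numpy as np
import pandas as pd

# Toy outcome columns with scattered NAs; the values are invented for illustration.
df = pd.DataFrame({
    "h1_cd4_ifng_fold_change": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "h3_cd4_ifng_fold_change": [1.5, 2.5, np.nan, 4.5, 5.5, 7.0],
    "h1_cd4_il2_fold_change": [6.0, 5.0, 4.0, np.nan, 2.0, 1.0],
})

# Select all outcome columns, mirroring the fold_change selection in the UI.
outcome_cols = [c for c in df.columns if c.endswith("fold_change")]

# Spearman correlation; each pair of columns uses only the rows where both
# values are present, so no imputation is needed.
rho = df[outcome_cols].corr(method="spearman")
print(rho.round(2))
```

Because each coefficient may be based on a different subset of rows, pairs with many missing values rest on fewer observations and should be read with more caution.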

You've now uploaded your dataset and completed an initial inspection to understand variable types, distributions, and missing values. These initial steps ensure your data is clean and well understood before deriving any responder features and running predictive models.
