# Phase 1: Data import

Perform an initial exploratory data analysis on the `flu_fighters.csv` dataset, including data upload, inspection of missing values, visualization of variable distributions, and identification of key correlations to guide further analysis.

<details>

<summary>1. Launch PANDORA (if needed)</summary>

1. Open Docker and start the PANDORA container if it is not already running

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FrruNH135y4XvajwHxF1H%2FFF_Phase1_Launch%20Docker_annotated.png?alt=media&#x26;token=b61a8511-6756-415d-8083-b4ccb12ae527" alt=""><figcaption></figcaption></figure>

2. Access PANDORA:
   1. Open your browser and navigate to <http://localhost:3010>

</details>

<details>

<summary>2. Inspect data</summary>

1. Navigate to [**Workspace**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/general/workspace)

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FldiHrJ4cyYrvQ9Ytsfbp%2FFF_Phase1_Workspace_annotated.png?alt=media&#x26;token=3ea004da-6c8a-4112-b928-3b117866020c" alt=""><figcaption></figcaption></figure>

2. Upload the `flu_fighters.csv` dataset to [**Workspace**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/general/workspace)
3. Select the uploaded `flu_fighters.csv` dataset

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2F6LOikukRxIez1Uc9ytoI%2FFF_Phase1_Workspace_Select%20Dataset_annotated.png?alt=media&#x26;token=5f8c3033-5e06-4347-9356-a6f331d29e86" alt=""><figcaption></figcaption></figure>

4. With the dataset selected, navigate to [**Discovery** -> **Start**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery)
   1. Select the [**Data Overview**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery/data-overview) tab

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FBtInKrvXEzikriKzKmlK%2FFF_Phase1_Discovery_Data%20Overview_annotated.png?alt=media&#x26;token=c8fe1c83-f1a3-4512-8557-77b15fbc1338" alt=""><figcaption></figcaption></figure>

5. Select up to 5 variables for inspection
   1. The first variable selected will be set as the sorting variable
   2. Examine missing values: the number of NAs per feature is shown when you select your columns. A star next to that number indicates that <10% of values are NA for that feature.
   3. In this example, the baseline CD4+ IFN-γ response to H1 (`h1_v0_cd4_ifng`) is set as the sorting variable and compared to the CD4 cytokine fold change variables (`h1_cd4_ifng_fold_change`, `h3_cd4_ifng_fold_change`, `h1_cd4_il2_fold_change`)

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FD1tiGGkOXLQApkoLaNDH%2FFF_Phase1_Data%20Discovery_Column%20Selection.png?alt=media&#x26;token=5baf656c-a8ec-4c73-8cb6-fdc9b49b8b5a" alt="" width="375"><figcaption></figcaption></figure>

{% hint style="warning" %}

### Handling Missing Values

Take care when using median imputation for features with more than 10% missing values (NA). In these cases, check the dataset to make sure the missing values are not biased (e.g., all high responders are missing a selected baseline measurement).
{% endhint %}

6. Click **Plot Image** to plot the selected data
7. Examine the [**Distribution Plot**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery/data-overview#distribution-plot)
   1. This plot provides information about skewness, potential outliers, and correlations between variables.
   2. Based on the distribution plot generated in our example below, we see:
      1. The distribution plot for every selected feature is right-skewed, as shown in the figures along the diagonal.
      2. There is a significant correlation, as shown in the red boxes, between:
         1. `h1_v0_cd4_ifng` & `h1_cd4_ifng_fold_change`
         2. `h1_cd4_ifng_fold_change` & `h1_cd4_il2_fold_change`
         3. `h1_cd4_ifng_fold_change` & `h3_cd4_ifng_fold_change`
         4. `h1_cd4_il2_fold_change` & `h3_cd4_ifng_fold_change`
      3. There are significant outliers in some of the correlation plots, as shown by the red circles.

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2Fz8ss740QYTdoTtmOfncS%2FFF_Phase1_Distribution%20Plot_annotated.png?alt=media&#x26;token=18b9c006-d1dd-4df3-a1f2-361ad850c811" alt=""><figcaption></figcaption></figure>

8. Select the [**Table Plot**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery/data-overview#table-plot) tab and examine the table plot
   1. This plot can be used to understand columns (predictors vs. outcomes), data types, and unique value counts.
   2. The leftmost variable is the sorting variable, arranging all rows from its largest to smallest values.
   3. Based on the table plot generated in our example below, we see:
      1. No apparent correlation (positive or negative) between the fold change variables and decreasing baseline cytokine levels.
      2. Every variable is continuous, and the log-transformed values of each tend to range between -0.5 and 1.

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FXzWiz85xKI64m55SYBGQ%2FFF_Phase1_Table%20Plot.png?alt=media&#x26;token=771ff46e-84ea-4287-86dd-0953db290310" alt=""><figcaption></figcaption></figure>

**Repeat this process for all key baseline and outcome features of interest.**
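The missing-value checks described above can also be reproduced outside PANDORA. The following is a minimal pandas sketch, not PANDORA's own implementation: the helper names `summarize_missing` and `compare_by_missingness` are hypothetical, and the toy values stand in for the real `flu_fighters.csv` data. It counts NAs per column, flags columns above the 10% threshold, and compares an outcome between rows where a baseline feature is missing and rows where it is present.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame, threshold: float = 0.10) -> pd.DataFrame:
    """Count NAs per column and flag columns whose NA fraction exceeds the threshold."""
    n_missing = df.isna().sum()
    frac_missing = n_missing / len(df)
    return pd.DataFrame(
        {
            "n_missing": n_missing,
            "frac_missing": frac_missing,
            "above_threshold": frac_missing > threshold,
        }
    )


def compare_by_missingness(df: pd.DataFrame, feature: str, outcome: str) -> pd.Series:
    """Median of `outcome` for rows where `feature` is missing vs. present."""
    group = df[feature].isna().map({True: "missing", False: "present"})
    return df.groupby(group)[outcome].median()


# Hypothetical toy values; with the real file you would start from
# df = pd.read_csv("flu_fighters.csv")
df = pd.DataFrame(
    {
        "h1_v0_cd4_ifng": [0.10, np.nan, 0.30, np.nan, 0.20, 0.40, 0.10, 0.30, 0.20, 0.50],
        "h1_cd4_ifng_fold_change": [1.0, 4.0, 1.2, 5.0, 0.9, 1.1, np.nan, 1.3, 1.0, 0.8],
    }
)

missing_summary = summarize_missing(df)
print(missing_summary)
print(compare_by_missingness(df, "h1_v0_cd4_ifng", "h1_cd4_ifng_fold_change"))
```

In this toy data the baseline column is missing in 2 of 10 rows, so it is flagged, and the rows with a missing baseline have a clearly higher median fold change. That is the kind of pattern that would make median imputation risky.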

</details>
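The skewness and outlier observations from the distribution plot can be cross-checked numerically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the values are toy data, and Tukey's 1.5 × IQR rule is a common convention chosen here for the example, not necessarily how PANDORA identifies outliers.

```python
import pandas as pd


def skewness_report(df: pd.DataFrame) -> pd.Series:
    """Sample skewness per numeric column; positive values indicate right skew."""
    return df.skew(numeric_only=True)


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy values for one fold change variable
df = pd.DataFrame(
    {"h1_cd4_ifng_fold_change": [1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 0.8, 1.1, 9.0]}
)

skew = skewness_report(df)
outlier_mask = iqr_outliers(df["h1_cd4_ifng_fold_change"])

print(skew)
print(df.loc[outlier_mask, "h1_cd4_ifng_fold_change"])
```

A positive skewness value matches the right-skewed histograms seen along the diagonal of the distribution plot, and the flagged rows are candidates for closer inspection rather than automatic removal.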

<details>

<summary>3. Explore Outcome Variable Relationships (Optional)</summary>

1. Navigate to [**Discovery** -> **Correlation**](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery#correlation)

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FAz8IVFDE0R8x6ZtZEqKF%2FFF_Phase1_Dicsovery_Correlation_annotated.png?alt=media&#x26;token=126047e9-3fb7-448d-a3c2-71cb814bf67d" alt=""><figcaption></figcaption></figure>

2. Expand **Column Selection**

   1. Select all outcome columns (`fold_change`)

   ![](https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FoXLpxA6TgK9iLOjjHlrx%2FFF_Phase1_Correlation_Column%20Selection.png?alt=media\&token=13e52fbd-2b4a-4f67-9047-328d60cf05a8)

   2. Choose **Correlation Method** `Spearman`

   ![](https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FyyXcOho7JHJjHp2aKuaX%2FFF_Phase1_Correlation_Correlation%20Method.png?alt=media\&token=c62656e6-4835-40f4-9a28-12fbf578bbb5)
3. Expand **Preprocessing**

   1. Remove the `medianimpute` step

   ![](https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2F6h0SVY5TRwmI2wV1chcN%2FFF_Phase1_Correlation_Remove%20medianimpute.png?alt=media\&token=1e22b3cb-09f6-4e49-a513-52e9a6c25fbc)
4. Expand **Correlation Settings**

   1. Select **NA Action** `pairwise.complete.obs` from the dropdown

   ![](https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FO8Yegi6mdgX0nmpvwIUG%2FFF_Phase1_Correlation_NA%20Action.png?alt=media\&token=71e12438-9cc2-4121-9b0d-a793f57a6e6a)

   2. Select a desired **Plot method** for visualization

   ![](https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FFKW7agQEV2cy0qwPMDQe%2FFF_Phase1_Correlation_Plot%20Method.png?alt=media\&token=4a81fe87-51bf-48a1-ab1f-f1f92f8b4cf3)
5. Set **Text size** to 1
6. Click the **Plot Image** button
7. Observe the correlation plot
   1. See documentation on [Correlation](https://app.gitbook.com/s/9LdC62ZpkxqvCBTPwVZU/data-analysis/discovery/correlation) for more information about interpreting the plot.

<figure><img src="https://1845146574-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZMrkCA3Bqd62Gp0kAk79%2Fuploads%2FZxrvEQoiddWlUilfysLb%2FFF_Phase1_Correlation_Correlation%20Plot.png?alt=media&#x26;token=2f5f907c-ca4e-4b57-b093-cc3129c19bc2" alt=""><figcaption><p>Flu Fighters correlation plot for all fold_change variables</p></figcaption></figure>

This correlogram visualizes the pairwise correlations between the variables listed on both axes; in this instance, these are the various `fold_change` immune parameters. The diagonal line of large, dark red circles represents each variable's perfect positive correlation (+1) with itself.

For all other pairs, **reddish circles indicate a positive correlation** (as one variable's fold change increases, the other's also tends to increase), while **bluish circles signify a negative correlation** (as one increases, the other tends to decrease). The **size of each circle and the intensity of its color directly reflect the strength** of this relationship, with the exact correlation coefficient values corresponding to the color bar on the right (ranging from -1 for strong negative to +1 for strong positive).

Look for **clusters or blocks of similarly colored and sized circles**, as these highlight groups of immune responses whose magnitudes of change are often interlinked or coordinated within the studied cohort; the variables are typically reordered to make such patterns more visually apparent.
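The settings used above (Spearman method, no median imputation, `pairwise.complete.obs`) have a close analogue in pandas. The following is a rough sketch, not PANDORA's own code: the function name `spearman_fold_change` and the toy values are hypothetical. `DataFrame.corr` excludes NA values pair by pair, which plays the same role as `pairwise.complete.obs` here.

```python
import numpy as np
import pandas as pd


def spearman_fold_change(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation matrix over columns whose name contains 'fold_change'."""
    outcomes = df.filter(like="fold_change")
    # DataFrame.corr drops NAs separately for each pair of columns,
    # so no imputation step is needed before computing the matrix.
    return outcomes.corr(method="spearman")


# Hypothetical toy values; with the real file you would start from
# df = pd.read_csv("flu_fighters.csv")
df = pd.DataFrame(
    {
        "h1_v0_cd4_ifng": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "h1_cd4_ifng_fold_change": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
        "h3_cd4_ifng_fold_change": [1.5, 2.5, 4.0, 8.0, 16.0, np.nan],
        "h1_cd4_il2_fold_change": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    }
)

corr = spearman_fold_change(df)
print(corr.round(2))
```

Because Spearman works on ranks, the non-linear but monotonic relationship between the two IFN-γ columns in this toy data still yields a strong positive coefficient, while the IL-2 column, which decreases as the others increase, correlates negatively with both.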

</details>

You've now uploaded your dataset and completed an initial inspection to understand variable types, distributions, and missing values. These initial steps ensure your data is clean and well understood before deriving any responder features and running predictive models.
